n_jobs > 1 in sklearn causes model fit to hang indefinitely, but only in Rstudio using reticulate #517

Open ryankarel opened 5 years ago

ryankarel commented 5 years ago

Here's the code I'm using:

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
import time

X = np.random.randn(1000 * 20).reshape(1000,20)
y = np.random.randn(1000)

# not specifying n_jobs
rf = RandomForestRegressor()

n_estimators = [5,10]
max_features = ['sqrt']

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features}

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=2, cv=3)

tic = time.perf_counter()

rf_random.fit(X, y)

toc = time.perf_counter()

"Time required to fit RF: {0:.1f} s".format(toc - tic)
# 0.2 seconds

# specifying n_jobs == 2
rf = RandomForestRegressor()

n_estimators = [5,10]
max_features = ['sqrt']

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features}

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid,
                               n_iter=2, cv=3, n_jobs=2, pre_dispatch=2)

tic = time.perf_counter()

rf_random.fit(X, y)

toc = time.perf_counter()

"Time required to fit RF: {0:.1f} s".format(toc - tic)
# never finishes in Rstudio using reticulate,
# but finishes in 2.0 seconds using Spyder

When I try to knit an RMD file with the above python code chunk, a terminal window appears with the text:

WARNING: unknown option '-c'
WARNING: unknown option '--multiprocessing-fork'

Here's my (abbreviated) py_config() output:

python:         C:\Users\nkarel\AppData\Local\CONTIN~1\ANACON~2\python.exe
libpython:      C:/Users/nkarel/AppData/Local/CONTIN~1/ANACON~2/python37.dll
pythonhome:     C:\Users\nkarel\AppData\Local\CONTIN~1\ANACON~2
version:        3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:\Users\nkarel\AppData\Local\CONTIN~1\ANACON~2\lib\site-packages\numpy
numpy_version:  1.15.4

As mentioned above, running the python script in a Python IDE works perfectly, but running it using the reticulate package with Rstudio results in an indefinitely hanging process. Is reticulate not meant to run multiple jobs, or is this a bug?

ryankarel commented 5 years ago

I should also mention that I'm using Windows 10.

ercbk commented 4 years ago

In issue 134 @jjallaire thinks it may be a problem with multiprocessing and embedded python interpreters, but that was back in 2017. Really surprised this problem isn't mentioned anywhere in the documentation or this issue hasn't received a response. Kind of a big deal.

kevinushey commented 4 years ago

For this particular issue, the problem is that sys$executable and sys$_base_executable do not point at the expected things for joblib:

sys <- import("sys")
sys$executable
## [1] "C:\\R\\R-36~1.1\\bin\\x64\\Rterm.exe"
sys$`_base_executable`
## [1] "C:\\R\\R-36~1.1\\bin\\x64\\Rterm.exe"

I think we need to set these to the path to the Python interpreter for multiprocessing modules / joblib to work as expected.
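
A possible reading of the two warnings in the original report, offered as a sketch rather than a confirmed diagnosis: when multiprocessing launches a worker it builds a command line that starts with the configured executable, followed by `-c` and `--multiprocessing-fork`. If that executable is Rterm.exe rather than Python, R receives options it does not recognise. The snippet below prints that command line using `multiprocessing.spawn.get_command_line`, an undocumented standard-library helper whose behaviour may differ between Python versions. The `pipe_handle=0` argument is only a placeholder, and joblib's loky backend has its own launcher, so its exact arguments may differ.

```python
import sys
from multiprocessing import spawn


def describe_spawn_command():
    """Return the command line multiprocessing would use to start a worker."""
    # pipe_handle=0 is a placeholder value; no process is started here.
    return spawn.get_command_line(pipe_handle=0)


cmd = describe_spawn_command()
print("sys.executable:  ", sys.executable)
print("spawn executable:", spawn.get_executable())
print("worker command:  ", cmd)
```

Run inside a reticulate session on Windows, the first element of the printed command would be expected to be the R binary shown above, while a standalone Python session reports the Python interpreter.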

skeydan commented 4 years ago

(Aside question: is this expected to be a Windows-only problem? For me the above code works fine under Linux. But for me, sys$executable points to Python, not R...)

ercbk commented 4 years ago

I think I'm encountering the same issue, but I'm just using an R script and not trying to source and knit an Rmd. So, I'll add my error report to the mix in case it helps. Btw, if there's a hack to get around this in the meantime, I'd appreciate it. Thank you

pacman::p_load(dials, reticulate)

sk_e <- import("sklearn.ensemble")
sk_ms <- import("sklearn.model_selection")

sim_data <- function(n) {
      tmp <- mlbench::mlbench.friedman1(n, sd=1)
      tmp <- cbind(tmp$x, tmp$y)
      # right-hand side missing in the post; as.data.frame() is assumed here
      tmp <- as.data.frame(tmp)
      names(tmp)[ncol(tmp)] <- "y"
      tmp
}

dat <- sim_data(10000)

pdat = r_to_py(dat)

y = pdat$pop('y')$values
X <- pdat

rf_est <- sk_e$RandomForestRegressor(criterion = "mae", random_state = 1L)
# rf_est <- sk_e$RandomForestRegressor(criterion = "mae", n_jobs = -1L, random_state = 1L)

rf_params <- r_to_py(dials::grid_latin_hypercube(
      mtry(range = c(3, 4)),
      trees(range = c(200, 300)),
      size = 40
))

max_features <- rf_params$pop('mtry')$values
n_estimators <- rf_params$pop('trees')$values
rf_grid <- py_dict(list('max_features', 'n_estimators'), list(max_features, n_estimators))

cv <- sk_ms$RepeatedKFold(n_splits = 2L,
                          n_repeats = 2L,
                          random_state = 1L)

mod_select <- sk_ms$GridSearchCV(estimator = rf_est,
                                 param_grid = rf_grid,
                                 scoring = 'neg_mean_absolute_error',
                                 cv = cv,
                                 n_jobs = -1L,
                                 refit = TRUE)

results <- mod_select$fit(X, y)
#> Error in py_call_impl(callable, dots$args, dots$keywords): OSError: [Errno 22] Invalid argument
#> Detailed traceback: 
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\sklearn\model_selection\", line 710, in fit
#>     self._run_search(evaluate_candidates)
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\sklearn\model_selection\", line 1151, in _run_search
#>     evaluate_candidates(ParameterGrid(self.param_grid))
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\sklearn\model_selection\", line 689, in evaluate_candidates
#>     cv.split(X, y, groups)))
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\", line 1004, in __call__
#>     if self.dispatch_one_batch(iterator):
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\", line 835, in dispatch_one_batch
#>     self._dispatch(tasks)
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\", line 754, in _dispatch
#>     job = self._backend.apply_async(batch, callback=cb)
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\", line 551, in apply_async
#>     future = self._workers.submit(SafeFunction(func))
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\", line 160, in submit
#>     fn, *args, **kwargs)
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\", line 1047, in submit
#>     self._ensure_executor_running()
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\", line 1021, in _ensure_executor_running
#>     self._adjust_process_count()
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\", line 1012, in _adjust_process_count
#>     p.start()
#>   File "C:\Users\tbats\Miniconda3\lib\multiprocessing\", line 112, in start
#>     self._popen = self._Popen(self)
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\backend\", line 39, in _Popen
#>     return Popen(process_obj)
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\backend\", line 55, in __init__
#>     process_obj._name, getattr(process_obj, "init_main_module", True))
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\backend\", line 86, in get_preparation_data
#>     _resource_tracker.ensure_running()
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\backend\", line 83, in ensure_running
#>     if self._check_alive():
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\backend\", line 163, in _check_alive
#>     self._send('PROBE', '', '')
#>   File "C:\Users\tbats\Miniconda3\lib\site-packages\joblib\externals\loky\backend\", line 185, in _send
#>     nbytes = os.write(self._fd, msg)

Created on 2020-01-30 by the reprex package (v0.3.0)

current session info

python:         C:/Users/tbats/Miniconda3/python.exe
libpython:      C:/Users/tbats/Miniconda3/python37.dll
pythonhome:     C:/Users/tbats/Miniconda3
version:        3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Users/tbats/Miniconda3/Lib/site-packages/numpy
numpy_version:  1.18.1
sklearn:        C:\Users\tbats\MINICO~1\lib\site-packages\sklearn\__init__.p

kevinushey commented 4 years ago

What worked for me was to re-assign those values in the sys and multiprocessing modules. That is, you can try running this at the top of your script (before importing any other modules):


# update executable path in sys module
sys <- import("sys")
exe <- file.path(sys$exec_prefix, "pythonw.exe")
sys$executable <- exe
sys$`_base_executable` <- exe

# update executable path in multiprocessing module
multiprocessing <- import("multiprocessing")
multiprocessing$set_executable(exe)

ercbk commented 4 years ago

Worked! Thanks again. One more question - will the pythonw.exe instances eventually end on their own or should I terminate them? I had to end the first session I tried this, and those pythonw.exe instances are still around along with the ones from this second run.

Nevermind. They ended when I quit RStudio.

kevinushey commented 4 years ago

Yes indeed, they'll exit when the R session is shut down. (I'm not sure whether Python tries to re-use the existing child sessions, or if they should normally be shut down after running the requisite code, though.)

ercbk commented 4 years ago

Ran a similar script overnight through RScript.exe. It's been over 3.5 hrs since the job finished, and the python instances didn't terminate. So, I guess these might need to be shut down.

vermosen commented 3 years ago


I see a similar problem on CentOS 8 using both base R and RStudio. Consider the following R script:


library(reticulate)
use_condaenv(condaenv = 'py38', conda = '/opt/miniconda/bin/conda')

sk        <- NULL
sk$ds  <- import("sklearn.datasets")
sk$ms <- import("sklearn.model_selection")
sk$da  <- import("sklearn.discriminant_analysis")

# define dataset
data <- sk$ds$make_classification(n_samples=1000L, n_features=10L, n_informative=10L, n_redundant=0L, random_state=1L)

# define model
model <- sk$da$LinearDiscriminantAnalysis()

# define model evaluation method
cv <- sk$ms$RepeatedStratifiedKFold(n_splits=10L, n_repeats=3L, random_state=1L)

# evaluate the model
scores <- sk$ms$cross_val_score(model, data[[1]], data[[2]]
                      , scoring='accuracy', cv=cv, n_jobs=2L)

# Error: C stack usage  331680283904 is too close to the limit
cat('score: ', scores)

As soon as the n_jobs value causes forking (i.e. n_jobs != 1), the R session crashes on the last line with the message above. I also end up with stuck Python processes that I have to kill manually.

I checked the executable and _base_executable values mentioned above, and they are both set to the correct Python binary.

Here is my R setup:

R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 8

Matrix products: default
BLAS/LAPACK: /opt/r40/lib64/

 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] reticulate_1.18

loaded via a namespace (and not attached):
[1] compiler_4.0.3  Matrix_1.2-18   Rcpp_1.0.5      grid_4.0.3
[5] jsonlite_1.7.1  lattice_0.20-41

My miniconda env:

name: py38
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.8.5=h7579374_1
  - scikit-learn=0.23.2=py38h0573a6f_0
  - scipy=1.5.2=py38h0b6359f_0
  - numpy=1.19.2=py38h54aff64_0
  - pandas=1.1.3=py38he6710b0_0
  - matplotlib=3.3.2=0
  - joblib=0.17.0=py_0
prefix: /opt/miniconda/envs/py38

shivam7898 commented 2 years ago

Reiterating the comment above, because it took me some time to work out how to do the same in a Python chunk or file on Windows 10:

import sys, os, multiprocessing
q_EXE_PATH = os.path.join(sys.exec_prefix, 'pythonw.exe')
sys.executable = q_EXE_PATH
sys._base_executable = q_EXE_PATH
# assumed final step, mirroring the R workaround above
multiprocessing.set_executable(q_EXE_PATH)
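
A more defensive variant of the same workaround, as a minimal sketch: it only overrides the executable on Windows, only when `sys.executable` does not already look like a Python interpreter, and only if the target file exists. The function name `point_workers_at_python` and its `exe_name` parameter are hypothetical, introduced here for illustration. The calls it relies on (`multiprocessing.set_executable`, `sys.exec_prefix`) are standard library.

```python
import multiprocessing
import os
import sys


def point_workers_at_python(exe_name="pythonw.exe"):
    """Redirect worker launches to a real Python binary when embedded on Windows.

    Returns the path that was applied, or None if nothing was changed.
    """
    if os.name != "nt":
        return None  # the thread only confirms this workaround on Windows
    current = os.path.basename(sys.executable or "").lower()
    if current.startswith("python"):
        return None  # already a Python interpreter; leave it alone
    candidate = os.path.join(sys.exec_prefix, exe_name)
    if not os.path.isfile(candidate):
        return None  # do not point at a file that is not there
    sys.executable = candidate
    sys._base_executable = candidate
    multiprocessing.set_executable(candidate)
    return candidate


applied = point_workers_at_python()
print("executable override:", applied)
```

As with the R version, this should run before anything that starts worker processes. In a normal standalone Python session it is expected to change nothing and report `None`.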