Could you please provide a minimal reproduction case?
@ogrisel The code is fairly intertwined at the moment, so creating a minimal reproduction will be difficult. If you have some debugging strategies for this type of issue, I may be able to narrow it down first.
Here is what my code looks like. This is a multi-class, multi-label problem transformed into multiple single-class problems:
```python
from sklearn.feature_extraction.text import CountVectorizer
from skrules import SkopeRules

# has_topic, sample_or_all, get_text, num_pos_training_data and topics
# are defined elsewhere in my code
neg_to_pos_ratio = 1.0

all_training_data = [{'id': '...', 'headline': ..., 'text': ..., 'topics': [..., ...]}]
all_test_data = [{'id': '...', 'headline': ..., 'text': ..., 'topics': [..., ...]}]

def process(topic):
    # find tagged data for topic (positive) and the remaining data ("negative")
    positive_data = [row for row in all_training_data if has_topic(topic, row)]
    negative_data = [row for row in all_training_data if not has_topic(topic, row)]

    # sample negative data to balance positive data
    sampled_positive_data = sample_or_all(positive_data, num_pos_training_data)
    sampled_negative_data = sample_or_all(negative_data, len(sampled_positive_data) * neg_to_pos_ratio)

    # create balanced training data
    training_data = sampled_positive_data + sampled_negative_data
    training_data_labels = [has_topic(topic, row) for row in training_data]
    training_data_stories = [get_text(story) for story in training_data]

    featurizer = CountVectorizer(
        stop_words='english',
        max_df=0.9,
        min_df=0.01,
        binary=True,
        analyzer='word')
    features = featurizer.fit_transform(training_data_stories).toarray()

    clf = SkopeRules(max_depth_duplication=2,
                     n_estimators=10,
                     precision_min=0.5,
                     recall_min=0.1,
                     verbose=2,
                     n_jobs=-1,
                     feature_names=["w_" + x.replace(' ', '_') for x in featurizer.get_feature_names()])
    clf.fit(features, training_data_labels)
    # .... more code here

for topic in topics:
    result = process(topic)
```
The error triggers on `clf.fit`. It always crashes after the same number of topics get processed (2 or 3). I watched the process with `top`, and memory usage seems fine. I'm running on an EC2 instance with 32 GB of memory and 8 cores. If I remove `n_jobs`, the script runs to completion.
This is probably unrelated to skope-rules, as `n_jobs` is just passed through to the scikit-learn ensemble estimators.
I'm facing a similar issue... I have 50k datapoints and ample memory.
CODE:

```python
%%time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, auc

# xtr, xcv, ycv and dtrain are defined earlier in the notebook
auc_cv_dict = {}
auc_tr_dict = {}

for i in range(3, 50, 4):
    knn = KNeighborsClassifier(n_neighbors=i, algorithm='brute', weights='uniform', n_jobs=-1)
    knn.fit(xtr, dtrain['numeric_score'])

    # performance metrics for cv data:
    y_pred_cv = knn.predict_proba(xcv)
    fpr_cv, tpr_cv, thresholds_cv = roc_curve(ycv, y_pred_cv[:, 1])
    auc_cv_dict[i] = auc(fpr_cv, tpr_cv)

    # performance metrics for training data:
    y_pred_tr = knn.predict_proba(xtr)
    fpr_tr, tpr_tr, thresholds_tr = roc_curve(dtrain['numeric_score'], y_pred_tr[:, 1])
    auc_tr_dict[i] = auc(fpr_tr, tpr_tr)
```
ERROR:

```
TerminatedWorkerError                     Traceback (most recent call last)
```
I got this from Stack Overflow; it resolved my issue: https://stackoverflow.com/questions/54139403/how-do-i-fix-debug-this-multi-process-terminated-worker-error-thrown-in-scikit-l
I figured out that my scipy module was incompatible with my Windows 10 C++ redistributable version. All I did was download the latest Visual Studio and install the C++ redistributable update that is listed in the "individual components" section. Once I installed that, I restarted my computer and ran:

```python
import scipy
scipy.test()
```

Once that was actually running, I attempted my code block above and it worked. I think what this boils down to is running an old build of Windows 10 with a brand-new version of Python and scipy. This took a LONG time to solve and debug. Hopefully it helps.
```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```

Why am I getting this error?
I'm facing the following error on a Debian-based GCP server:

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```
I'm facing the above error at `clf.fit(x_train_multilabel, y_train)`. I certainly don't know anything about C++ packages, and I'd rather not change anything.
```python
from datetime import datetime
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

start = datetime.now()
hyper_param = {'estimator__C': [10**-5, 10**-4, 10**-3, 10**-2, 10**-1, 1,
                                10**1, 10**2, 10**3, 10**4, 10**5]}
classifier = OneVsRestClassifier(LogisticRegression(penalty='l1'))
clf = GridSearchCV(classifier, hyper_param, scoring='f1_micro', cv=10, n_jobs=-1)
clf.fit(x_train_multilabel, y_train)
print("Time taken to run this cell :", datetime.now() - start)
```
I am facing the exact same issue while running GridSearchCV. Has anyone found a solution yet?
Getting this now on an Ubuntu EC2 instance (c4.xlarge) with GridSearchCV in a Jupyter notebook:

```python
param_grid = [{
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__alpha': (1e-2, 1e-3)}]
gs_clf = GridSearchCV(text_clf_NB, param_grid, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)
```

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
```
Does anyone here have a small dataset they would be willing to share to create a reproducible example?
GridSearchCV works fine for me with SVC, LinearSVC, MultinomialNB, and RandomForest. I'm facing this problem only with the multilayer perceptron. All attempts with all algorithms used `n_jobs > 1`.
I solved this problem by reinstalling Anaconda. I use Jupyter Notebook on an Ubuntu system.
I have got the same issue with GridSearchCV for RandomForestClassifier and n_jobs=-1 in a Jupyter notebook, running on Paperspace with a GPU+ container. The dataset is the cleaned disaster-messages dataset from Figure Eight. The code is:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline, FeatureUnion

# tokenize and FIXED_SEED are defined elsewhere in my code
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2))),
            ('tfidf', TfidfTransformer(sublinear_tf=True)),
        ]))
    ])),
    ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                                         n_jobs=-1, random_state=FIXED_SEED)))
])

rfc_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1, 3)],
    'clf__estimator__n_estimators': [10, 100, 500, 1000],
    'clf__estimator__max_depth': [None, 5, 10],
    'clf__estimator__class_weight': ['balanced', 'balanced_subsample'],
}

grid_cv = GridSearchCV(pipeline, param_grid=rfc_param_grid, n_jobs=-1, cv=5, verbose=1)
grid_cv.fit(X_train, y_train)
```

As expected, it does not happen if the pipeline is used alone, without GridSearchCV.
I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!
You can take a look at the OS log to see if you happen to be having such a problem.
Hope it helps you guys!
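As an illustration of that diagnosis, here is a minimal sketch for checking available RAM and swap before choosing an `n_jobs` value. It uses the third-party `psutil` package, which is my own suggestion and is not used anywhere in this thread; the per-worker footprint is a placeholder you would have to measure.

```python
import psutil

# Hypothetical heuristic: assume each worker may need up to PER_WORKER_GB
# of RAM, and size n_jobs so the workers fit in what is currently free.
PER_WORKER_GB = 2  # assumption: adjust to your model's real footprint

available_gb = psutil.virtual_memory().available / 1e9
swap = psutil.swap_memory()
print(f"available RAM: {available_gb:.1f} GB, swap used: {swap.used / 1e9:.1f} GB")

n_jobs = max(1, int(available_gb // PER_WORKER_GB))
print(f"using n_jobs={n_jobs}")
```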
> I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!
>
> You can take a look at the OS log to see if you happen to be having such a problem.
>
> Hope it helps you guys!
Yes, as evidence of this kind of runtime error on Ubuntu, you can see swap memory being allocated in the Swp bar in `htop`.
I have the same issue on Ubuntu 18.04 with 16 GB RAM and Anaconda (Python 3.7 and scikit-learn 0.21) on this simple example:

```python
from sklearn.linear_model import LogisticRegression as LR

# Logistic Regression (with fixed hyper-parameters)
lreg = LR(C=100.,  # fixed "C" hyper-parameter
          multi_class='ovr', solver='newton-cg', class_weight='balanced', n_jobs=4)
lreg.fit(X_train, y_train)         # fit model to data
y_lr = lreg.predict_proba(X_test)  # predict on new data
```
The code fails at the `fit` line with the following message:

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```

When I use `n_jobs=1` the code runs just fine. With any other value of `n_jobs`, including `-1`, it fails with the same message.
I know that this code was running without errors on this dataset with `n_jobs=-1` until now (maybe I updated some Anaconda packages in the meantime; I don't remember).
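When a package update is suspected, as above, scikit-learn ships a helper that prints the whole environment so versions can be compared before and after. A quick sketch:

```python
import sklearn

# Prints the versions of Python, scikit-learn and its dependencies
# (numpy, scipy, joblib, ...) for comparison across environments.
sklearn.show_versions()
```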
Facing the same issue when I tried to run RandomizedSearchCV with `n_jobs` larger than 1. Is there any way to solve this problem now? I'm running on macOS 10.15.1, and my sklearn version is 0.21.3.

```
A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```
Encountered the same issue using RandomizedSearchCV when passing an XGBRegressor wrapped in MultiOutputRegressor. My sklearn version is 0.20.4.

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
```
```python
import xgboost as xg
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.multioutput import MultiOutputRegressor

mo_jobs = 1
grid_jobs = 40
cv = TimeSeriesSplit(3)
estimator = xg.XGBRegressor()
mo_estimator = MultiOutputRegressor(estimator, n_jobs=mo_jobs)
param_grid = {'estimator__silent': [True],
              'estimator__max_depth': [6, 10, 15, 20],
              'estimator__learning_rate': [0.01, 0.1],
              'estimator__subsample': [0.7, 0.8, 0.9, 1.0],
              'estimator__colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
              'estimator__colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
              'estimator__min_child_weight': [0.1, 0.5, 1.0, 3.0, 5.0, 7.0, 10.0, 13.0],
              'estimator__gamma': [0, 0.1, 0.25, 0.5],
              'estimator__reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0],
              'estimator__n_estimators': [100]}
grid = RandomizedSearchCV(estimator=mo_estimator,
                          cv=cv,
                          param_distributions=param_grid,
                          n_iter=10,
                          verbose=2,
                          scoring='neg_mean_squared_error',
                          n_jobs=int(grid_jobs / mo_jobs),
                          pre_dispatch=int(grid_jobs / mo_jobs))
grid.fit(X_train, y_train)
```
Note my cluster has 64 cores, and I am hitting this error while using only 40 of them and only `n_iter=10` of RandomizedSearchCV.
scikit-learn v0.22.1. Similar situation; the program consumes little RAM.

```python
self._clf = BaggingClassifier(base_estimator=algs_objs[0].clf, n_estimators=10,
                              n_jobs=-1, bootstrap=True, max_samples=0.95)
self._clf.fit(X, y)
```

Mitigation: `n_jobs=1`. Important: my algorithms used as the BaggingClassifier base estimators use `n_jobs=1`, not `-1`.
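The point about keeping the inner estimators serial matters in general: nesting `n_jobs=-1` at two levels oversubscribes the CPUs. A minimal sketch of parallelizing at one level only (the estimator choice here is illustrative, not from the comment above):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Parallelize at ONE level only: the base estimator stays serial
# (n_jobs=1) while the outer ensemble fans out across cores. If the
# outer n_jobs=-1 still crashes workers, fall back to a small fixed
# value (or 1, as in the mitigation above).
inner = LogisticRegression(n_jobs=1)           # serial base estimator
clf = BaggingClassifier(base_estimator=inner,  # parallel outer ensemble
                        n_estimators=10,
                        bootstrap=True,
                        max_samples=0.95,
                        n_jobs=2)
```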
I am also having this SIGABRT(-6) error as many have already posted here, but when I run the same notebook in Google Colab, I get the following:

```
/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
```
> scikit-learn v0.22.1. Similar situation; the program consumes little RAM.
>
> ```python
> self._clf = BaggingClassifier(base_estimator=algs_objs[0].clf, n_estimators=10,
>                               n_jobs=-1, bootstrap=True, max_samples=0.95)
> self._clf.fit(X, y)
> ```
>
> Mitigation: `n_jobs=1`. Important: my algorithms used as the BaggingClassifier base estimators use `n_jobs=1`, not `-1`.
Very useful, thanks!
> I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!
>
> You can take a look at the OS log to see if you happen to be having such a problem.
>
> Hope it helps you guys!
This is what made it for me. Turns out allocating all CPUs can be unstable, especially when other independent programs are running that can suddenly have an uncontrolled spike in memory usage.

The full range of `n_jobs` values:

```python
n_jobs = -1  # parallelize across all CPUs (the last element of the range, hence -1)
n_jobs = -2  # parallelize across all CPUs but one (the element before the last, hence -2)
...
n_jobs = 1   # parallelization deactivated
```

So `n_jobs = -2` did it for me and should be enough, and it is clearly more efficient than `n_jobs = 1`.
EDIT: This is, however, only a nice workaround, not a fix, as @seanlseymour says below.
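Equivalently, one can compute an explicit worker count instead of relying on the negative values above; a small sketch:

```python
import os

# Leave one core free for the OS and other programs; equivalent in
# spirit to n_jobs=-2, but explicit and easy to cap further.
n_jobs = max(1, os.cpu_count() - 1)
```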
I'm seeing that I can avoid this issue for some classifiers by setting n_jobs to -2, but not all. For example, LogisticRegression produces this error, as does Bagging; RandomForest, SVC, KNeighborsClassifier, and XGBoost work. The tracebacks on failures don't always point to the same place, consistent with the lack of consistency cited in this thread. Sometimes the issue is at cross_validate, sometimes at learning_curve, sometimes at GridSearchCV or RandomizedSearchCV; all seem to be from sklearn.model_selection. The only other common theme I see is that all the tracebacks hit python3.7/site-packages/joblib/parallel.py. I'm sure this issue did not happen before switching to Catalina, but I'm not sure it was triggered immediately, so perhaps something else, or a combination, is the problem. I'm really hoping someone who understands this much more deeply than I do will dig into this for a real fix. Even if n_jobs=-2 always worked, that's still just a workaround, not a fix, right? Any progress here is greatly appreciated!
My config: macOS Catalina 10.15.5, Python 3.7, Anaconda 4.4.7 (reinstalled per suggestions, no effect), scikit-learn 0.23.1, matplotlib 3.2.1, 16 GB RAM (free RAM is never the actual issue as far as I can tell).
Updating matplotlib did it for me:

```
pip install -U matplotlib
```

macOS Catalina 10.15.6, sklearn 0.23.2, numpy 1.19.1, scipy 1.4.1, Cython 0.29.21, pandas 1.0.5, matplotlib 3.3.1, joblib 0.16.0, threadpoolctl 2.1.0
I'm encountering the error

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {EXIT(1)}
```

when running an instance of GridSearchCV on a DecisionTreeClassifier with `n_jobs != 1`. I tried updating sklearn and matplotlib with conda, but the problem persists. I am able to run RandomForestClassifier with `n_jobs != 1` without any issue.
A workaround for now:

```python
from joblib import parallel_backend

with parallel_backend('threading', n_jobs=8):
    fitGridSearchDecisionTree(data, clf_args)  # my code that calls GridSearchCV.fit with n_jobs=None
```

This uses multithreading rather than multiprocessing (if I understand correctly), but it still results in much faster execution of the grid search.
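For anyone who wants to try this without the helper function above, here is a self-contained sketch of the same idea; the synthetic data and the exact estimator are my own assumptions, not from the comment:

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [3, 5, 10, None]},
                      cv=5, n_jobs=None)

# Force joblib to use threads instead of loky worker processes,
# sidestepping the TerminatedWorkerError at the cost of being
# limited by the GIL for workloads that don't release it.
with parallel_backend('threading', n_jobs=8):
    search.fit(X, y)

print(search.best_params_)
```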
I kept getting this error even with n_jobs=1. Turns out I found a hidden error:
```
--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/steve/anaconda3/envs/rapidsai-0.17/lib/python3.7/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 197, in <module>
    prep_data = pickle.load(from_parent)
ValueError: unsupported pickle protocol: 5
```
My only workaround was to set `LOKY_PICKLER='pickle'` (see the joblib docs: https://buildmedia.readthedocs.org/media/pdf/joblib/latest/joblib.pdf).

I can't seem to find much info on this... anyone know why the default cloudpickle is using protocol 5? It appears that has to do with Python 3.8, but I have 3.7.8 installed.
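As I understand it, `LOKY_PICKLER` is an environment variable that joblib's loky backend reads at import time, so a minimal sketch of the workaround is to set it before joblib is imported:

```python
import os

# Must be set before joblib/loky is imported, so the backend picks up
# the stdlib pickler instead of cloudpickle.
os.environ['LOKY_PICKLER'] = 'pickle'

import joblib  # imported only after the variable is set
```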
If it helps, I am having this problem while trying to run multiple XGBoost models in parallel. I.e., I use joblib to read multiple copies of an XGBoost model from disk, which then consume incoming MQ messages to make predictions. I do not see high RAM usage in the system monitor (15-20% of RAM is used). The models start and run fine for some time, but at some moment I get a crash with the same error, i.e.
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File ".../Python-3.6.3/lib/python3.6/site-packages/joblib/parallel.py", line 930, in __call__
    self.retrieve()
  File ".../Python-3.6.3/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File ".../Python-3.6.3/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File ".../Python-3.6.3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File ".../Python-3.6.3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```
In one test reproducing the problem, if I run 40 models in parallel I get the crash, but if I run 30 models in parallel, the crash does not occur.
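A hedged sketch of capping the worker count below the observed crash threshold; the `predict_one` stub, model paths, and payload here are placeholders for the MQ setup described above, not real code from the comment:

```python
from joblib import Parallel, delayed

MAX_WORKERS = 30  # assumption: the empirically safe value from the test above

def predict_one(model_path, batch):
    # In the setup described above this would load an XGBoost model from
    # disk and predict on an incoming MQ message; here it is a stub.
    return (model_path, len(batch))

model_paths = [f"model_{i}.json" for i in range(40)]  # hypothetical paths
batch = [0.1, 0.2, 0.3]                               # hypothetical payload

results = Parallel(n_jobs=MAX_WORKERS)(
    delayed(predict_one)(p, batch) for p in model_paths
)
```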
> scikit-learn v0.22.1. Similar situation; the program consumes little RAM.
>
> ```python
> self._clf = BaggingClassifier(base_estimator=algs_objs[0].clf, n_estimators=10,
>                               n_jobs=-1, bootstrap=True, max_samples=0.95)
> self._clf.fit(X, y)
> ```
>
> Mitigation: `n_jobs=1`. Important: my algorithms used as the BaggingClassifier base estimators use `n_jobs=1`, not `-1`.
Very helpful!!! Thank you
Is this issue fixed? I am facing a similar error with sklearn.grid_search.RandomizedSearchCV with n_jobs=4, 8 cores, and 2 million rows of data.
> Is this issue fixed? I am facing a similar error with sklearn.grid_search.RandomizedSearchCV with n_jobs=4, 8 cores, and 2 million rows of data.
What kind of model are you searching, a Keras model or an sklearn model? If Keras, I suggest using the Keras Tuner package for that.
Got a similar issue in an AutoML project I'm working on. The solution was to update the `joblib` package to 1.0.1:

```
pip install -U joblib==1.0.1
```
I think we should close this issue. joblib workers can crash for a variety of reasons (e.g. not enough memory on the system to use parallelism, installation problems and so on) and we should open one issue per problem, provided we have enough information to reproduce the problem.
In the comments above, most reports are unrelated to the skope-rules library and do not actually use it at all.
If you face such a problem in your code without importing skope-rules, please:

- try `n_jobs=2` instead of `n_jobs=-1` and monitor RAM usage again before growing the `n_jobs` value;
- report the output of `python -c "import sklearn; sklearn.show_versions()"`;
- try to write a minimal reproducer, for instance using random data generated with the `numpy.random` module (a sketch of such a reproducer follows below). If you do not make the effort to provide us with a minimal reproducer, it's very likely that nobody will be able to help you. A minimal reproducer should be small (e.g. no more than 20 lines of Python) and stand-alone: anyone should be able to execute the code, for instance by copying and pasting the snippet into an IPython or Jupyter session.
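For illustration, a stand-alone reproducer in that spirit might look like the following; this is a sketch using random data and GridSearchCV, and anyone hitting the error would adapt the estimator and `n_jobs` to match their failing setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)           # random data, as suggested above
y = rng.randint(0, 2, size=1000)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.01, 0.1, 1.0, 10.0]},
                      cv=5, n_jobs=2)  # start with n_jobs=2, not -1
search.fit(X, y)
print(search.best_params_)
```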
> I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!
>
> You can take a look at the OS log to see if you happen to be having such a problem.
>
> Hope it helps you guys!
I tried something similar, where I set my regressor to have `n_jobs=4` while the grid search is set to use almost all the available CPUs. Is this similar to what you did?
This issue still exists as of 2022. Closing the issue and pretending it went away (or using n_jobs=1 for "parallelization") does not fix it. Demanding "minimal examples" when the issue shows up in complicated working code is also unreasonable. I understand this is a hard-to-track bug, but the above "solutions" are not solutions.
Got the same error. There is a bug, hope the following helps:
192 vCPUs, 786 GB memory, Canonical Ubuntu 22.04 LTS, amd64 jammy image built on 2022-06-09

scikit-learn==1.1.2, joblib==1.1.0, catboost==1.0.6, lightgbm==3.3.2, scipy==1.9.0, scikit-optimize==0.9.0, filelock==3.8.0, progressbar2==4.0.0, numpy==1.23.2, pandas==1.4.3, tabulate==0.8.10, pycoingecko==2.2.0, jinja2==3.1.2, tables==3.7.0, blosc==1.10.6, python==3.10

```
  File "/home/ubuntu/mmm/.env/lib/python3.10/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/home/ubuntu/mmm/.env/lib/python3.10/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ubuntu/mmm/.env/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
```
Had the same issue while running `sklearn.model_selection.cross_validate` in PyCharm. I resolved it by increasing the heap memory of the IDE; for PyCharm it's 750 MiB by default, which can trigger the TerminatedWorkerError, especially when working with huge datasets. Hope this is helpful.
I keep running into a TerminatedWorkerError when running `clf.fit` with skope-rules. I seem to have ample memory, so I'm unsure what's going on. Any potential ideas?