Could you please provide a minimal reproduction case?
@ogrisel The code is fairly intertwined at the moment, so creating a minimal reproduction will be difficult. If you have some debugging strategies for this type of issue, I may be able to narrow it down first.
Here is what my code looks like. This is a multi-class, multi-label problem transformed into multiple single-class problems:
```python
from sklearn.feature_extraction.text import CountVectorizer
from skrules import SkopeRules

# has_topic, sample_or_all, get_text, num_pos_training_data and topics
# are defined elsewhere in my code
neg_to_pos_ratio = 1.0

all_training_data = [{'id': '...', 'headline': ..., 'text': ..., 'topics': [..., ...]}]
all_test_data = [{'id': '...', 'headline': ..., 'text': ..., 'topics': [..., ...]}]

def process(topic):
    # find tagged data for topic (positive) and the remaining data ("negative")
    positive_data = [row for row in all_training_data if has_topic(topic, row)]
    negative_data = [row for row in all_training_data if not has_topic(topic, row)]

    # sample negative data to balance positive data
    sampled_positive_data = sample_or_all(positive_data, num_pos_training_data)
    sampled_negative_data = sample_or_all(negative_data, len(sampled_positive_data) * neg_to_pos_ratio)

    # create balanced training data
    training_data = sampled_positive_data + sampled_negative_data
    training_data_labels = [has_topic(topic, row) for row in training_data]
    training_data_stories = [get_text(story) for story in training_data]

    featurizer = CountVectorizer(
        stop_words='english',
        max_df=0.9,
        min_df=0.01,
        binary=True,
        analyzer='word')
    features = featurizer.fit_transform(training_data_stories).toarray()

    clf = SkopeRules(max_depth_duplication=2,
                     n_estimators=10,
                     precision_min=0.5,
                     recall_min=0.1,
                     verbose=2,
                     n_jobs=-1,
                     feature_names=["w_" + x.replace(' ', '_') for x in featurizer.get_feature_names()])
    clf.fit(features, training_data_labels)
    # .... more code here

for topic in topics:
    result = process(topic)
```
The error triggers on `clf.fit`. It always crashes after the same number of topics get processed (2 or 3). I watched the process with `top`, and memory usage seems fine. I'm running on an EC2 instance with 32 GB of memory and 8 cores. If I remove `n_jobs`, the script runs to completion.
This is probably unrelated to skope-rules, as `n_jobs` is just passed through to the scikit-learn ensemble estimators.
I'm facing a similar issue... I have 50k datapoints and ample memory.
CODE:

```python
%%time
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_curve, auc

# xtr, xcv, ycv and dtrain are defined earlier in the notebook
auc_cv_dict = {}
auc_tr_dict = {}

for i in range(3, 50, 4):
    knn = KNeighborsClassifier(n_neighbors=i, algorithm='brute', weights='uniform', n_jobs=-1)
    knn.fit(xtr, dtrain['numeric_score'])

    # performance metrics for cv data:
    y_pred_cv = knn.predict_proba(xcv)
    fpr_cv, tpr_cv, thresholds_cv = roc_curve(ycv, y_pred_cv[:, 1])
    auc_cv_dict[i] = auc(fpr_cv, tpr_cv)

    # performance metrics for training data:
    y_pred_tr = knn.predict_proba(xtr)
    fpr_tr, tpr_tr, thresholds_tr = roc_curve(dtrain['numeric_score'], y_pred_tr[:, 1])
    auc_tr_dict[i] = auc(fpr_tr, tpr_tr)
```
ERROR:

```
TerminatedWorkerError                     Traceback (most recent call last)
```
I got this from Stack Overflow; it resolved my issue: https://stackoverflow.com/questions/54139403/how-do-i-fix-debug-this-multi-process-terminated-worker-error-thrown-in-scikit-l
I figured out that my scipy module was incompatible with my Windows 10 C++ redistributable version. All I did was download the latest Visual Studio and install the C++ redistributable update that is listed in the "individual components" section. Once I installed that, I restarted my computer and ran:

```python
import scipy
scipy.test()
```

Once that was actually running, I attempted my code block above and it worked. I think what this boils down to is running an old build of Windows 10 with a brand-new version of Python and scipy. This took a LONG time to solve and debug. Hopefully it helps.
```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```

Why am I getting this error?
I'm facing the following error on a Debian-based GCP server:

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```
I'm facing the above error at `clf.fit(x_train_multilabel, y_train)`. I certainly don't know anything about C++ packages, and I'd rather not change anything.
```python
from datetime import datetime
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier

start = datetime.now()
hyper_param = {'estimator__C': [10**-5, 10**-4, 10**-3, 10**-2, 10**-1, 1,
                                10**1, 10**2, 10**3, 10**4, 10**5]}
classifier = OneVsRestClassifier(LogisticRegression(penalty='l1'))
clf = GridSearchCV(classifier, hyper_param, scoring='f1_micro', cv=10, n_jobs=-1)
clf.fit(x_train_multilabel, y_train)
print("Time taken to run this cell :", datetime.now() - start)
```
I am facing the exact same issue while running GridSearchCV. Has anyone found a solution yet?
Getting this now on an Ubuntu EC2 instance (c4.xlarge) with GridSearchCV in a Jupyter notebook:

```python
param_grid = [{
    'vect__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'clf__alpha': (1e-2, 1e-3)}]
gs_clf = GridSearchCV(text_clf_NB, param_grid, n_jobs=-1)
gs_clf = gs_clf.fit(X_train, y_train)
```

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
```
Does anyone here have a small dataset they would be willing to share to create a reproducible example?
GridSearchCV works fine for me with SVC, LinearSVC, MultinomialNB, and RandomForest. I'm facing this problem only with the multilayer perceptron. All attempts with all algorithms used `n_jobs > 1`.
I solved this problem by reinstalling Anaconda. I use Jupyter Notebook on an Ubuntu system.
I have got the same issue with GridSearchCV for RandomForestClassifier and n_jobs=-1 in a Jupyter notebook, running on Paperspace with a GPU+ container. The dataset is the cleaned disaster-messages dataset from Figure Eight. The code is:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline, FeatureUnion

# tokenize and FIXED_SEED are defined elsewhere in my code
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize, ngram_range=(1, 2))),
            ('tfidf', TfidfTransformer(sublinear_tf=True)),
        ]))
    ])),
    ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, class_weight='balanced',
                                                         n_jobs=-1, random_state=FIXED_SEED)))
])

rfc_param_grid = {
    'features__text_pipeline__vect__ngram_range': [(1, 2), (1, 3)],
    'clf__estimator__n_estimators': [10, 100, 500, 1000],
    'clf__estimator__max_depth': [None, 5, 10],
    'clf__estimator__class_weight': ['balanced', 'balanced_subsample'],
}

grid_cv = GridSearchCV(pipeline, param_grid=rfc_param_grid, n_jobs=-1, cv=5, verbose=1)
grid_cv.fit(X_train, y_train)
```

As expected, it does not happen if the pipeline is used alone, without GridSearchCV.
I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!
You can take a look at the OS log to see if you happen to be having such a problem.
Hope it helps you guys!
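As an illustration of that diagnosis, here is a minimal sketch for checking available RAM and swap before choosing an `n_jobs` value. It uses the third-party `psutil` package, which is my own suggestion and is not used anywhere in this thread; the per-worker footprint is a placeholder you would have to measure.

```python
import psutil

# Hypothetical heuristic: assume each worker may need up to PER_WORKER_GB
# of RAM, and size n_jobs so the workers fit in what is currently free.
PER_WORKER_GB = 2  # assumption: adjust to your model's real footprint

available_gb = psutil.virtual_memory().available / 1e9
swap = psutil.swap_memory()
print(f"available RAM: {available_gb:.1f} GB, swap used: {swap.used / 1e9:.1f} GB")

n_jobs = max(1, int(available_gb // PER_WORKER_GB))
print(f"using n_jobs={n_jobs}")
```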
> I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!
>
> You can take a look at the OS log to see if you happen to be having such a problem.
>
> Hope it helps you guys!
Yes, as evidence of this kind of runtime error on Ubuntu, you can see swap memory being allocated in the Swp bar in `htop`.
I have the same issue on Ubuntu 18.04 with 16 GB RAM and Anaconda (Python 3.7 and scikit-learn 0.21) on this simple example:

```python
from sklearn.linear_model import LogisticRegression as LR

# Logistic Regression (with fixed hyper-parameters)
lreg = LR(C=100.,  # fixed "C" hyper-parameter
          multi_class='ovr', solver='newton-cg', class_weight='balanced', n_jobs=4)
lreg.fit(X_train, y_train)         # fit model to data
y_lr = lreg.predict_proba(X_test)  # predict on new data
```
The code fails at the `fit` line with the following message:

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```

When I use `n_jobs=1` the code runs just fine. With any other value of `n_jobs`, including `-1`, it fails with the same message.
I know that this code was running without errors on this dataset with `n_jobs=-1` until now (maybe I updated some Anaconda packages in the meantime; I don't remember).
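When a package update is suspected, as above, scikit-learn ships a helper that prints the whole environment so versions can be compared before and after. A quick sketch:

```python
import sklearn

# Prints the versions of Python, scikit-learn and its dependencies
# (numpy, scipy, joblib, ...) for comparison across environments.
sklearn.show_versions()
```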
Facing the same issue when I tried to run RandomizedSearchCV with `n_jobs` larger than 1. Is there any way to solve this problem now? I'm running on macOS 10.15.1, and my sklearn version is 0.21.3.

```
A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```
Encountered the same issue using RandomizedSearchCV when passing an XGBRegressor wrapped in MultiOutputRegressor. My sklearn version is 0.20.4.

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9)}
```
```python
import xgboost as xg
from sklearn.model_selection import TimeSeriesSplit, RandomizedSearchCV
from sklearn.multioutput import MultiOutputRegressor

mo_jobs = 1
grid_jobs = 40
cv = TimeSeriesSplit(3)
estimator = xg.XGBRegressor()
mo_estimator = MultiOutputRegressor(estimator, n_jobs=mo_jobs)
param_grid = {'estimator__silent': [True],
              'estimator__max_depth': [6, 10, 15, 20],
              'estimator__learning_rate': [0.01, 0.1],
              'estimator__subsample': [0.7, 0.8, 0.9, 1.0],
              'estimator__colsample_bytree': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
              'estimator__colsample_bylevel': [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
              'estimator__min_child_weight': [0.1, 0.5, 1.0, 3.0, 5.0, 7.0, 10.0, 13.0],
              'estimator__gamma': [0, 0.1, 0.25, 0.5],
              'estimator__reg_lambda': [0.1, 1.0, 5.0, 10.0, 50.0],
              'estimator__n_estimators': [100]}
grid = RandomizedSearchCV(estimator=mo_estimator,
                          cv=cv,
                          param_distributions=param_grid,
                          n_iter=10,
                          verbose=2,
                          scoring='neg_mean_squared_error',
                          n_jobs=int(grid_jobs / mo_jobs),
                          pre_dispatch=int(grid_jobs / mo_jobs))
grid.fit(X_train, y_train)
```
Note my cluster has 64 cores, and I am hitting this error while using only 40 of them and only `n_iter=10` of RandomizedSearchCV.
scikit-learn v0.22.1. Similar situation; the program consumes little RAM.

```python
self._clf = BaggingClassifier(base_estimator=algs_objs[0].clf, n_estimators=10,
                              n_jobs=-1, bootstrap=True, max_samples=0.95)
self._clf.fit(X, y)
```

Mitigation: `n_jobs=1`. Important: my algorithms used as the BaggingClassifier base estimators use `n_jobs=1`, not `-1`.
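The point about keeping the inner estimators serial matters in general: nesting `n_jobs=-1` at two levels oversubscribes the CPUs. A minimal sketch of parallelizing at one level only (the estimator choice here is illustrative, not from the comment above):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

# Parallelize at ONE level only: the base estimator stays serial
# (n_jobs=1) while the outer ensemble fans out across cores. If the
# outer n_jobs=-1 still crashes workers, fall back to a small fixed
# value (or 1, as in the mitigation above).
inner = LogisticRegression(n_jobs=1)           # serial base estimator
clf = BaggingClassifier(base_estimator=inner,  # parallel outer ensemble
                        n_estimators=10,
                        bootstrap=True,
                        max_samples=0.95,
                        n_jobs=2)
```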
I am also having this SIGABRT(-6) error as many have already posted here, but when I run the same notebook in Google Colab, I get the following:

```
/usr/local/lib/python3.6/dist-packages/joblib/externals/loky/process_executor.py:706: UserWarning: A worker stopped while some jobs were given to the executor. This can be caused by a too short worker timeout or by a memory leak.
  "timeout or by a memory leak.", UserWarning
```
> scikit-learn v0.22.1. Similar situation; the program consumes little RAM.
>
> ```python
> self._clf = BaggingClassifier(base_estimator=algs_objs[0].clf, n_estimators=10,
>                               n_jobs=-1, bootstrap=True, max_samples=0.95)
> self._clf.fit(X, y)
> ```
>
> Mitigation: `n_jobs=1`. Important: my algorithms used as the BaggingClassifier base estimators use `n_jobs=1`, not `-1`.
Very useful, thanks!
> I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!
>
> You can take a look at the OS log to see if you happen to be having such a problem.
>
> Hope it helps you guys!
This is what made it for me. Turns out allocating all CPUs can be unstable, especially when other independent programs are running that can suddenly have an uncontrolled spike in memory usage.

The full range of `n_jobs` values:

```python
n_jobs = -1  # parallelize across all CPUs (the last element of the range, hence -1)
n_jobs = -2  # parallelize across all CPUs but one (the element before the last, hence -2)
...
n_jobs = 1   # parallelization deactivated
```

So `n_jobs = -2` did it for me and should be enough, and it is clearly more efficient than `n_jobs = 1`.
EDIT: This is, however, only a nice workaround, not a fix, as @seanlseymour says below.
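Equivalently, one can compute an explicit worker count instead of relying on the negative values above; a small sketch:

```python
import os

# Leave one core free for the OS and other programs; equivalent in
# spirit to n_jobs=-2, but explicit and easy to cap further.
n_jobs = max(1, os.cpu_count() - 1)
```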
I'm seeing that I can avoid this issue for some classifiers by setting n_jobs to -2, but not all. For example, LogisticRegression produces this error, as does Bagging; RandomForest, SVC, KNeighborsClassifier, and XGBoost work. The tracebacks on failures don't always point to the same place, consistent with the lack of consistency cited in this thread. Sometimes the issue is at cross_validate, sometimes at learning_curve, sometimes at GridSearchCV or RandomizedSearchCV; all seem to be from sklearn.model_selection. The only other common theme I see is that all the tracebacks hit python3.7/site-packages/joblib/parallel.py. I'm sure this issue did not happen before switching to Catalina, but I'm not sure it was triggered immediately, so perhaps something else, or a combination, is the problem. I'm really hoping someone who understands this much more deeply than I do will dig into this for a real fix. Even if n_jobs=-2 always worked, that's still just a workaround, not a fix, right? Any progress here is greatly appreciated!
My config: macOS Catalina 10.15.5, Python 3.7, Anaconda 4.4.7 (reinstalled per suggestions, no effect), scikit-learn 0.23.1, matplotlib 3.2.1, 16 GB RAM (free RAM is never the actual issue as far as I can tell).
Updating matplotlib did it for me:

```
pip install -U matplotlib
```

macOS Catalina 10.15.6, sklearn 0.23.2, numpy 1.19.1, scipy 1.4.1, Cython 0.29.21, pandas 1.0.5, matplotlib 3.3.1, joblib 0.16.0, threadpoolctl 2.1.0
I'm encountering the error

```
TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {EXIT(1)}
```

when running an instance of GridSearchCV on a DecisionTreeClassifier with `n_jobs != 1`. I tried updating sklearn and matplotlib with conda, but the problem persists. I am able to run RandomForestClassifier with `n_jobs != 1` without any issue.
A workaround for now:

```python
from joblib import parallel_backend

with parallel_backend('threading', n_jobs=8):
    fitGridSearchDecisionTree(data, clf_args)  # my code that calls GridSearchCV.fit with n_jobs=None
```

This uses multithreading rather than multiprocessing (if I understand correctly), but it still results in much faster execution of the grid search.
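For anyone who wants to try this without the helper function above, here is a self-contained sketch of the same idea; the synthetic data and the exact estimator are my own assumptions, not from the comment:

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      {'max_depth': [3, 5, 10, None]},
                      cv=5, n_jobs=None)

# Force joblib to use threads instead of loky worker processes,
# sidestepping the TerminatedWorkerError at the cost of being
# limited by the GIL for workloads that don't release it.
with parallel_backend('threading', n_jobs=8):
    search.fit(X, y)

print(search.best_params_)
```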
I kept getting this error even with n_jobs=1. Turns out I found a hidden error:
```
--------------------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/steve/anaconda3/envs/rapidsai-0.17/lib/python3.7/site-packages/joblib/externals/loky/backend/popen_loky_posix.py", line 197, in <module>
    prep_data = pickle.load(from_parent)
ValueError: unsupported pickle protocol: 5
```
My only workaround was to set `LOKY_PICKLER='pickle'` (see the joblib docs: https://buildmedia.readthedocs.org/media/pdf/joblib/latest/joblib.pdf).

I can't seem to find much info on this... anyone know why the default cloudpickle is using protocol 5? It appears that has to do with Python 3.8, but I have 3.7.8 installed.
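As I understand it, `LOKY_PICKLER` is an environment variable that joblib's loky backend reads at import time, so a minimal sketch of the workaround is to set it before joblib is imported:

```python
import os

# Must be set before joblib/loky is imported, so the backend picks up
# the stdlib pickler instead of cloudpickle.
os.environ['LOKY_PICKLER'] = 'pickle'

import joblib  # imported only after the variable is set
```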
If it helps, I am having this problem while trying to run multiple XGBoost models in parallel. I.e., I use joblib to read multiple copies of an XGBoost model from disk, which then consume incoming MQ messages to make predictions. I do not see high RAM usage in the system monitor (15-20% of RAM is used). The models start and run fine for some time, but at some moment I get a crash with the same error, i.e.
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
  File ".../Python-3.6.3/lib/python3.6/site-packages/joblib/parallel.py", line 930, in __call__
    self.retrieve()
  File ".../Python-3.6.3/lib/python3.6/site-packages/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File ".../Python-3.6.3/lib/python3.6/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File ".../Python-3.6.3/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File ".../Python-3.6.3/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGABRT(-6)}
```
In one test reproducing the problem, if I run 40 models in parallel I get the crash, but if I run 30 models in parallel, the crash does not occur.
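A hedged sketch of capping the worker count below the observed crash threshold; the `predict_one` stub, model paths, and payload here are placeholders for the MQ setup described above, not real code from the comment:

```python
from joblib import Parallel, delayed

MAX_WORKERS = 30  # assumption: the empirically safe value from the test above

def predict_one(model_path, batch):
    # In the setup described above this would load an XGBoost model from
    # disk and predict on an incoming MQ message; here it is a stub.
    return (model_path, len(batch))

model_paths = [f"model_{i}.json" for i in range(40)]  # hypothetical paths
batch = [0.1, 0.2, 0.3]                               # hypothetical payload

results = Parallel(n_jobs=MAX_WORKERS)(
    delayed(predict_one)(p, batch) for p in model_paths
)
```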
> scikit-learn v0.22.1. Similar situation; the program consumes little RAM.
>
> ```python
> self._clf = BaggingClassifier(base_estimator=algs_objs[0].clf, n_estimators=10,
>                               n_jobs=-1, bootstrap=True, max_samples=0.95)
> self._clf.fit(X, y)
> ```
>
> Mitigation: `n_jobs=1`. Important: my algorithms used as the BaggingClassifier base estimators use `n_jobs=1`, not `-1`.
Very helpful!!! Thank you
Is this issue fixed? I am facing a similar error with sklearn.grid_search.RandomizedSearchCV with n_jobs=4, 8 cores, and 2 million rows of data.
> Is this issue fixed? I am facing a similar error with sklearn.grid_search.RandomizedSearchCV with n_jobs=4, 8 cores, and 2 million rows of data.
What kind of model are you searching, a Keras model or an sklearn model? If Keras, I suggest using the Keras Tuner package for that.
Got a similar issue in an AutoML project I'm working on. The solution was to update the `joblib` package to 1.0.1:

```
pip install -U joblib==1.0.1
```
I think we should close this issue. joblib workers can crash for a variety of reasons (e.g. not enough memory on the system to use parallelism, installation problems and so on) and we should open one issue per problem, provided we have enough information to reproduce the problem.
In the comments above, most reports are unrelated to the skope-rules library and do not actually use it at all.
If you face such a problem in your code without importing skope-rules, please:

- try `n_jobs=2` instead of `n_jobs=-1` and monitor RAM usage again before growing the `n_jobs` value;
- report the output of `python -c "import sklearn; sklearn.show_versions()"`;
- try to write a minimal reproducer, for instance using random data generated with the `numpy.random` module (a sketch of such a reproducer follows below). If you do not make the effort to provide us with a minimal reproducer, it's very likely that nobody will be able to help you. A minimal reproducer should be small (e.g. no more than 20 lines of Python) and stand-alone: anyone should be able to execute the code, for instance by copying and pasting the snippet into an IPython or Jupyter session.
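For illustration, a stand-alone reproducer in that spirit might look like the following; this is a sketch using random data and GridSearchCV, and anyone hitting the error would adapt the estimator and `n_jobs` to match their failing setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.randn(1000, 20)           # random data, as suggested above
y = rng.randint(0, 2, size=1000)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {'C': [0.01, 0.1, 1.0, 10.0]},
                      cv=5, n_jobs=2)  # start with n_jobs=2, not -1
search.fit(X, y)
print(search.best_params_)
```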
> I found out what the problem was for me! Since it was using multiple cores, at some point it reached maximum RAM usage, leading the OS to terminate the processes unexpectedly. I just decreased the number of cores used for parallelization for some models (the ones that demanded more memory) and it worked!
>
> You can take a look at the OS log to see if you happen to be having such a problem.
>
> Hope it helps you guys!
I tried something similar, where I set my regressor to have `n_jobs=4` while the grid search is set to use almost all the available CPUs. Is this similar to what you did?
This issue still exists as of 2022. Closing the issue and pretending it went away (or using n_jobs=1 for "parallelization") does not fix it. Demanding "minimal examples" when the issue shows up in complicated working code is also unreasonable. I understand this is a hard-to-track bug, but the above "solutions" are not solutions.
Got the same error. There is a bug, hope the following helps:
192 vCPUs, 786 GB memory, Canonical Ubuntu 22.04 LTS, amd64 jammy image built on 2022-06-09

scikit-learn==1.1.2, joblib==1.1.0, catboost==1.0.6, lightgbm==3.3.2, scipy==1.9.0, scikit-optimize==0.9.0, filelock==3.8.0, progressbar2==4.0.0, numpy==1.23.2, pandas==1.4.3, tabulate==0.8.10, pycoingecko==2.2.0, jinja2==3.1.2, tables==3.7.0, blosc==1.10.6, python==3.10

```
  File "/home/ubuntu/mmm/.env/lib/python3.10/site-packages/joblib/parallel.py", line 1056, in __call__
    self.retrieve()
  File "/home/ubuntu/mmm/.env/lib/python3.10/site-packages/joblib/parallel.py", line 935, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/ubuntu/mmm/.env/lib/python3.10/site-packages/joblib/_parallel_backends.py", line 542, in wrap_future_result
    return future.result(timeout=timeout)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 446, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker.
The exit codes of the workers are {SIGKILL(-9)}
```
Had the same issue while running `sklearn.model_selection.cross_validate` in PyCharm. I resolved it by increasing the heap memory of the IDE; for PyCharm it's 750 MiB by default, which can trigger the TerminatedWorkerError, especially when working with huge datasets. Hope this is helpful.
I keep running into a TerminatedWorkerError when running `clf.fit` with skope-rules. I seem to have ample memory, so I'm unsure what's going on. Any potential ideas?