bug during run - Githubissues

tmontana commented 4 years ago

Hi. Please see the crash below, which occurred during a long run. The system first crashed with a message saying that a worker had crashed, possibly because of a memory error. I don't think that was the cause, as memory usage was below 15% of what was available. Running it again on the same directory, I get the following error: IndexError: single positional indexer is out-of-bounds

This is the output:

AutoML directory: ./../models/4B_FINAL_XGB//ensemble_Final_XGBoost_t6
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Xgboost']
AutoML will ensemble availabe models
AutoML steps: ['simple_algorithms', 'default_algorithms', 'not_so_random', 'golden_features', 'insert_random_feature', 'features_selection', 'hill_climbing_1', 'hill_climbing_2', 'ensemble']
Skip simple_algorithms because no parameters were generated.

Step default_algorithms will try to check up to 1 model
Skipping 1_Default_Xgboost, already trained.
Step not_so_random will try to check up to 4 models
Skipping 2_Xgboost, already trained.
Skipping 3_Xgboost, already trained.
Skipping 4_Xgboost, already trained.
Skipping 5_Xgboost, already trained.
Step golden_features will try to check up to 1 model
Skipping 3_Xgboost_GoldenFeatures, already trained.

IndexError Traceback (most recent call last)
in

~/anaconda3/envs/mlj_shap_6/lib/python3.8/site-packages/supervised/automl.py in fit(self, X, y)
    274 AutoML object: Returns self
    275 """
--> 276 return self._fit(X, y)
    277
    278 def predict(self, X):

~/anaconda3/envs/mlj_shap_6/lib/python3.8/site-packages/supervised/base_automl.py in _fit(self, X, y)
    667
    668 except Exception as e:
--> 669 raise e
    670 finally:
    671 if self._X_path is not None:

~/anaconda3/envs/mlj_shap_6/lib/python3.8/site-packages/supervised/base_automl.py in _fit(self, X, y)
    622 generated_params = self._all_params[step]
    623 else:
--> 624 generated_params = tuner.generate_params(
    625 step, self._models, self._results_path, self._stacked_models
    626 )

~/anaconda3/envs/mlj_shap_6/lib/python3.8/site-packages/supervised/tuner/mljar_tuner.py in generate_params(self, step, models, results_path, stacked_models)
     99 return self.get_golden_features_params(models, results_path)
    100 elif step == "insert_random_feature":
--> 101 return self.get_params_to_insert_random_feature(models)
    102 elif step == "features_selection":
    103 return self.get_feature_selection_params(models, results_path)

~/anaconda3/envs/mlj_shap_6/lib/python3.8/site-packages/supervised/tuner/mljar_tuner.py in get_params_to_insert_random_feature(self, current_models)
    433 df_models.sort_values(by="score", ascending=True, inplace=True)
    434
--> 435 m = df_models.iloc[0]["model"]
    436
    437 params = copy.deepcopy(m.params)

~/anaconda3/envs/mlj_shap_6/lib/python3.8/site-packages/pandas/core/indexing.py in __getitem__(self, key)
    877
    878 maybe_callable = com.apply_if_callable(key, self.obj)
--> 879 return self._getitem_axis(maybe_callable, axis=axis)
    880
    881 def _is_scalar_access(self, key: Tuple):

~/anaconda3/envs/mlj_shap_6/lib/python3.8/site-packages/pandas/core/indexing.py in _getitem_axis(self, key, axis)
   1494
   1495 # validate the location
-> 1496 self._validate_integer(key, axis)
   1497
   1498 return self.obj._ixs(key, axis=axis)

~/anaconda3/envs/mlj_shap_6/lib/python3.8/site-packages/pandas/core/indexing.py in _validate_integer(self, key, axis)
   1435 len_axis = len(self.obj._get_axis(axis))
   1436 if key >= len_axis or key < -len_axis:
-> 1437 raise IndexError("single positional indexer is out-of-bounds")
   1438
   1439 # -------------------------------------------------------------------

IndexError: single positional indexer is out-of-bounds

tmontana commented 4 years ago

training parameters:

model_types = ["Xgboost"]
automl = AutoML(
    results_path=experiment_name,
    total_time_limit=600 * 10,
    model_time_limit=600,
    algorithms=model_types,
    golden_features=True,
    feature_selection=True,
    train_ensemble=True,
    explain_level=0,
    stack_models=False,
    validation_strategy={
        "validation_type": "kfold",
        "k_folds": 3,
        "shuffle": False,
        "stratify": True,
    },
    mode="Perform"
)

tmontana commented 4 years ago

This is the actual error that caused the crash in the first place:

Created 14 Golden Features in 20.29 seconds.
3_Xgboost_GoldenFeatures logloss 0.632775 trained in 475.53 seconds

Step insert_random_feature will try to check up to 1 model

TerminatedWorkerError Traceback (most recent call last)
in

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/supervised/automl.py in fit(self, X, y)
    274 AutoML object: Returns self
    275 """
--> 276 return self._fit(X, y)
    277
    278 def predict(self, X):

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/supervised/base_automl.py in _fit(self, X, y)
    667
    668 except Exception as e:
--> 669 raise e
    670 finally:
    671 if self._X_path is not None:

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/supervised/base_automl.py in _fit(self, X, y)
    654 trained = self.ensemble_step(is_stacked=params["is_stacked"])
    655 else:
--> 656 trained = self.train_model(params)
    657
    658 params["status"] = "trained" if trained else "skipped"

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/supervised/base_automl.py in train_model(self, params)
    227 f"Train model #{len(self._models)+1} / Model name: {params['name']}"
    228 )
--> 229 mf.train(model_path)
    230
    231 # save the model

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/supervised/model_framework.py in train(self, model_path)
    166 self.callbacks.on_learner_train_end()
    167
--> 168 learner.interpret(
    169 X_train,
    170 y_train,

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/supervised/algorithms/algorithm.py in interpret(self, X_train, y_train, X_validation, y_validation, model_file_path, learner_name, target_name, class_names, metric_name, ml_task, explain_level)
     70 return
     71 if explain_level > 0:
---> 72 PermutationImportance.compute_and_plot(
     73 self,
     74 X_validation,

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/supervised/utils/importance.py in compute_and_plot(model, X_validation, y_validation, model_file_path, learner_name, metric_name, ml_task)
     51 with warnings.catch_warnings():
     52 warnings.simplefilter("ignore")
---> 53 importance = permutation_importance(
     54 model,
     55 X_validation,

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
     70 FutureWarning)
     71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
     73 return inner_f
     74

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/sklearn/inspection/_permutation_importance.py in permutation_importance(estimator, X, y, scoring, n_repeats, n_jobs, random_state)
    133 baseline_score = scorer(estimator, X, y)
    134
--> 135 scores = Parallel(n_jobs=n_jobs)(delayed(_calculate_permutation_scores)(
    136 estimator, X, y, col_idx, random_seed, n_repeats, scorer
    137 ) for col_idx in range(X.shape[1]))

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/joblib/parallel.py in __call__(self, iterable)
   1015
   1016 with self._backend.retrieval_context():
-> 1017 self.retrieve()
   1018 # Make sure that we get a last message telling us we are done
   1019 elapsed_time = time.time() - self._start_time

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/joblib/parallel.py in retrieve(self)
    907 try:
    908 if getattr(self._backend, 'supports_timeout', False):
--> 909 self._output.extend(job.get(timeout=self.timeout))
    910 else:
    911 self._output.extend(job.get())

~/anaconda3/envs/mlj_6/lib/python3.8/site-packages/joblib/_parallel_backends.py in wrap_future_result(future, timeout)
    560 AsyncResults.get from multiprocessing."""
    561 try:
--> 562 return future.result(timeout=timeout)
    563 except LokyTimeoutError:
    564 raise TimeoutError()

~/anaconda3/envs/mlj_6/lib/python3.8/concurrent/futures/_base.py in result(self, timeout)
    437 raise CancelledError()
    438 elif self._state == FINISHED:
--> 439 return self.__get_result()
    440 else:
    441 raise TimeoutError()

~/anaconda3/envs/mlj_6/lib/python3.8/concurrent/futures/_base.py in __get_result(self)
    386 def __get_result(self):
    387 if self._exception:
--> 388 raise self._exception
    389 else:
    390 return self._result

TerminatedWorkerError: A worker process managed by the executor was unexpectedly terminated. This could be caused by a segmentation fault while calling the function or by an excessive memory usage causing the Operating System to kill the worker. The exit codes of the workers are {SIGKILL(-9), SIGKILL(-9)}

pplonski commented 4 years ago

Looks like there might be 2 bugs. I will try to reproduce it with random data.

pplonski commented 4 years ago

The first problem:

I've added exception handling to the permutation-importance computation. If an exception is raised, we skip computing the importance and continue training.
I've searched for similar problems in sklearn: https://github.com/scikit-learn-contrib/skope-rules/issues/18 and https://stackoverflow.com/questions/54139403/how-do-i-fix-debug-this-multi-process-terminated-worker-error-thrown-in-scikit-l It looks like it may be a problem with the scipy installation. TensorFlow depends on an old scipy (1.4.1), while the new scikit-learn (0.23.2) requires the newest scipy (1.5.2), so there might be a problem during the installation of mljar-supervised. Please see https://github.com/mljar/mljar-supervised/issues/167 for a possibly related report. I wouldn't be surprised if these issues are connected.
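
For readers following along, here is a minimal sketch of the skip-on-failure idea from the first point. It is not the actual mljar-supervised change: the helper name `compute_importance_or_skip` and the two stand-in functions are hypothetical, and the sketch only assumes that the worker crash surfaces as a regular Python exception, as it does in the traceback above.

```python
import logging

logger = logging.getLogger(__name__)


def compute_importance_or_skip(compute_fn, *args, **kwargs):
    """Run compute_fn and return its result, or None if it raises.

    Hypothetical helper: the caller treats None as "importance was
    skipped" and carries on with training.
    """
    try:
        return compute_fn(*args, **kwargs)
    except Exception as exc:
        # Log the failure and skip the importance step instead of aborting.
        logger.warning("Skipping permutation importance: %s", exc)
        return None


def failing_importance(model, X, y):
    # Hypothetical stand-in that simulates a crashed worker.
    raise RuntimeError("A worker process was unexpectedly terminated")


def working_importance(model, X, y):
    # Hypothetical stand-in that returns one score per column of X.
    return [0.0 for _ in X[0]]
```

With this wrapper, a failed importance computation yields `None` and the surrounding training loop can continue, at the cost of having no importance values for that model.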

pplonski commented 4 years ago

The second problem was that already trained models were not loaded when training was resumed. I fixed this.
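
This explains the IndexError in the first traceback: with no models loaded, the table of candidate models is empty, so taking its first row fails. The fix described above is to load the trained models on resume. As a complementary illustration only, here is a sketch of a defensive guard around the "pick the best model" step. The function `select_best_model`, the dictionary layout, and the first score in the usage example are assumptions for illustration, not the library's code.

```python
def select_best_model(models):
    """Return the entry with the lowest score, or None if there are no models.

    `models` is assumed to be a list of dicts with "name" and "score" keys;
    a lower score is better, as with logloss.
    """
    if not models:
        # Nothing was loaded or trained, so there is no best model to pick.
        return None
    return min(models, key=lambda entry: entry["score"])


# Illustrative input; the first score is made up for this example.
trained = [
    {"name": "1_Default_Xgboost", "score": 0.65},
    {"name": "3_Xgboost_GoldenFeatures", "score": 0.632775},
]
best = select_best_model(trained)
```

A caller can then skip the step when `select_best_model` returns `None` rather than failing on an empty selection.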

All changes will be included in the next release, 0.7.2.

I'm closing the issue. If you observe anything suspicious, please reopen it.

mljar / mljar-supervised

bug during run #185
