mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Memory Issues with "Explain" #465

Open deetungsten opened 3 years ago

deetungsten commented 3 years ago

Hello,

I am not sure if this is related to #381, but I have memory issues when I use "Explain". The main symptom is that my RAM (16GB) and my swap (2GB) both max out. At this point my computer becomes unresponsive and I need to do a hard reset. Every other mode works (Compete, Perform, Optuna) and I have been getting really good results. I wanted to try pruning some of the features but have not been able to find the most important features due to this issue.
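One way to avoid the hard reset while investigating is to cap the memory of the Python process so that an oversized allocation fails inside the process instead of exhausting RAM and swap. The following is a minimal sketch for Linux using the standard-library resource module; cap_address_space is a hypothetical helper name, and the 12 GiB figure in the usage comment is only an illustrative value. Note that RLIMIT_AS limits virtual address space rather than resident memory, so the cap may need some headroom, and native libraries may abort rather than raise MemoryError.

```python
import resource


def cap_address_space(max_gib: float) -> None:
    """Cap this process's virtual address space (Linux).

    Allocations beyond the cap fail (typically as MemoryError in Python code)
    instead of pushing the machine into swap. Hypothetical helper for illustration.
    """
    soft = int(max_gib * 2**30)
    _, hard = resource.getrlimit(resource.RLIMIT_AS)
    # The soft limit may not exceed a finite hard limit.
    if hard != resource.RLIM_INFINITY:
        soft = min(soft, hard)
    resource.setrlimit(resource.RLIMIT_AS, (soft, hard))


# Usage: call once at the top of the notebook, before training.
# cap_address_space(12)  # illustrative value, below the 16GB of physical RAM
```

The limit applies to the current process and is inherited by child processes started afterwards; restarting the notebook kernel removes it.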

Thanks!

Dee

pplonski commented 3 years ago

@deetungsten thank you for reporting the issue.

What operating system are you using? Can you send the code that you used? Ideally with data or some description of your data (number of columns and number of rows). Do you have some output before the crash? Any idea when the problem occurs (before or after feature selection)?

If the other modes are working, then Explain should work as well. In the Explain mode, the number of models trained is the lowest compared to the other modes, so the RAM usage should be the lowest as well.

deetungsten commented 3 years ago

Thank you for the prompt reply! I am currently using Ubuntu 20.04, a conda + Python 3.8 environment, 16GB of RAM, a 512 GB SSD, and an older i7-4700. I run all the code in a Jupyter notebook. My dataset is 74497 rows × 58 columns with a train/test split of 0.30, all processed using pandas and scikit-learn. All values are floating-point numbers within a reasonable range.
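For scale, here is a rough back-of-the-envelope estimate of the in-memory size of a table with this shape. It assumes every value is stored as float64 (8 bytes); the post says the values are floating-point but does not state the dtype, so that is an assumption, as is counting all 58 columns.

```python
# Rough size of a dense 74497 x 58 table, assuming float64 (8 bytes per value).
ROWS, COLS = 74497, 58
BYTES_PER_VALUE = 8  # assumption: float64; float32 would halve the result


def dense_size_mib(rows: int, cols: int, bytes_per_value: int = BYTES_PER_VALUE) -> float:
    """Return the size in MiB of a dense numeric table."""
    return rows * cols * bytes_per_value / 2**20


size = dense_size_mib(ROWS, COLS)
print(f"{size:.1f} MiB")  # 33.0 MiB
```

Under that assumption the raw data occupies a few tens of MiB, far below 16GB, which suggests the memory growth comes from the processing rather than from the dataset itself.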

There is unfortunately no meaningful output. The output folder has one folder called EDA and one image that shows the number of 0 and 1 in the binary classification. I start the run using automl = AutoML(mode="Explain") and it does the usual warmup output.

Linear algorithm was disabled.
AutoML directory: AutoML_1
The task is binary_classification with evaluation metric logloss
AutoML will use algorithms: ['Baseline', 'Decision Tree', 'Random Forest', 'Xgboost', 'Neural Network']
AutoML will ensemble availabe models

I use htop to monitor the performance and it seems like only one CPU core is active at a time. The RAM usage slowly creeps up to the maximum, then the swap fills up as well, and then the computer becomes unresponsive and requires a hard reset.

The weird thing, as you mentioned, is that all the other modes work fine even though all 8 cores fire up during the run. The RAM never goes beyond around 13GB in the other modes.
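The same creep can be recorded from inside the notebook, which leaves a trace even if the machine later freezes. This is a minimal sketch using only the standard library; MemoryLogger and peak_rss_mib are hypothetical names, the list allocation is a stand-in for the real workload, and the unit conversion assumes Linux, where ru_maxrss is reported in KiB. It tracks the peak resident set size, so the values never decrease.

```python
import resource
import threading
import time


def peak_rss_mib() -> float:
    """Peak resident set size of this process in MiB (Linux reports KiB)."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


class MemoryLogger(threading.Thread):
    """Background thread that samples peak RSS at a fixed interval."""

    def __init__(self, interval_s: float = 1.0):
        super().__init__(daemon=True)
        self.interval_s = interval_s
        self.samples = []
        self._stop_event = threading.Event()

    def run(self):
        # Take a sample, then wait for the interval or for stop(), whichever is first.
        while not self._stop_event.is_set():
            self.samples.append(peak_rss_mib())
            self._stop_event.wait(self.interval_s)

    def stop(self):
        self._stop_event.set()
        self.join()


logger = MemoryLogger(interval_s=0.1)
logger.start()
data = [0.0] * 1_000_000  # stand-in for the real workload, e.g. the AutoML fit call
time.sleep(0.3)
logger.stop()
print(f"peak RSS: {max(logger.samples):.0f} MiB over {len(logger.samples)} samples")
```

Writing the samples to a file inside the loop instead of keeping them in a list would preserve them across a hard reset.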

Thanks!

Dee

deetungsten commented 3 years ago

After some trial and error, a major symptom is that the memory leaks when explain_level=2. Every other explain_level works fine, so it might be the SHAP portion that is eating up all the memory (and not releasing it?). I haven't stared at it long enough to be confident, but I believe it's being held up when the MljarTuner class is called.
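Since the problem appears only with explain_level=2, one workaround to try is setting the level explicitly when creating the AutoML object. This is a minimal sketch, assuming the explain_level constructor argument accepts 0, 1, or 2 as described in the mljar-supervised documentation, with SHAP explanations produced only at level 2. make_automl is a hypothetical helper, not part of the package, and the import is deferred so the argument check runs even where the package is not installed.

```python
def make_automl(explain_level: int = 1):
    """Create an AutoML object in Explain mode with an explicit explain_level.

    make_automl is a hypothetical helper, not part of mljar-supervised.
    """
    if explain_level not in (0, 1, 2):
        raise ValueError(f"explain_level must be 0, 1, or 2, got {explain_level!r}")
    # Deferred import: requires the mljar-supervised package.
    from supervised.automl import AutoML

    return AutoML(mode="Explain", explain_level=explain_level)


# Usage (X_train and y_train are the caller's own data):
# automl = make_automl(explain_level=1)
# automl.fit(X_train, y_train)
```

According to the documentation, level 1 still reports permutation-based feature importance, which may be enough for pruning features while the level 2 issue is investigated.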

pplonski commented 3 years ago

@deetungsten in your example, do you see only the EDA folder and no other folders? Maybe there is something in the EDA code that eats all the memory? How many files do you have in the EDA folder? Do you have one file per column?

leocd91 commented 3 years ago

Yes, I can confirm: mine stops too when using explain_level=2; other than that it is fine. There is only the EDA folder, and it stops while making a distribution plot for each column.

pplonski commented 3 years ago

@leocd91 thanks for reporting.

I'm considering removing the EDA step from AutoML. I'm using pandas_profiling for EDA (personally and in MLJAR Studio). What do you think @leocd91 @deetungsten? The pandas_profiling package is more advanced in EDA, and it is better to focus on ML model building in AutoML.