mljar / mljar-supervised

Python package for AutoML on Tabular Data with Feature Engineering, Hyper-Parameters Tuning, Explanations and Automatic Documentation
https://mljar.com
MIT License

Memory usage during training #381

Open RafaD5 opened 3 years ago

RafaD5 commented 3 years ago

Hi,

I've trained several models with mode="Perform", and when training reaches a certain point the Python process is killed because of memory usage (I'm on a machine with 16 GB of RAM). What I do is rerun the script and change the model name to the name of the run that was just created, in order to resume training. A couple of times I've had to repeat this process twice. It is not caused by a single model but by data from previously trained models that is not released from memory. (screenshot of memory usage attached)
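A rough sketch of that resume pattern, under the assumption that the "model name" corresponds to the results_path argument in current mljar-supervised versions and that an existing results directory is reused rather than overwritten; the data here is purely illustrative:

```python
import numpy as np
from supervised import AutoML

# Illustrative data only.
X = np.random.rand(1000, 10)
y = np.random.rand(1000)

# Assumption: pointing results_path at the directory created by the killed
# run lets AutoML pick up the models already saved there instead of retraining.
automl = AutoML(mode="Perform", results_path="AutoML_1")
automl.fit(X, y)
```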

pplonski commented 3 years ago

Hey @RafaD5! Looks like a bug. I'm pretty sure that data between different folds and models should be cleared. Do you observe the same behavior in the Compete mode? You can set validation_strategy={"validation_type": "kfold", "k_folds": 5, "shuffle": True} to get the same CV as in the Perform mode.
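For reference, a minimal sketch of wiring that validation strategy into AutoML (the constructor arguments follow the snippet above; the data is illustrative only):

```python
import numpy as np
from supervised import AutoML

# Illustrative data.
X = np.random.rand(500, 10)
y = np.random.randint(0, 2, 500)

automl = AutoML(
    mode="Compete",
    validation_strategy={
        "validation_type": "kfold",
        "k_folds": 5,
        "shuffle": True,
    },
)
automl.fit(X, y)
```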

xuzhang5788 commented 3 years ago

@pplonski I've run into the same situation several times. There is a memory leak.

pplonski commented 3 years ago

@xuzhang5788 was it in the Perform mode or in another one?

xuzhang5788 commented 3 years ago

Compete and Optuna modes. In my case, within one notebook, memory accumulated after I ran automl.fit several times, and eventually the kernel got killed. I have to restart my kernel for every new training run.

pplonski commented 3 years ago

@xuzhang5788 thank you, I will work on it. Any help appreciated! :)

pplonski commented 3 years ago

@RafaD5 @xuzhang5788 I made a few changes:

All changes are in the dev branch. You can install it:

pip install -q -U git+https://github.com/mljar/mljar-supervised.git@dev

I'm looking for your feedback! Thank you!

xuzhang5788 commented 3 years ago

It doesn't look like it improved much. I can still see memory being consumed gradually.

pplonski commented 3 years ago

@xuzhang5788 yes, it is not fixed 100%. It should be slightly better and perhaps no longer cause crashes. It looks like algorithms outside the sklearn package don't release memory properly.

I will try to run the ML training in separate processes; maybe this will help, but on the other hand I don't want to over-complicate the code.
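A rough sketch of the general idea (not mljar-supervised's actual implementation): run each training job in a child process so that the OS reclaims all of its memory when the process exits. The training call below is a placeholder.

```python
import multiprocessing as mp
import numpy as np

def _train_worker(X, y, queue):
    # Placeholder for a framework-specific training call; every object the
    # framework allocates lives and dies inside this child process.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    queue.put(coef)

def train_in_subprocess(X, y):
    queue = mp.Queue()
    proc = mp.Process(target=_train_worker, args=(X, y, queue))
    proc.start()
    result = queue.get()   # fetch before join to avoid blocking on large payloads
    proc.join()            # memory held by the child is returned to the OS here
    return result

if __name__ == "__main__":
    X = np.random.rand(100, 5)
    y = np.random.rand(100)
    print(train_in_subprocess(X, y))
```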

brickfrog commented 3 years ago

This is still an issue, correct? I'm curious since I've been tinkering with mljar for Numerai competitions. I seem to run out of memory: it would run for 14 hours overnight and I'd wake up to a stalled computer (I have 64 GB).

pplonski commented 3 years ago

@BrickFrog yes, it is still an issue.

pplonski commented 3 years ago

@BrickFrog have you used a custom eval_metric when running AutoML on Numerai data? It is possible to pass a custom eval_metric, such as the Sharpe ratio, to be optimized. There is also a built-in Spearman correlation eval_metric in MLJAR. Sorry if you couldn't find it in the docs. Please open a GitHub issue and I will fix the docs.
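A hedged sketch of both options mentioned above; the built-in metric name ("spearman"), the callable form of eval_metric, and its (y_true, y_predicted) signature are assumptions to verify against the mljar-supervised docs for your version:

```python
import numpy as np
from supervised import AutoML

# Assumed signature for a user-defined metric: (y_true, y_predicted) -> float.
def negative_sharpe(y_true, y_predicted):
    era_returns = y_predicted * y_true            # toy payoff, for illustration only
    return -np.mean(era_returns) / (np.std(era_returns) + 1e-9)

# Built-in metric (name assumed to be "spearman"):
automl = AutoML(mode="Compete", eval_metric="spearman")

# Or a custom callable, if your mljar-supervised version accepts one:
# automl = AutoML(mode="Compete", eval_metric=negative_sharpe)
```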

It is also possible to set up a custom validation strategy by passing predefined train/validation indices for each fold.
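A sketch of that idea, assuming the "custom" validation_type together with a cv argument to fit() is the mechanism; the exact parameter names should be checked against the docs for your version, and the data is illustrative:

```python
import numpy as np
from supervised import AutoML

X = np.random.rand(1000, 10)   # illustrative data
y = np.random.rand(1000)

# One time-ordered split: first 80% for training, last 20% for validation.
split = int(0.8 * len(X))
folds = [(np.arange(0, split), np.arange(split, len(X)))]

automl = AutoML(validation_strategy={"validation_type": "custom"})
automl.fit(X, y, cv=folds)
```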

I plan to add tutorials/examples showing how mljar-supervised can be used with Numerai data.

What is more, we are working on a visual notebook. It will be a desktop application for data science where the user can click out a solution without heavy coding. I would add blocks for Numerai there (get the latest data, upload a submission). (screenshot of an early development version attached: "Screenshot from 2021-05-31 08-31-33")

sumanttyagi commented 2 years ago

while using "Compete"mode similar issues is still being faced While using "AutoML_class_obj = AutoML(data=data,mode ="Compete",eval_metric = "r2") using in compete mode with around 9998 training samples/records.Either it is getting crashed or it goes on with too many python programs running in task manager. 1.UserWarning:MiniBatchKMeans is known to have a memory leak on Windows with MKL, when there are less chunks than available threads. You can prevent it by setting batch_size >= 9216 or by setting the environment variable OMP_NUM_THREADS=1 2.OSError: [WinError 1455] The paging file is too small for this operation to complete

pplonski commented 2 years ago

@sumanttyagi thank you for reporting. I understand that you are on a Windows system. Could you please post the full code with a data sample to reproduce the issue? Is that possible?