ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

4/5 trials fail due to lack of memory #4010

Open diegotxegp opened 1 month ago

diegotxegp commented 1 month ago

**Describe the bug**
4/5 trials fail due to lack of memory. I have 4 x RTX 2080 Super GPUs (8 GB each) and 64 GB RAM, but it seems that AutoML doesn't recognize my GPUs well enough to make the most of them.

**To Reproduce**
Use the "Rotten Tomatoes" example from the Ludwig AI website. If you have more than one GPU, you should be able to reproduce this error.

```python
from ludwig.automl import auto_train

auto_train_results = auto_train(
    dataset=self.df,
    target="recommended",
    time_limit_s=7200,
)
```

**Expected behavior**
All 5 trials run and produce different results, not a single successful trial plus 4 errors due to lack of memory.

**Screenshots**

```
Trial trial_78e53127 completed after 11 iterations at 2024-05-28 13:24:32. Total running time: 21min 24s

Trial status: 4 ERROR | 1 TERMINATED
Current time: 2024-05-28 13:24:32. Total running time: 21min 24s
Logical resource usage: 0/20 CPUs, 1.0/4 GPUs (0.0/1.0 accelerator_type:G)
Current best trial: 78e53127 with metric_score=0.9420865774154663 and params={'trainer.learning_rate': 2.2103375806114728e-05, 'trainer.batch_size': 64, 'combiner.num_fc_layers': 1, 'combiner.output_size': 128, 'combiner.dropout': 0.012855425737772442}
```

| Trial name     | status     | trainer.learning_rate | trainer.batch_size | combiner.num_fc_layers | combiner.output_size | combiner.dropout | iter | total time (s) | metric_score |
|----------------|------------|-----------------------|--------------------|------------------------|----------------------|------------------|------|----------------|--------------|
| trial_78e53127 | TERMINATED | 2.21034e-05           | 64                 | 1                      | 128                  | 0.0128554        | 11   | 1259.7         | 0.942087     |
| trial_6a7803f9 | ERROR      | 3.10601e-05           | 1024               | 3                      | 256                  | 0.0093055        |      |                |              |
| trial_6950b4ec | ERROR      | 0.000337902           | 1024               | 2                      | 256                  | 0.086701         |      |                |              |
| trial_6225efbb | ERROR      | 0.000705436           | 1024               | 1                      | 128                  | 0.0393212        |      |                |              |
| trial_5d372a47 | ERROR      | 0.000517778           | 1024               | 3                      | 128                  | 0.0782563        |      |                |              |

Number of errored trials: 4

| Trial name     | # failures | error file                                                               |
|----------------|------------|--------------------------------------------------------------------------|
| trial_6a7803f9 | 1          | /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6a7803f9/error.txt |
| trial_6950b4ec | 1          | /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6950b4ec/error.txt |
| trial_6225efbb | 1          | /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_6225efbb/error.txt |
| trial_5d372a47 | 1          | /home/diego/VSProjects/Rotten-Tomatoes/hyperopt/trial_5d372a47/error.txt |

```
2024-05-28 13:24:32,620 ERROR tune.py:1144 -- Trials did not complete: [trial_6a7803f9, trial_6950b4ec, trial_6225efbb, trial_5d372a47]
2024-05-28 13:24:32,631 WARNING experiment_analysis.py:916 -- Failed to read the results for 4 trials:
```
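As a side note on the status line above: a minimal sketch (not Ludwig code) of how Ray Tune's logical resources bound trial concurrency, using the totals from the log (20 CPUs, 4 GPUs) and the per-trial resources from the AutoML config (5 CPUs, 1 GPU). The function name is hypothetical; with these numbers, 4 of the 5 trials can run at once, each loading a full model onto its own 8 GB card.

```python
def max_parallel_trials(total_cpus, total_gpus, cpus_per_trial, gpus_per_trial):
    """How many trials Ray Tune can schedule concurrently, given logical resources."""
    caps = []
    if cpus_per_trial > 0:
        caps.append(total_cpus // cpus_per_trial)
    if gpus_per_trial > 0:
        caps.append(int(total_gpus // gpus_per_trial))
    return min(caps) if caps else None

# Totals from the status line, per-trial resources from the AutoML config dump.
print(max_parallel_trials(20, 4, 5, 1))  # → 4
```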

**Environment (please complete the following information):**

**Additional context**
The idea is to use AutoML for its ease of automatic configuration.

arnavgarg1 commented 1 month ago

Hey @diegotxegp,

Are you able to try setting max_concurrent_trials to a value like 1 or 2? https://ludwig.ai/latest/configuration/hyperparameter_optimization/#executor
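For reference, the linked executor docs expose `max_concurrent_trials` under the hyperopt `executor` section. A minimal sketch of such a config as a Python dict — the key names follow the documented schema, while the surrounding values are illustrative, not a recommendation:

```python
# Illustrative Ludwig config fragment: limit hyperopt to one trial at a time
# so a single 8 GB GPU is never shared by several concurrent trials.
config = {
    "hyperopt": {
        "executor": {
            "type": "ray",
            "num_samples": 5,
            "max_concurrent_trials": 1,  # serialize trials
            "gpu_resources_per_trial": 1,
        },
    },
}
print(config["hyperopt"]["executor"]["max_concurrent_trials"])  # → 1
```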

Regarding GPU usage - is your CUDA_VISIBLE_DEVICES environment variable set?

diegotxegp commented 1 month ago

Thank you for your quick response.

The point is that I am trying to do this automatically with AutoML. After the error was raised, I added `os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3"` as you suggested, and I set `max_concurrent_trials` as follows, but it made little difference:

Code:

```python
from ludwig.automl import auto_train

auto_train_results = auto_train(
    dataset=self.df,
    target=selected_targets[0],
    time_limit_s=7200,
    num_samples=4,
    cpu_resources_per_trial=5,
    gpu_resources_per_trial=1,
    max_concurrent_trials=1,
)
```

AutoML config:

```python
{'eval_split': 'validation',
 'executor': {'cpu_resources_per_trial': 5,
              'gpu_resources_per_trial': 1,
              'kubernetes_namespace': None,
              'max_concurrent_trials': None,
              'num_samples': 5,
              'scheduler': {'brackets': 1,
                            'grace_period': 72,
                            'max_t': 7200,
                            'metric': None,
                            'mode': None,
                            'reduction_factor': 5.0,
                            'stop_last_trials': True,
                            'time_attr': 'time_total_s',
                            'type': 'async_hyperband'},
              'time_budget_s': 7200,
              'trial_driver_resources': {'CPU': 1, 'GPU': 0},
              'type': 'ray'},
 'goal': 'maximize',
 'metric': 'roc_auc',
 'output_feature': 'recommended',
 'parameters': {'combiner.dropout': {'lower': 0.0,
                                     'space': 'uniform',
                                     'upper': 0.1},
                'combiner.num_fc_layers': {'lower': 1,
                                           'space': 'randint',
                                           'upper': 4},
                'combiner.output_size': {'categories': [128, 256],
                                         'space': 'choice'},
                'trainer.batch_size': {'categories': [64, 128, 256, 512, 1024],
                                       'space': 'choice'},
                'trainer.learning_rate': {'lower': 2e-05,
                                          'space': 'loguniform',
                                          'upper': 0.001}},
 'search_alg': {'type': 'hyperopt'},
 'split': 'validation'}
```
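One observation on the dump above: every errored trial sampled `trainer.batch_size=1024`, and the printed config still shows `'max_concurrent_trials': None`. A hedged workaround sketch, operating only on the `parameters` block printed above: trim batch-size candidates that are unlikely to fit on an 8 GB card before training. `cap_batch_sizes` is a hypothetical helper, not a Ludwig API.

```python
def cap_batch_sizes(automl_params, max_batch_size=256):
    """Drop batch-size candidates above max_batch_size from the search space."""
    space = automl_params["parameters"]["trainer.batch_size"]
    space["categories"] = [b for b in space["categories"] if b <= max_batch_size]
    return automl_params

# Same search space as the config dump above.
params = {"parameters": {"trainer.batch_size": {"categories": [64, 128, 256, 512, 1024],
                                                "space": "choice"}}}
print(cap_batch_sizes(params)["parameters"]["trainer.batch_size"]["categories"])
# → [64, 128, 256]
```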