ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

CUDA out of memory error during batch evaluation #1648

Closed · tgaddair closed this 2 years ago

tgaddair commented 2 years ago

This occurs when using AutoML on the Higgs dataset with PyTorch:

Traceback (most recent call last):
  File "run_auto_train_1hr.py", line 13, in <module>
    tune_for_memory=False
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/automl.py", line 103, in auto_train
    return train_with_config(dataset, config, output_directory=output_directory, random_seed=random_seed, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/automl.py", line 172, in train_with_config
    config, dataset, output_directory=output_directory, model_name=model_name, random_seed=random_seed, **kwargs
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/automl/automl.py", line 262, in _train
    **kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/hyperopt/run.py", line 331, in hyperopt
    **kwargs,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/hyperopt/execution.py", line 740, in execute
    debug=debug,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/api.py", line 833, in evaluate
    collect_predictions=collect_predictions or collect_overall_stats,
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/models/predictor.py", line 174, in batch_evaluation
    preds = self.model.evaluation_step(inputs, targets)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/models/ecd.py", line 187, in evaluation_step
    predictions = self.predictions(inputs, output_features=None)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/models/ecd.py", line 178, in predictions
    outputs = self(inputs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/models/ecd.py", line 133, in forward
    combiner_outputs = self.combiner(encoder_outputs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/combiners/combiners.py", line 491, in forward
    hidden, aggregated_mask, masks = self.tabnet(hidden)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/modules/tabnet_modules.py", line 111, in forward
    x = self.feature_transforms[0](masked_features)  # [b_s, s + o_s]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/modules/tabnet_modules.py", line 314, in forward
    hidden = (self.blocks[n](hidden) + hidden) * (0.5 ** 0.5)  # [b_s, s]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/modules/tabnet_modules.py", line 201, in forward
    hidden = glu(hidden)  # [bs, s]
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ludwig/modules/activation_modules.py", line 12, in glu
    return nn.functional.glu(x, dim)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1335, in glu
    return torch._C._nn.glu(input, dim)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 13.55 GiB already allocated; 19.75 MiB free; 13.65 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
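
The allocator hint at the end of the message points at fragmentation. As a stop-gap only (not the fix ultimately applied below), PYTORCH_CUDA_ALLOC_CONF can be set before any CUDA memory is allocated; the 128 MiB split size here is purely illustrative and is not a value recommended in this issue:

import os

# Must be set before the first CUDA allocation; "128" is only an example value.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

import torch

if torch.cuda.is_available():
    # Confirm the process sees the GPU that hit the OOM in the traceback above.
    print(torch.cuda.get_device_name(0))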

cc @anneholler for repro script and other details.

amholler commented 2 years ago

Running on a fixed-size 3-node Ray cluster; each instance is a g4dn.4xlarge. Executing this script: https://github.com/ludwig-ai/experiments/blob/main/automl/validation/higgs/run_auto_train_1hr.py The time-based Ray Tune hyperparameter search completes, and the OOM above occurs during the post-search evaluation step.
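
For context, a minimal sketch of the kind of driver that script runs, pieced together from the traceback; the dataset loader, target column, and time budget below are assumptions, not the script's exact contents:

from ludwig.automl import auto_train
from ludwig.datasets import higgs

# Load the Higgs dataset (assumption: this is the loader the script uses).
df = higgs.load()

results = auto_train(
    dataset=df,
    target="label",          # assumption: binary target column name in Higgs
    time_limit_s=3600,       # assumption: 1-hour budget, matching the script name
    tune_for_memory=False,   # this keyword appears in the traceback
)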

Note that the same thing happens for forest cover: https://github.com/ludwig-ai/experiments/blob/main/automl/validation/forest_cover/run_auto_train_1hr.py

amholler commented 2 years ago

Running with tip-of-tree (ToT) master plus this PR: https://github.com/ludwig-ai/ludwig/pull/1638

amholler commented 2 years ago

The latest run shows the problem has been addressed.