Closed by tgaddair 2 years ago
Running on a fixed-size 3-node Ray cluster; each instance is a g4dn.4xlarge. Executing this script: https://github.com/ludwig-ai/experiments/blob/main/automl/validation/higgs/run_auto_train_1hr.py The time-based Ray Tune hyperparameter search completes, and the above OOM then occurs during the post-search evaluation step.
Note that the same thing happens for forest cover: https://github.com/ludwig-ai/experiments/blob/main/automl/validation/forest_cover/run_auto_train_1hr.py
Running with ToT master plus this PR: https://github.com/ludwig-ai/ludwig/pull/1638
The latest run shows the problem has been addressed.
This occurs when using AutoML on the Higgs dataset with PyTorch:
cc @anneholler for repro script and other details.