microsoft / nni

An open-source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

GPU usage via NNI is different from running programs separately. #5522

Closed · seanswyi closed 1 year ago

seanswyi commented 1 year ago

Describe the issue: I was running a script with trial_gpu_number: 1 and trial_concurrency: 5. I noticed that all of my trials were failing due to CUDA out of memory errors.

However, when I run the same trials separately (i.e., with the same hyperparameters, but simply by running python ./main.py), everything works fine.

Is there something that's using GPU memory that I'm not aware of?
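For context, here is a minimal diagnostic sketch that could go at the top of main.py to see which GPU each trial lands on and how much of its memory is already taken. This assumes a PyTorch trial; log_gpu_state is a hypothetical helper, not part of the original script:

import os

import torch

def log_gpu_state():
    # NNI's local training service sets CUDA_VISIBLE_DEVICES per trial,
    # so this shows which physical GPU(s) the trial was assigned.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    if torch.cuda.is_available():
        # Free/total bytes on the current device.
        free, total = torch.cuda.mem_get_info()
        print(f"GPU memory already used: {(total - free) / 1e9:.2f} GB of {total / 1e9:.2f} GB")

log_gpu_state()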

Environment:

Configuration:

experiment_name: resnest_hpo
search_space_file: search_space.json
use_annotation: False

trial_command: bash ./scripts/resnest_nni_hpo.sh
# trial_command: bash ./scripts/resnest_debug.sh
trial_gpu_number: 1
trial_concurrency: 5

max_experiment_duration: 15h
max_trial_number: 500

tuner:
  name: TPE
  class_args:
    optimize_mode: maximize

training_service:
  platform: local
  use_active_gpu: True

Search space (search_space.json):

{
    "lr": {"_type": "choice", "_value": [0.0001, 0.0003, 0.0005, 0.001, 0.003, 0.005]},
    "epochs": {"_type": "choice", "_value": [30, 50, 100, 150]},
    "optim_type": {"_type": "choice", "_value": ["sgd", "adam"]},
    "batch_size": {"_type": "choice", "_value": [32, 64, 128, 256, 512]}
}
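For reference, a rough Python equivalent of the two files above using NNI's Experiment API (a sketch assuming NNI 2.x; trial_code_directory and the port number are illustrative choices, not from the original config):

from nni.experiment import Experiment

search_space = {
    "lr": {"_type": "choice", "_value": [0.0001, 0.0003, 0.0005, 0.001, 0.003, 0.005]},
    "epochs": {"_type": "choice", "_value": [30, 50, 100, 150]},
    "optim_type": {"_type": "choice", "_value": ["sgd", "adam"]},
    "batch_size": {"_type": "choice", "_value": [32, 64, 128, 256, 512]},
}

experiment = Experiment('local')
experiment.config.experiment_name = 'resnest_hpo'
experiment.config.trial_command = 'bash ./scripts/resnest_nni_hpo.sh'
experiment.config.trial_code_directory = '.'  # assumption: run from the project root
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args = {'optimize_mode': 'maximize'}
experiment.config.trial_gpu_number = 1
experiment.config.trial_concurrency = 5
experiment.config.max_trial_number = 500
experiment.config.max_experiment_duration = '15h'
experiment.config.training_service.use_active_gpu = True

experiment.run(8080)  # port for the web UI; any free port works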

Log message:

How to reproduce it?:

liuzhe-lz commented 1 year ago

There should be a run.sh or run.ps1 file in nni-experiments/<experiment-id>/trials/<trial-id>. Could you try to run the trial with that script? And by "run the same trials separately", did you run 5 trials concurrently or run them one by one? What's the output of nvidia-smi?
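As a side note, a small sketch for watching per-GPU memory while the five trials run (the nvidia-smi query flags are standard; the wrapper script itself is illustrative):

import subprocess

# One line per GPU: "index, memory.used, memory.total" in MiB.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, used, total = (field.strip() for field in line.split(","))
    print(f"GPU {idx}: {used} MiB used of {total} MiB")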

Lijiaoa commented 1 year ago

@seanswyi any updates?

seanswyi commented 1 year ago

I think that there was an error in my script. I deleted the entire thing and tried again and it's working now. Sorry and thanks.