microsoft / nni

An open-source AutoML toolkit for automating the machine learning lifecycle, including feature engineering, neural architecture search, model compression, and hyper-parameter tuning.
https://nni.readthedocs.io
MIT License

GPU usage via NNI is different from running programs separately. #5522

Closed · seanswyi closed 1 year ago

seanswyi commented 1 year ago

Describe the issue: I was running a script with trial_gpu_number: 1 and trial_concurrency: 5. I noticed that all of my trials were failing due to CUDA out of memory errors.

However, when I run the same trials separately (i.e., with the same hyperparameters, but simply by running python ./main.py), everything works fine.

Is there something that's using GPU memory that I'm not aware of?
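For context, here is a minimal diagnostic sketch that could go at the top of main.py to see which GPU each trial lands on and how much of its memory is already taken. This assumes a PyTorch trial; log_gpu_state is a hypothetical helper, not part of the original script:

import os

import torch

def log_gpu_state():
    # NNI's local training service sets CUDA_VISIBLE_DEVICES per trial,
    # so this shows which physical GPU(s) the trial was assigned.
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
    if torch.cuda.is_available():
        # Free/total bytes on the current device.
        free, total = torch.cuda.mem_get_info()
        print(f"GPU memory already used: {(total - free) / 1e9:.2f} GB of {total / 1e9:.2f} GB")

log_gpu_state()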

Environment:

Configuration:

experiment_name: resnest_hpo
search_space_file: search_space.json
use_annotation: False

trial_command: bash ./scripts/resnest_nni_hpo.sh
# trial_command: bash ./scripts/resnest_debug.sh
trial_gpu_number: 1
trial_concurrency: 5

max_experiment_duration: 15h
max_trial_number: 500

tuner:
  name: TPE
  class_args:
    optimize_mode: maximize

training_service:
  platform: local
  use_active_gpu: True

Search space (search_space.json):

{
    "lr": {"_type": "choice", "_value": [0.0001, 0.0003, 0.0005, 0.001, 0.003, 0.005]},
    "epochs": {"_type": "choice", "_value": [30, 50, 100, 150]},
    "optim_type": {"_type": "choice", "_value": ["sgd", "adam"]},
    "batch_size": {"_type": "choice", "_value": [32, 64, 128, 256, 512]}
}
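For reference, a rough Python equivalent of the two files above using NNI's Experiment API (a sketch assuming NNI 2.x; trial_code_directory and the port number are illustrative choices, not from the original config):

from nni.experiment import Experiment

search_space = {
    "lr": {"_type": "choice", "_value": [0.0001, 0.0003, 0.0005, 0.001, 0.003, 0.005]},
    "epochs": {"_type": "choice", "_value": [30, 50, 100, 150]},
    "optim_type": {"_type": "choice", "_value": ["sgd", "adam"]},
    "batch_size": {"_type": "choice", "_value": [32, 64, 128, 256, 512]},
}

experiment = Experiment('local')
experiment.config.experiment_name = 'resnest_hpo'
experiment.config.trial_command = 'bash ./scripts/resnest_nni_hpo.sh'
experiment.config.trial_code_directory = '.'  # assumption: run from the project root
experiment.config.search_space = search_space
experiment.config.tuner.name = 'TPE'
experiment.config.tuner.class_args = {'optimize_mode': 'maximize'}
experiment.config.trial_gpu_number = 1
experiment.config.trial_concurrency = 5
experiment.config.max_trial_number = 500
experiment.config.max_experiment_duration = '15h'
experiment.config.training_service.use_active_gpu = True

experiment.run(8080)  # port for the web UI; any free port works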

Log message:

How to reproduce it?:

liuzhe-lz commented 1 year ago

There should be a run.sh or run.ps1 file in nni-experiments/<experiment-id>/trials/<trial-id>. Could you try to run the trial with that script? And by "run the same trials separately", did you run 5 trials concurrently or run them one by one? What's the output of nvidia-smi?
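As a side note, a small sketch for watching per-GPU memory while the five trials run (the nvidia-smi query flags are standard; the wrapper script itself is illustrative):

import subprocess

# One line per GPU: "index, memory.used, memory.total" in MiB.
out = subprocess.run(
    ["nvidia-smi",
     "--query-gpu=index,memory.used,memory.total",
     "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, used, total = (field.strip() for field in line.split(","))
    print(f"GPU {idx}: {used} MiB used of {total} MiB")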

Lijiaoa commented 1 year ago

@seanswyi any updates?

seanswyi commented 1 year ago

I think that there was an error in my script. I deleted the entire thing and tried again and it's working now. Sorry and thanks.