mlcommons / algorithmic-efficiency

MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models.
https://mlcommons.org/en/groups/research-algorithms/
Apache License 2.0

LibriSpeech Conformer Workload OOMs or NCCL Errors When Run With Multiple Trials #663

Open hjmshi opened 4 months ago

hjmshi commented 4 months ago

We consistently observe an OOM error when running one of the NAdamW baselines on LibriSpeech Conformer with multiple trials in PyTorch on 8 V100s with 16 GB each. This run uses the external tuning ruleset. The first trial runs through successfully, but any subsequent trial OOMs.

If we instead try to resume a multi-trial run, we observe an NCCL error. This occurs even if we delete the trial_2 folder (while leaving the trial_1 folder intact).

Description

As described above, we observe an OOM when running LibriSpeech Conformer with the NAdamW baseline with multiple trials on 8 V100s with 16 GB each. Here is an example of the OOM we observe on a subsequent trial:

I0229 11:25:52.179392 139670264227648 submission_runner.py:314] Starting training loop.
I0229 11:25:54.777722 139642405861120 logging_writer.py:48] [0] global_step=0, grad_norm=56.284924, loss=31.354658
I0229 11:25:54.783623 139670264227648 pytorch_nadamw_full_budget.py:296] 0) loss = 31.355, grad_norm = 56.285
I0229 11:25:55.234757 139670264227648 spec.py:321] Evaluating on the training split.
I0229 11:26:08.470401 139670264227648 spec.py:333] Evaluating on the validation split.
I0229 11:26:20.220777 139670264227648 spec.py:349] Evaluating on the test split.
I0229 11:26:26.181465 139670264227648 submission_runner.py:414] Time since start: 34.00s,       Step: 1,        {'train/ctc_loss': 30.998378480849826, 'train/wer': 1.253173088944333, 'validation/ctc_loss': 29.436887863844042, 'validation/wer': 1.157427702409115, 'validation/num_examples': 5348, 'test/ctc_loss': 29.538330745441247, 'test/wer': 1.1866837283935572, 'test/num_examples': 2472, 'score': 2.605729579925537, 'total_duration': 34.00228404998779, 'accumulated_submission_time': 2.605729579925537, 'accumulated_eval_time': 30.946528434753418, 'accumulated_logging_time': 0}
I0229 11:26:26.201718 139642405861120 logging_writer.py:48] [1] accumulated_eval_time=30.946528, accumulated_logging_time=0, accumulated_submission_time=2.605730, global_step=1, preemption_count=0, score=2.605730, test/ctc_loss=29.538331, test/num_examples=2472, test/wer=1.186684, total_duration=34.002284, train/ctc_loss=30.998378, train/wer=1.253173, validation/ctc_loss=29.436888, validation/num_examples=5348, validation/wer=1.157428
I0229 11:26:26.931042 139670264227648 checkpoint_utils.py:240] Saved checkpoint to /experiment_runs/nadamw_test/librispeech_conformer_pytorch/trial_4/checkpoint_1.
I0229 11:26:28.782347 139642397468416 logging_writer.py:48] [1] global_step=1, grad_norm=60.090302, loss=31.114534                                
I0229 11:26:28.785351 139670264227648 pytorch_nadamw_full_budget.py:296] 1) loss = 31.115, grad_norm = 60.090                                     
I0229 11:26:30.031821 139642405861120 logging_writer.py:48] [2] global_step=2, grad_norm=72.279572, loss=30.439383                                
I0229 11:26:30.035584 139670264227648 pytorch_nadamw_full_budget.py:296] 2) loss = 30.439, grad_norm = 72.280                                     
I0229 11:26:30.954263 139642397468416 logging_writer.py:48] [3] global_step=3, grad_norm=108.894348, loss=29.346493                               
I0229 11:26:30.957295 139670264227648 pytorch_nadamw_full_budget.py:296] 3) loss = 29.346, grad_norm = 108.894                                    
Traceback (most recent call last):                                                                                                                
  File "submission_runner.py", line 697, in <module>                                                                                              
    app.run(main)                                                                                                                                 
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "submission_runner.py", line 665, in main
    score = score_submission_on_workload(
  File "submission_runner.py", line 576, in score_submission_on_workload
    timing, metrics = train_once(workload, workload_name,
  File "submission_runner.py", line 336, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/algorithmic-efficiency/submissions/baseline_submission/pytorch_nadamw_full_budget.py", line 276, in update_params
    loss.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.51 GiB. GPU 4 has a total capacty of 15.77 GiB of which 865.31 MiB is free. Process 443130 has 14.92 GiB memory in use. Of the allocated memory 6.25 GiB is allocated by PyTorch, and 7.69 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
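
The allocator hint at the end of the traceback refers to the PYTORCH_CUDA_ALLOC_CONF environment variable. For reference, it could be set before CUDA is first initialized, e.g. as below; the value is only illustrative and we have not verified that it avoids the OOM:

import os

# Must be set before the first CUDA allocation in the process
# (e.g., near the top of submission_runner.py). The value is illustrative.
os.environ.setdefault('PYTORCH_CUDA_ALLOC_CONF', 'max_split_size_mb:128')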

Alternatively, if we try to resume the multi-trial run, we observe the following NCCL error:

I0225 07:05:33.506379 140182089410368 submission_runner.py:589] Timing: 61068.13836145401                        
I0225 07:05:33.506450 140182089410368 submission_runner.py:591] Total number of evals: 47                        
I0225 07:05:33.506514 140182089410368 submission_runner.py:592] ====================                             
I0225 07:05:33.506577 140182089410368 submission_runner.py:545] Using RNG seed 1817859550                        
I0225 07:05:33.507572 140182089410368 submission_runner.py:554] --- Tuning run 2/5 ---                           
I0225 07:05:33.507675 140182089410368 submission_runner.py:559] Creating tuning directory at /experiment_runs/nadamw_baseline/librispeech_conformer_pytorch/trial_2.
I0225 07:05:33.507940 140182089410368 logger_utils.py:92] Saving hparams to /experiment_runs/nadamw_baseline/librispeech_conformer_pytorch/trial_2/hparams.json.
I0225 07:05:33.508602 140182089410368 submission_runner.py:206] Initializing dataset.
I0225 07:05:33.508736 140182089410368 input_pipeline.py:20] Loading split = train-clean-100
I0225 07:05:33.534054 140182089410368 input_pipeline.py:20] Loading split = train-clean-360
I0225 07:05:33.866440 140182089410368 input_pipeline.py:20] Loading split = train-other-500
I0225 07:05:34.242223 140182089410368 submission_runner.py:213] Initializing model.
[E ProcessGroupNCCL.cpp:474] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800075 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800070 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800128 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800597 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800573 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800614 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800128 milliseconds before timing out.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=5, OpType=BROADCAST, NumelIn=4918, NumelOut=4918, Timeout(ms)=1800000) ran for 1800128 milliseconds before timing out.
Fatal Python error: Aborted 

cc @anana10c @mikerabbat @tsunghsienlee @yuchenhao @shintaro-iwasaki

Steps to Reproduce

In the Docker container, run:

torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 \
    --standalone \
    --nnodes=1 \
    --nproc_per_node=8 \
    submission_runner.py \
    --framework=pytorch \
    --data_dir=/data/librispeech/ \
    --workload=librispeech_conformer \
    --experiment_dir=/experiment_runs \
    --experiment_name=librispeech_conformer_baseline \
    --submission_path=reference_algorithms/paper_baselines/nadamw/pytorch/submission.py \
    --tuning_search_space=reference_algorithms/paper_baselines/nadamw/tuning_search_space.json \
    --librispeech_tokenizer_vocab_path=/data/librispeech/spm_model.vocab \
    --num_tuning_trials=5

Source or Possible Fix

We are not aware of a possible fix for this issue. We suspect there may be a memory leak in the PyTorch LibriSpeech Conformer workload. Please let us know how to proceed. Thanks in advance!
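
One way we could check for carried-over allocator state between trials is to log the CUDA memory statistics at each trial boundary. Below is a minimal sketch using standard PyTorch APIs; the helper name and call site are ours for illustration, not part of the repo:

import logging

import torch


def log_cuda_memory(tag):
  """Log allocated/reserved GPU memory per device, e.g. right before each tuning trial starts."""
  for device in range(torch.cuda.device_count()):
    allocated_gib = torch.cuda.memory_allocated(device) / 2**30
    reserved_gib = torch.cuda.memory_reserved(device) / 2**30
    logging.info('%s cuda:%d allocated=%.2f GiB reserved=%.2f GiB',
                 tag, device, allocated_gib, reserved_gib)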

priyakasimbeg commented 4 months ago

What is the version of the GPU driver? Can you print the output of nvidia-smi in your environment?

We had a similar issue a few months ago https://github.com/mlcommons/algorithmic-efficiency/issues/497

hjmshi commented 4 months ago

Our error is deterministically reproducible.

Here is the nvidia-smi output in the Docker container:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.14              Driver Version: 550.54.14      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla V100-SXM2-16GB           Off |   00000000:00:17.0 Off |                    0 |
| N/A   65C    P0            296W /  300W |    7213MiB /  16384MiB |     81%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla V100-SXM2-16GB           Off |   00000000:00:18.0 Off |                    0 |
| N/A   54C    P0            237W /  300W |    9027MiB /  16384MiB |     82%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla V100-SXM2-16GB           Off |   00000000:00:19.0 Off |                    0 |
| N/A   53C    P0            146W /  300W |    8951MiB /  16384MiB |     95%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla V100-SXM2-16GB           Off |   00000000:00:1A.0 Off |                    0 |
| N/A   64C    P0            103W /  300W |    8921MiB /  16384MiB |     88%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  Tesla V100-SXM2-16GB           Off |   00000000:00:1B.0 Off |                    0 |
| N/A   58C    P0             79W /  300W |    8991MiB /  16384MiB |     79%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  Tesla V100-SXM2-16GB           Off |   00000000:00:1C.0 Off |                    0 |
| N/A   57C    P0             80W /  300W |    9007MiB /  16384MiB |     73%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  Tesla V100-SXM2-16GB           Off |   00000000:00:1D.0 Off |                    0 |
| N/A   52C    P0             73W /  300W |    9091MiB /  16384MiB |     63%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  Tesla V100-SXM2-16GB           Off |   00000000:00:1E.0 Off |                    0 |
| N/A   60C    P0             82W /  300W |    9053MiB /  16384MiB |     93%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+

priyakasimbeg commented 4 months ago

Ok I see, sorry I did not catch that it was consistently OOMing.

We can try clearing the CUDA cache between trials. Could you try adding a _reset_cuda_mem() call at line 542 of submission_runner.py (https://github.com/mlcommons/algorithmic-efficiency/blob/main/submission_runner.py#L542) and let me know if that resolves the issue?
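
For context, the idea is just to release whatever the allocator is still holding from the previous trial before the next one builds its model and optimizer. A minimal sketch of what such a helper typically does (not necessarily the exact code in the repo):

import gc

import torch


def _reset_cuda_mem():
  """Release GPU memory held over from the previous tuning trial (sketch only)."""
  gc.collect()  # drop lingering Python references first
  if torch.cuda.is_available():
    torch.cuda.empty_cache()  # return cached allocator blocks to the driver
    torch.cuda.reset_peak_memory_stats()  # restart peak-memory accounting for the new trial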

@chandramouli-sastry @pomonam @msaroufim do you have any other ideas?

anana10c commented 4 months ago

Thank you for the suggestion! We've confirmed that adding _reset_cuda_mem() solves this issue. However, we are now seeing that trial 5 consistently OOMs, even when run on its own (and regardless of whether _reset_cuda_mem() is added). Here's what the error looks like:

I0301 22:57:47.301353 140438874752832 submission_runner.py:314] Starting training loop.
I0301 22:57:51.364732 140412236850944 logging_writer.py:48] [0] global_step=0, grad_norm=35.902340, loss=31.931053                                                                                            
I0301 22:57:51.381769 140438874752832 pytorch_nadamw_full_budget.py:296] 0) loss = 31.931, grad_norm = 35.902                                                                                                 
I0301 22:57:51.775843 140438874752832 spec.py:321] Evaluating on the training split.                                                                                                                          
I0301 22:57:51.776754 140438874752832 input_pipeline.py:20] Loading split = train-clean-100                                                                                                                   
I0301 22:57:51.803838 140438874752832 input_pipeline.py:20] Loading split = train-clean-360                                                                                                                   
I0301 22:57:51.906127 140438874752832 input_pipeline.py:20] Loading split = train-other-500                                                                                                                   
I0301 22:58:06.571050 140438874752832 spec.py:333] Evaluating on the validation split.                                                                                                                        
I0301 22:58:06.572633 140438874752832 input_pipeline.py:20] Loading split = dev-clean                                                                                                                         
I0301 22:58:06.576136 140438874752832 input_pipeline.py:20] Loading split = dev-other                                                                                                                         
I0301 22:58:17.383088 140438874752832 spec.py:349] Evaluating on the test split.                                                                                                                              
I0301 22:58:17.384322 140438874752832 input_pipeline.py:20] Loading split = test-clean                                                                                                                        
I0301 22:58:22.858831 140438874752832 submission_runner.py:414] Time since start: 35.56s,       Step: 1,        {'train/ctc_loss': 31.21902909067492, 'train/wer': 2.106045848546352, 'validation/ctc_loss': 30.200058752570424, 'validation/wer': 1.494539661082412, 'validation/num_examples': 5348, 'test/ctc_loss': 30.30062109410089, 'test/wer': 1.5860296955294213, 'test/num_examples': 2472, 'score': 4.081728935241699, 'total_duration': 35.55763244628906, 'accumulated_submission_time': 4.081728935241699, 'accumulated_eval_time': 31.082834243774414, 'accumulated_logging_time': 0}
I0301 22:58:22.872175 140410752063232 logging_writer.py:48] [1] accumulated_eval_time=31.082834, accumulated_logging_time=0, accumulated_submission_time=4.081729, global_step=1, preemption_count=0, score=4.081729, test/ctc_loss=30.300621, test/num_examples=2472, test/wer=1.586030, total_duration=35.557632, train/ctc_loss=31.219029, train/wer=2.106046, validation/ctc_loss=30.200059, validation/num_examples=5348, validation/wer=1.494540
I0301 22:58:23.482473 140438874752832 checkpoint_utils.py:240] Saved checkpoint to /experiment_runs/shampoo/tmp/librispeech_conformer_pytorch/trial_5/checkpoint_1.                                           
I0301 22:58:25.199719 140410533967616 logging_writer.py:48] [1] global_step=1, grad_norm=35.834045, loss=31.748371                                                                                            
I0301 22:58:25.202767 140438874752832 pytorch_nadamw_full_budget.py:296] 1) loss = 31.748, grad_norm = 35.834                                                                                                 
I0301 22:58:26.362531 140410752063232 logging_writer.py:48] [2] global_step=2, grad_norm=43.194149, loss=31.490442
I0301 22:58:26.365588 140438874752832 pytorch_nadamw_full_budget.py:296] 2) loss = 31.490, grad_norm = 43.194
Traceback (most recent call last):
  File "submission_runner.py", line 698, in <module>
    app.run(main)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.8/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "submission_runner.py", line 666, in main
    score = score_submission_on_workload(
  File "submission_runner.py", line 577, in score_submission_on_workload
    timing, metrics = train_once(workload, workload_name,
  File "submission_runner.py", line 336, in train_once
    optimizer_state, model_params, model_state = update_params(
  File "/algorithmic-efficiency/submissions/baseline_submission/pytorch_nadamw_full_budget.py", line 276, in update_params
    loss.backward()
  File "/usr/local/lib/python3.8/dist-packages/torch/_tensor.py", line 492, in backward
    torch.autograd.backward(
  File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.51 GiB. GPU 4 has a total capacty of 15.77 GiB of which 767.31 MiB is free. Process 610974 has 15.01 GiB memory in use. Of the allocated memory 6.26 GiB is allocated by PyTorch, and 7.79 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

This trial corresponds to the following hyperparameter set:

{
  "dropout_rate": 0.1,
  "label_smoothing": 0.0,
  "learning_rate": 0.0017486387539278373,
  "one_minus_beta1": 0.06733926164,
  "beta2": 0.9955159689799007,
  "weight_decay": 0.08121616522670176,
  "warmup_factor": 0.02
}

We observed that the run no longer OOMs when dropout_rate is set to 0.0, and likewise when the batch size is halved. However, we haven't been able to run the baseline with this set of hyperparameters as-is.

priyakasimbeg commented 4 months ago

Ok, thanks for the update. I will add a fix to our repo to reset the CUDA memory between trials. We will try to reproduce and investigate the dropout issue on our end.

Also, I assume you're running these to produce some baseline logs? I checked in logs on our dev branch under prize_qualification_baselines/external_tuning/logs. These were produced with JAX, FYI, which is probably why we haven't noticed this issue. I hope this unblocks you for the time being.

anana10c commented 4 months ago

Thank you so much! Yes, we wanted to sanity-check the baseline on our setup. Thank you for pointing us towards the logs; they will indeed be helpful :)

priyakasimbeg commented 4 months ago

It seems like @pomonam was unable to reproduce this issue with dropout. We probably have a fix, though: could you modify https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/workloads/librispeech_conformer/librispeech_pytorch/models.py to pass inplace=True in the nn.Dropout initializations? This should reduce the memory footprint of the dropout layers. If this fixes things for you, we'll make this change as well.
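
Concretely, the change amounts to constructing the dropout layers with the inplace flag; a minimal illustration (not the exact diff, and the dropout_rate value is just a placeholder):

import torch
from torch import nn

dropout_rate = 0.1  # illustrative value; the workload reads this from its hparams

# Current behaviour: nn.Dropout allocates a fresh output tensor on every forward pass.
dropout = nn.Dropout(p=dropout_rate)

# Proposed change: apply the dropout mask to the input tensor in place,
# avoiding the extra activation-sized allocation per dropout layer.
dropout_inplace = nn.Dropout(p=dropout_rate, inplace=True)

x = torch.randn(4, 8)
y = dropout_inplace(x)  # x itself is overwritten; y aliases the same storage

Note that in-place dropout is only safe where the pre-dropout tensor is not needed again elsewhere (for example by a residual branch), so it is worth sanity-checking training curves after the change.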

priyakasimbeg commented 4 months ago

From offline discussion with @anana10c and team, it sounds like setting inplace=True resolved the OOM issue. @pomonam, could you send over a PR with:

  1. _reset_cuda_mem() addition at the beginning of each trial.
  2. inplace=True in the PyTorch nn.Dropout layer initializations.

anana10c commented 4 months ago

Sorry for the delayed response. Unfortunately, we've found that while inplace=True lets the trial 5 run continue for longer, it still ends up OOMing at around iteration 3600.