Sometimes also like this:
...
ep 28 train, step 56, ctc_4 2.616, ctc_8 2.268, ctc 2.221, num_seqs 8, max_size:time 278344, max_size:out-spatial 67, mem_usage:cuda:0 6.3GB, 0.658 sec/step
ep 28 train, step 56, ctc_4 2.049, ctc_8 1.746, ctc 1.688, num_seqs 8, max_size:time 276496, max_size:out-spatial 62, mem_usage:cuda:2 6.3GB, 0.678 sec/step
ep 28 train, step 57, ctc_4 2.239, ctc_8 1.990, ctc 1.961, num_seqs 8, max_size:time 278959, max_size:out-spatial 61, mem_usage:cuda:0 6.3GB, 0.653 sec/step
ep 28 train, step 57, ctc_4 2.137, ctc_8 1.780, ctc 1.708, num_seqs 8, max_size:time 280104, max_size:out-spatial 60, mem_usage:cuda:3 6.3GB, 0.674 sec/step
ep 28 train, step 57, ctc_4 2.338, ctc_8 1.937, ctc 1.926, num_seqs 9, max_size:time 252480, max_size:out-spatial 55, mem_usage:cuda:1 6.3GB, 0.693 sec/step
ep 28 train, step 57, ctc_4 3.121, ctc_8 2.822, ctc 2.807, num_seqs 8, max_size:time 276760, max_size:out-spatial 64, mem_usage:cuda:2 6.3GB, 0.675 sec/step
ep 28 train, step 58, ctc_4 2.397, ctc_8 2.037, ctc 1.967, num_seqs 9, max_size:time 255120, max_size:out-spatial 65, mem_usage:cuda:3 6.3GB, 0.631 sec/step
ep 28 train, step 58, ctc_4 2.598, ctc_8 2.242, ctc 2.165, num_seqs 8, max_size:time 279224, max_size:out-spatial 56, mem_usage:cuda:0 6.3GB, 0.657 sec/step
ep 28 train, step 58, ctc_4 2.433, ctc_8 2.155, ctc 2.129, num_seqs 10, max_size:time 228024, max_size:out-spatial 63, mem_usage:cuda:1 6.3GB, 0.628 sec/step
MEMORY: sub proc TDL worker 0(5599) increased RSS: rss=524.3MB pss=372.6MB uss=356.5MB shared=167.8MB
MEMORY: sub proc TDL worker 0(5603) increased RSS: rss=454.3MB pss=302.6MB uss=286.5MB shared=167.7MB
MEMORY: sub proc TDL worker 0(5600) increased RSS: rss=523.1MB pss=371.6MB uss=355.5MB shared=167.6MB
MEMORY: total (main 3853, 2024-06-28, 17:46:24, 21 procs): pss=6.3GB uss=6.0GB
MEMORY: total (main 3850, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.4GB
MEMORY: total (main 3851, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.3GB
MEMORY: sub proc TDL worker 0(5602) increased RSS: rss=542.4MB pss=390.7MB uss=374.6MB shared=167.7MB
MEMORY: total (main 3852, 2024-06-28, 17:46:24, 21 procs): pss=6.4GB uss=6.1GB
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 130292506959872)>, proc 3852.
...
Send signal SIGINT to pid 4123/'train worker proc 4/4'
Send signal SIGINT to pid 4119/'train worker proc 3/4'
Send signal SIGINT to pid 5063/'devtrain worker proc 1/4'
Send signal SIGINT to pid 5064/'devtrain worker proc 2/4'
Send signal SIGINT to pid 5065/'devtrain worker proc 3/4'
Send signal SIGINT to pid 5066/'devtrain worker proc 4/4'
Send signal SIGINT to pid 5602/'NonDaemonicSpawnProcess-15'
Send signal SIGINT to pid 4604/'dev worker proc 2/4'
Send signal SIGINT to pid 4611/'dev worker proc 4/4'
Send signal SIGINT to pid 4607/'dev worker proc 3/4'
Send signal SIGINT to pid 4601/'dev worker proc 1/4'
Send signal SIGINT to pid 4114/'train worker proc 1/4'
[2024-06-28 17:46:56,408] INFO: Run time: 0:03:16 CPU: 1.00% RSS: 21.22GB VMS: 733.07GB
And then hanging.
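As a side note on the debugging hints in the error message above: `CUDA_LAUNCH_BLOCKING=1` only helps if it is set before the CUDA context is created. A minimal sketch (not taken from this setup; the variable can just as well be exported in the shell or sbatch script before starting RETURNN):

```python
# Sketch: force synchronous CUDA kernel launches for debugging, so the Python
# stack trace points at the call that actually triggered the error.
# Must be set before the first CUDA call (safest: before importing torch).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  # imported after setting the env var on purpose

x = torch.randn(4, 4, device="cuda")  # example op; with blocking launches,
y = (x @ x).sum()                      # a kernel error would surface right
print(y.item())                        # at the offending call, not later
```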
Procs:
zeyer@cn-252 ~ % ps a --forest -u $(whoami) -o pid,comm
PID COMMAND
6791 sshd
6792 \_ zsh
6810 \_ ps
3790 slurm_script
3804 \_ python3.11
3832 \_ python3.11
3850 \_ python3.11
3989 | \_ python3.11
3995 | \_ watch memory
4110 | \_ MPD worker 0
4111 | \_ MPD worker 1
4115 | \_ MPD worker 2
4121 | \_ MPD worker 3
4589 | \_ python3.11
4600 | \_ MPD worker 0
4603 | \_ MPD worker 1
4608 | \_ MPD worker 2
4612 | \_ MPD worker 3
5057 | \_ MPD worker 0
5059 | \_ MPD worker 1
5061 | \_ MPD worker 2
5062 | \_ MPD worker 3
5603 | \_ TDL worker 0
5841 | \_ MPD worker 0
5944 | \_ MPD worker 1
6053 | \_ MPD worker 2
6159 | \_ MPD worker 3
3851 \_ python3.11
3991 | \_ python3.11
3993 | \_ watch memory
4112 | \_ MPD worker 0
4116 | \_ MPD worker 1
4120 | \_ MPD worker 2
4124 | \_ MPD worker 3
4577 | \_ python3.11
4602 | \_ MPD worker 0
4606 | \_ MPD worker 1
4609 | \_ MPD worker 2
4614 | \_ MPD worker 3
5051 | \_ MPD worker 0
5053 | \_ MPD worker 1
5055 | \_ MPD worker 2
5056 | \_ MPD worker 3
5600 | \_ TDL worker 0
5842 | \_ MPD worker 0
5947 | \_ MPD worker 1
6055 | \_ MPD worker 2
6163 | \_ MPD worker 3
3852 \_ python3.11 <defunct>
3853 \_ python3.11
3988 \_ python3.11
3992 \_ watch memory
4113 \_ MPD worker 0
4118 \_ MPD worker 1
4122 \_ MPD worker 2
4125 \_ MPD worker 3
4583 \_ python3.11
4599 \_ MPD worker 0
4605 \_ MPD worker 1
4610 \_ MPD worker 2
4613 \_ MPD worker 3
5052 \_ MPD worker 0
5054 \_ MPD worker 1
5058 \_ MPD worker 2
5060 \_ MPD worker 3
5599 \_ TDL worker 0
5840 \_ MPD worker 0
5945 \_ MPD worker 1
6049 \_ MPD worker 2
6157 \_ MPD worker 3
Those procs just hang. E.g. py-spy:
% py-spy dump -p 3850
Process 3850: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.ns6wGzNHZ8zI/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/bin/python3.11)
^C
Distributed training, single node, 4 GPUs.
And then it hangs.
And stack trace:
dmesg:
It might be just a GPU issue.
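Since py-spy does not get a stack trace out of the hanging procs here, a possible in-process fallback (just a generic sketch, not something from this setup) is the stdlib `faulthandler` module: it dumps all thread stacks from a C-level signal handler, so it usually still works when the process is stuck inside a native (CUDA/NCCL) call:

```python
# Sketch: dump Python stack traces of all threads when the process receives
# SIGUSR1 (trigger from outside with `kill -USR1 <pid>`). faulthandler writes
# the frames from a C-level signal handler, so this usually works even when
# the process appears to hang inside native code.
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Optionally also dump automatically if the process is still alive after a
# long timeout (repeat=True re-arms the timer after each dump):
faulthandler.dump_traceback_later(timeout=600, repeat=True, exit=False)
```

This would of course have to be installed in the training procs before the hang occurs.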
Maybe related:
#314, although that was with Horovod.
#1520, some error (not a timeout, as here), but then also a hang.
#1496, some error, but then also a hang. Although that hang was then attributed to another problem (https://github.com/rwth-i6/returnn/issues/1497).