Sometimes also like this:
...
ep 28 train, step 56, ctc_4 2.616, ctc_8 2.268, ctc 2.221, num_seqs 8, max_size:time 278344, max_size:out-spatial 67, mem_usage:cuda:0 6.3GB, 0.658 sec/step
ep 28 train, step 56, ctc_4 2.049, ctc_8 1.746, ctc 1.688, num_seqs 8, max_size:time 276496, max_size:out-spatial 62, mem_usage:cuda:2 6.3GB, 0.678 sec/step
ep 28 train, step 57, ctc_4 2.239, ctc_8 1.990, ctc 1.961, num_seqs 8, max_size:time 278959, max_size:out-spatial 61, mem_usage:cuda:0 6.3GB, 0.653 sec/step
ep 28 train, step 57, ctc_4 2.137, ctc_8 1.780, ctc 1.708, num_seqs 8, max_size:time 280104, max_size:out-spatial 60, mem_usage:cuda:3 6.3GB, 0.674 sec/step
ep 28 train, step 57, ctc_4 2.338, ctc_8 1.937, ctc 1.926, num_seqs 9, max_size:time 252480, max_size:out-spatial 55, mem_usage:cuda:1 6.3GB, 0.693 sec/step
ep 28 train, step 57, ctc_4 3.121, ctc_8 2.822, ctc 2.807, num_seqs 8, max_size:time 276760, max_size:out-spatial 64, mem_usage:cuda:2 6.3GB, 0.675 sec/step
ep 28 train, step 58, ctc_4 2.397, ctc_8 2.037, ctc 1.967, num_seqs 9, max_size:time 255120, max_size:out-spatial 65, mem_usage:cuda:3 6.3GB, 0.631 sec/step
ep 28 train, step 58, ctc_4 2.598, ctc_8 2.242, ctc 2.165, num_seqs 8, max_size:time 279224, max_size:out-spatial 56, mem_usage:cuda:0 6.3GB, 0.657 sec/step
ep 28 train, step 58, ctc_4 2.433, ctc_8 2.155, ctc 2.129, num_seqs 10, max_size:time 228024, max_size:out-spatial 63, mem_usage:cuda:1 6.3GB, 0.628 sec/step
MEMORY: sub proc TDL worker 0(5599) increased RSS: rss=524.3MB pss=372.6MB uss=356.5MB shared=167.8MB
MEMORY: sub proc TDL worker 0(5603) increased RSS: rss=454.3MB pss=302.6MB uss=286.5MB shared=167.7MB
MEMORY: sub proc TDL worker 0(5600) increased RSS: rss=523.1MB pss=371.6MB uss=355.5MB shared=167.6MB
MEMORY: total (main 3853, 2024-06-28, 17:46:24, 21 procs): pss=6.3GB uss=6.0GB
MEMORY: total (main 3850, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.4GB
MEMORY: total (main 3851, 2024-06-28, 17:46:24, 21 procs): pss=6.7GB uss=6.3GB
MEMORY: sub proc TDL worker 0(5602) increased RSS: rss=542.4MB pss=390.7MB uss=374.6MB shared=167.7MB
MEMORY: total (main 3852, 2024-06-28, 17:46:24, 21 procs): pss=6.4GB uss=6.1GB
RuntimeError: CUDA error: an illegal instruction was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Unhandled exception <class 'RuntimeError'> in thread <_MainThread(MainThread, started 130292506959872)>, proc 3852.
...
Send signal SIGINT to pid 4123/'train worker proc 4/4'
Send signal SIGINT to pid 4119/'train worker proc 3/4'
Send signal SIGINT to pid 5063/'devtrain worker proc 1/4'
Send signal SIGINT to pid 5064/'devtrain worker proc 2/4'
Send signal SIGINT to pid 5065/'devtrain worker proc 3/4'
Send signal SIGINT to pid 5066/'devtrain worker proc 4/4'
Send signal SIGINT to pid 5602/'NonDaemonicSpawnProcess-15'
Send signal SIGINT to pid 4604/'dev worker proc 2/4'
Send signal SIGINT to pid 4611/'dev worker proc 4/4'
Send signal SIGINT to pid 4607/'dev worker proc 3/4'
Send signal SIGINT to pid 4601/'dev worker proc 1/4'
Send signal SIGINT to pid 4114/'train worker proc 1/4'
[2024-06-28 17:46:56,408] INFO: Run time: 0:03:16 CPU: 1.00% RSS: 21.22GB VMS: 733.07GB
And then hanging.
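As a side note on the debugging hints in the error message above: `CUDA_LAUNCH_BLOCKING=1` only helps if it is set before the CUDA context is created. A minimal sketch (not taken from this setup; the variable can just as well be exported in the shell or sbatch script before starting RETURNN):

```python
# Sketch: force synchronous CUDA kernel launches for debugging, so the Python
# stack trace points at the call that actually triggered the error.
# Must be set before the first CUDA call (safest: before importing torch).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # noqa: E402  # imported after setting the env var on purpose

x = torch.randn(4, 4, device="cuda")  # example op; with blocking launches,
y = (x @ x).sum()                      # a kernel error would surface right
print(y.item())                        # at the offending call, not later
```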
Procs:
zeyer@cn-252 ~ % ps a --forest -u $(whoami) -o pid,comm
PID COMMAND
6791 sshd
6792 \_ zsh
6810 \_ ps
3790 slurm_script
3804 \_ python3.11
3832 \_ python3.11
3850 \_ python3.11
3989 | \_ python3.11
3995 | \_ watch memory
4110 | \_ MPD worker 0
4111 | \_ MPD worker 1
4115 | \_ MPD worker 2
4121 | \_ MPD worker 3
4589 | \_ python3.11
4600 | \_ MPD worker 0
4603 | \_ MPD worker 1
4608 | \_ MPD worker 2
4612 | \_ MPD worker 3
5057 | \_ MPD worker 0
5059 | \_ MPD worker 1
5061 | \_ MPD worker 2
5062 | \_ MPD worker 3
5603 | \_ TDL worker 0
5841 | \_ MPD worker 0
5944 | \_ MPD worker 1
6053 | \_ MPD worker 2
6159 | \_ MPD worker 3
3851 \_ python3.11
3991 | \_ python3.11
3993 | \_ watch memory
4112 | \_ MPD worker 0
4116 | \_ MPD worker 1
4120 | \_ MPD worker 2
4124 | \_ MPD worker 3
4577 | \_ python3.11
4602 | \_ MPD worker 0
4606 | \_ MPD worker 1
4609 | \_ MPD worker 2
4614 | \_ MPD worker 3
5051 | \_ MPD worker 0
5053 | \_ MPD worker 1
5055 | \_ MPD worker 2
5056 | \_ MPD worker 3
5600 | \_ TDL worker 0
5842 | \_ MPD worker 0
5947 | \_ MPD worker 1
6055 | \_ MPD worker 2
6163 | \_ MPD worker 3
3852 \_ python3.11 <defunct>
3853 \_ python3.11
3988 \_ python3.11
3992 \_ watch memory
4113 \_ MPD worker 0
4118 \_ MPD worker 1
4122 \_ MPD worker 2
4125 \_ MPD worker 3
4583 \_ python3.11
4599 \_ MPD worker 0
4605 \_ MPD worker 1
4610 \_ MPD worker 2
4613 \_ MPD worker 3
5052 \_ MPD worker 0
5054 \_ MPD worker 1
5058 \_ MPD worker 2
5060 \_ MPD worker 3
5599 \_ TDL worker 0
5840 \_ MPD worker 0
5945 \_ MPD worker 1
6049 \_ MPD worker 2
6157 \_ MPD worker 3
Those procs just hang. E.g. py-spy:
% py-spy dump -p 3850
Process 3850: /work/tools/users/zeyer/py-envs/py3.11-torch2.1/bin/python3.11 -u /u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py /u/zeyer/setups/combined/2021-05-31/work/i6_core/returnn/training/ReturnnTrainingJob.ns6wGzNHZ8zI/output/returnn.config
Python v3.11.2 (/work/tools/users/zeyer/linuxbrew/Cellar/python@3.11/3.11.2_1/bin/python3.11)
^C
Distributed training, single node, 4 GPUs.
And then it hangs.
And stack trace:
dmesg:
It might be just a GPU issue.
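Since py-spy does not get a stack trace out of the hanging procs here, a possible in-process fallback (just a generic sketch, not something from this setup) is the stdlib `faulthandler` module: it dumps all thread stacks from a C-level signal handler, so it usually still works when the process is stuck inside a native (CUDA/NCCL) call:

```python
# Sketch: dump Python stack traces of all threads when the process receives
# SIGUSR1 (trigger from outside with `kill -USR1 <pid>`). faulthandler writes
# the frames from a C-level signal handler, so this usually works even when
# the process appears to hang inside native code.
import faulthandler
import signal
import sys

faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)

# Optionally also dump automatically if the process is still alive after a
# long timeout (repeat=True re-arms the timer after each dump):
faulthandler.dump_traceback_later(timeout=600, repeat=True, exit=False)
```

This would of course have to be installed in the training procs before the hang occurs.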
Maybe related:
#314, although that was with Horovod.
#1520, some error (not a timeout, as here), but then also a hang.
#1496, some error, but then also a hang. Although that hang was then attributed to another problem (https://github.com/rwth-i6/returnn/issues/1497).