Open albertz opened 6 months ago
This is very deterministic: when I restart, I get exactly the same crash at the same step, also on other nodes.
I get the same problem with the Gloo backend, i.e. also a CUDA OOM, although it then crashes in a different way, with an abort.
...
ep 1 train, step 97, acc 0.004, loss 8.624, loss_att 8.769, loss_ctc 8.285, total 8.624, mem_usage:cuda:2 8.8GB, 0.855 sec/step
ep 1 train, step 98, acc 0.007, loss 9.071, loss_att 9.377, loss_ctc 8.356, total 9.071, mem_usage:cuda:1 8.6GB, 0.797 sec/step
ep 1 train, step 98, acc 0.004, loss 8.664, loss_att 8.846, loss_ctc 8.239, total 8.664, mem_usage:cuda:3 8.5GB, 0.801 sec/step
ep 1 train, step 98, acc 0.005, loss 8.674, loss_att 8.856, loss_ctc 8.248, total 8.674, mem_usage:cuda:0 8.9GB, 0.892 sec/step
ep 1 train, step 98, acc 0.003, loss 8.459, loss_att 8.575, loss_ctc 8.190, total 8.459, mem_usage:cuda:2 8.8GB, 0.834 sec/step
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fda36535617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fda364f098d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fda365f09f8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d104 (0x7fda365c0104 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x4bc384a (0x7fd9e5be384a in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtor
ch_cpu.so)
frame #5: <unknown function> + 0x559d0a8 (0x7fd9e65bd0a8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtor
ch_cpu.so)
frame #6: c10d::ProcessGroupGloo::AsyncWork::execute(c10::intrusive_ptr<c10d::ProcessGroupGloo::AsyncWork, c10::detail::intrusive_target_default_null_typ
e<c10d::ProcessGroupGloo::AsyncWork> >) + 0x3b (0x7fd9e65cbf8b in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/
libtorch_cpu.so)
frame #7: c10d::ProcessGroupGloo::runLoop(int) + 0xe9 (0x7fd9e65cc099 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/tor
ch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xdba24 (0x7fda369dda24 in /work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6)
frame #9: <unknown function> + 0x8523e (0x7fda6157023e in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #10: <unknown function> + 0x10617c (0x7fda615f117c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
Fatal Python error: Aborted
Thread 0x00007fda4f220640 (most recent call first):
<no Python frame>
Thread 0x00007fd90f6ae640 (most recent call first):
<no Python frame>
Thread 0x00007fd90cead640 (most recent call first):
<no Python frame>
Thread 0x00007fd911eaf640 (most recent call first):
<no Python frame>
Thread 0x00007fd9006ac640 (most recent call first):
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 320 in wait
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 975 in run
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 995 in _bootstrap
Thread 0x00007fda614ea000 (most recent call first):
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2055 in all_reduce
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 160 in _sync_params_avg
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 99 in step_after_param_update
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 389 in train_epoch
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 239 in train
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 465 in execute_main_task
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 659 in main
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._c
...
Signal handler: signal 6:
/var/tmp/zeyer/returnn_native/native_signal_handler/476dd6f1a7/native_signal_handler.so(signal_handler+0x4b)[0x7fda2a87320b]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fda61527f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fda61571e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fda61527ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fda61527f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fda61571e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fda61527ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(abort+0xc2)[0x7fda6151345c]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xa586a)[0x7fda369a786a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb107a)[0x7fda369b307a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb10e5)[0x7fda369b30e5]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb1338)[0x7fda369b3338]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c106detail14torchCheckFailEPKcS2_jRKSs+0x94)[0x7fda3
64f09bd]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(_ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_
ib+0x118)[0x7fda365f09f8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x1d104)[0x7fda365c0104]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x4bc384a)[0x7fd9e5be384a]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x559d0a8)[0x7fd9e65bd0a8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZN4c10d16ProcessGroupGloo9AsyncWork7executeEN3c10
13intrusive_ptrIS1_NS2_6detail34intrusive_target_default_null_typeIS1_EEEE+0x3b)[0x7fd9e65cbf8b]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZN4c10d16ProcessGroupGloo7runLoopEi+0xe9)[0x7fd9e
65cc099]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xdba24)[0x7fda369dda24]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x8523e)[0x7fda6157023e]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x10617c)[0x7fda615f117c]
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5418512617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/si
te-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f54184cd98d in /work/tools/users/zeyer/py-en
vs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f54185cd9f8 in /work/tools/users/zeyer/py-envs/p
y3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d104 (0x7f541859d104 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_c
uda.so)
frame #4: <unknown function> + 0x4bc384a (0x7f53d37e384a in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtor
ch_cpu.so)
frame #5: <unknown function> + 0x559d0a8 (0x7f53d41bd0a8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtor
ch_cpu.so)
frame #6: c10d::ProcessGroupGloo::AsyncWork::execute(c10::intrusive_ptr<c10d::ProcessGroupGloo::AsyncWork, c10::detail::intrusive_target_default_null_typ
e<c10d::ProcessGroupGloo::AsyncWork> >) + 0x3b (0x7f53d41cbf8b in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/
libtorch_cpu.so)
frame #7: c10d::ProcessGroupGloo::runLoop(int) + 0xe9 (0x7f53d41cc099 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/tor
ch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xdba24 (0x7f54244b8a24 in /work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6)
frame #9: <unknown function> + 0x8523e (0x7f544f05123e in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #10: <unknown function> + 0x10617c (0x7f544f0d217c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
Fatal Python error: Aborted
Thread 0x00007f543add4640 (most recent call first):
<no Python frame>
Thread 0x00007f52fd7af640 (most recent call first):
<no Python frame>
Thread 0x00007f52f87ad640 (most recent call first):
<no Python frame>
Thread 0x00007f52fafae640 (most recent call first):
<no Python frame>
Thread 0x00007f52f5fac640 (most recent call first):
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 320 in wait
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 975 in run
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 995 in _bootstrap
Thread 0x00007f544efcb000 (most recent call first):
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2055 in all_reduce
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 160 in _sync_params_avg
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 99 in step_after_param_update
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 389 in train_epoch
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 239 in train
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 465 in execute_main_task
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 659 in main
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._c
...
In this case, as you see, all the workers crash in the same way.
This is very deterministic: when I restart, I get exactly the same crash at the same step, also on other nodes.
I realized that this is using `"torch_distributed": {"reduce_type": "param", "param_sync_step": 100}`, and it had not yet printed the log output for the current step, which is step 99, so this is exactly the first step where it performs the param sync.
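To make the timing concrete, here is a minimal sketch of the step schedule. It assumes that step indices are 0-based and that the sync happens on the last step of each window of `param_sync_step` steps. The function name `is_param_sync_step` is hypothetical and not taken from the RETURNN code.

```python
def is_param_sync_step(step_idx: int, param_sync_step: int) -> bool:
    """Assumed rule: sync on the last step of every window (0-based step index)."""
    return step_idx % param_sync_step == param_sync_step - 1


# First step at which a sync would happen with param_sync_step=100.
first_sync = next(i for i in range(1000) if is_param_sync_step(i, 100))
print(first_sync)  # 99
```

Under this assumption, steps 0 to 98 never touch `all_reduce`, which matches the log: the last printed step is 98, and the crash happens inside the first sync.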
One workaround is using the newly introduced `torch_distributed` `sync_on_cpu=True` option, which first moves all params to CPU, then does the sync (which would use Gloo on CPU), then moves them back to GPU.
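As a config fragment, the workaround might look like the following. This is only a sketch: it assumes that `sync_on_cpu` is a key of the same `torch_distributed` dict as the other two options quoted above.

```python
# RETURNN config fragment (the config file is plain Python).
torch_distributed = {
    "reduce_type": "param",    # average the parameters instead of the gradients
    "param_sync_step": 100,    # sync every 100 steps
    "sync_on_cpu": True,       # assumption: the new option lives in this dict
}
```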
But why does this work? What does NCCL/Gloo do differently when the param is on GPU? This is a GeForce GTX 1080, so there is no NVLink. So I was assuming that it would internally move the param to CPU anyway, then do the allreduce on CPU, and then move it back to GPU. But probably not? Maybe it copies all params to CPU, then sends them over the network to all workers, then copies each worker's copy of the param to GPU, so it has num_workers times the param in memory, and then does the reduce (AVG or SUM) on GPU? This might explain it. But I was assuming that `all_reduce` is somewhat more clever, maybe does it hierarchically or so, i.e. does not use this naive logic, which is not the most efficient and takes so much memory.
Note that the 1080 has 10.9GB of memory, and the parameters alone take only 615.9MB.
The `all_reduce` is in blocking mode (just the default), and we do this separately for each parameter. The biggest parameter might be the embedding (512 x 100025), although that is not where it crashes. In any case, even if we had 4 copies of such a big parameter in memory, there should still be more than enough memory available, so this does not really explain it.
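A rough back-of-the-envelope check of that estimate, as a sketch. It assumes float32 parameters (4 bytes per element), 4 workers (as in the log, `cuda:0` to `cuda:3`), and decimal units (1 MB = 10^6 bytes). The headroom is taken as the device total minus the highest `mem_usage` value in the log above.

```python
BYTES_PER_ELEM = 4  # assumption: float32 parameters
NUM_WORKERS = 4     # cuda:0 .. cuda:3 in the log

embedding_elems = 512 * 100025
embedding_mb = embedding_elems * BYTES_PER_ELEM / 1e6

# Naive scheme: every worker holds one copy of the parameter per worker.
naive_copies_mb = NUM_WORKERS * embedding_mb

device_total_mb = 10900  # 10.9GB
peak_logged_mb = 8900    # highest mem_usage in the log (cuda:0, 8.9GB)
headroom_mb = device_total_mb - peak_logged_mb

print(f"embedding: {embedding_mb:.1f} MB")
print(f"{NUM_WORKERS} copies: {naive_copies_mb:.1f} MB")
print(f"headroom: {headroom_mb} MB")
```

So even the naive scheme would need only about 0.8GB for the largest parameter, well below the roughly 2GB that should still be free according to the log.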
I also asked in the forums: https://discuss.pytorch.org/t/cuda-oom-in-distributed-training-without-nvlink/194704
Note that `RuntimeError: CUDA error: out of memory` is not the usual `OutOfMemoryError` exception (which also provides some stats on reserved memory etc.), but this comes from torch distributed and unfortunately lacks further stats. It's a bit strange, because looking at the training log before the OOM, it uses around 7.4GB (allocated, so a bit more reserved), and from the initial log, all the device memory seems to be available?
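Since this error carries no memory stats, one option is to collect them manually right before the `all_reduce` call. The following is a minimal sketch: the helper names `summarize_cuda_memory` and `report_cuda_memory` are hypothetical, and it uses the PyTorch functions `torch.cuda.mem_get_info`, `torch.cuda.memory_allocated` and `torch.cuda.memory_reserved`. `torch` is imported lazily so that the pure arithmetic part can be used without a GPU.

```python
def summarize_cuda_memory(free_bytes, total_bytes, allocated_bytes, reserved_bytes):
    """Turn raw byte counts into GiB figures, including two derived values."""
    gib = 1024 ** 3
    return {
        "total_gib": total_bytes / gib,
        "free_gib": free_bytes / gib,
        "allocated_gib": allocated_bytes / gib,
        "reserved_gib": reserved_bytes / gib,
        # Held by PyTorch's caching allocator but not used by tensors right now.
        "cached_unused_gib": (reserved_bytes - allocated_bytes) / gib,
        # Not held by the caching allocator of this process: CUDA context,
        # other processes, or allocations made by other libraries.
        "outside_allocator_gib": (total_bytes - free_bytes - reserved_bytes) / gib,
    }


def report_cuda_memory(device=None):
    """Query the current device state; requires PyTorch with CUDA."""
    import torch  # lazy import, only needed for the actual query

    free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    return summarize_cuda_memory(
        free_bytes,
        total_bytes,
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
    )
```

Calling `print(report_cuda_memory())` on each worker just before the sync would show whether the device is really as empty as the training log suggests, and how much memory sits outside the caching allocator at that point.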