albertz opened 10 months ago
This is very deterministic: when I restart, I get exactly the same crash at the same point, also on other nodes.
I get the same problem also with the Gloo backend, i.e. also CUDA OOM, although then it crashes in a different way, with an abort.
...
ep 1 train, step 97, acc 0.004, loss 8.624, loss_att 8.769, loss_ctc 8.285, total 8.624, mem_usage:cuda:2 8.8GB, 0.855 sec/step
ep 1 train, step 98, acc 0.007, loss 9.071, loss_att 9.377, loss_ctc 8.356, total 9.071, mem_usage:cuda:1 8.6GB, 0.797 sec/step
ep 1 train, step 98, acc 0.004, loss 8.664, loss_att 8.846, loss_ctc 8.239, total 8.664, mem_usage:cuda:3 8.5GB, 0.801 sec/step
ep 1 train, step 98, acc 0.005, loss 8.674, loss_att 8.856, loss_ctc 8.248, total 8.674, mem_usage:cuda:0 8.9GB, 0.892 sec/step
ep 1 train, step 98, acc 0.003, loss 8.459, loss_att 8.575, loss_ctc 8.190, total 8.459, mem_usage:cuda:2 8.8GB, 0.834 sec/step
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fda36535617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7fda364f098d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7fda365f09f8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d104 (0x7fda365c0104 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x4bc384a (0x7fd9e5be384a in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x559d0a8 (0x7fd9e65bd0a8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::ProcessGroupGloo::AsyncWork::execute(c10::intrusive_ptr<c10d::ProcessGroupGloo::AsyncWork, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupGloo::AsyncWork> >) + 0x3b (0x7fd9e65cbf8b in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupGloo::runLoop(int) + 0xe9 (0x7fd9e65cc099 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xdba24 (0x7fda369dda24 in /work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6)
frame #9: <unknown function> + 0x8523e (0x7fda6157023e in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #10: <unknown function> + 0x10617c (0x7fda615f117c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
Fatal Python error: Aborted
Thread 0x00007fda4f220640 (most recent call first):
<no Python frame>
Thread 0x00007fd90f6ae640 (most recent call first):
<no Python frame>
Thread 0x00007fd90cead640 (most recent call first):
<no Python frame>
Thread 0x00007fd911eaf640 (most recent call first):
<no Python frame>
Thread 0x00007fd9006ac640 (most recent call first):
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 320 in wait
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 975 in run
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 995 in _bootstrap
Thread 0x00007fda614ea000 (most recent call first):
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2055 in all_reduce
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 160 in _sync_params_avg
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 99 in step_after_param_update
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 389 in train_epoch
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 239 in train
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 465 in execute_main_task
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 659 in main
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._c
...
Signal handler: signal 6:
/var/tmp/zeyer/returnn_native/native_signal_handler/476dd6f1a7/native_signal_handler.so(signal_handler+0x4b)[0x7fda2a87320b]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fda61527f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fda61571e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fda61527ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x3cf40)[0x7fda61527f40]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x86e6f)[0x7fda61571e6f]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(raise+0x12)[0x7fda61527ea2]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(abort+0xc2)[0x7fda6151345c]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xa586a)[0x7fda369a786a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb107a)[0x7fda369b307a]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb10e5)[0x7fda369b30e5]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xb1338)[0x7fda369b3338]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so(_ZN3c106detail14torchCheckFailEPKcS2_jRKSs+0x94)[0x7fda364f09bd]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(_ZN3c104cuda29c10_cuda_check_implementationEiPKcS2_ib+0x118)[0x7fda365f09f8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so(+0x1d104)[0x7fda365c0104]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x4bc384a)[0x7fd9e5be384a]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(+0x559d0a8)[0x7fd9e65bd0a8]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZN4c10d16ProcessGroupGloo9AsyncWork7executeEN3c1013intrusive_ptrIS1_NS2_6detail34intrusive_target_default_null_typeIS1_EEEE+0x3b)[0x7fd9e65cbf8b]
/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so(_ZN4c10d16ProcessGroupGloo7runLoopEi+0xe9)[0x7fd9e65cc099]
/work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6(+0xdba24)[0x7fda369dda24]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x8523e)[0x7fda6157023e]
/work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6(+0x10617c)[0x7fda615f117c]
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f5418512617 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f54184cd98d in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x118 (0x7f54185cd9f8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0x1d104 (0x7f541859d104 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libc10_cuda.so)
frame #4: <unknown function> + 0x4bc384a (0x7f53d37e384a in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #5: <unknown function> + 0x559d0a8 (0x7f53d41bd0a8 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #6: c10d::ProcessGroupGloo::AsyncWork::execute(c10::intrusive_ptr<c10d::ProcessGroupGloo::AsyncWork, c10::detail::intrusive_target_default_null_type<c10d::ProcessGroupGloo::AsyncWork> >) + 0x3b (0x7f53d41cbf8b in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #7: c10d::ProcessGroupGloo::runLoop(int) + 0xe9 (0x7f53d41cc099 in /work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0xdba24 (0x7f54244b8a24 in /work/tools/users/zeyer/linuxbrew/lib/gcc/11/libstdc++.so.6)
frame #9: <unknown function> + 0x8523e (0x7f544f05123e in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
frame #10: <unknown function> + 0x10617c (0x7f544f0d217c in /work/tools/users/zeyer/linuxbrew/opt/glibc/lib/libc.so.6)
Fatal Python error: Aborted
Thread 0x00007f543add4640 (most recent call first):
<no Python frame>
Thread 0x00007f52fd7af640 (most recent call first):
<no Python frame>
Thread 0x00007f52f87ad640 (most recent call first):
<no Python frame>
Thread 0x00007f52fafae640 (most recent call first):
<no Python frame>
Thread 0x00007f52f5fac640 (most recent call first):
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 320 in wait
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/multiprocessing/queues.py", line 231 in _feed
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 975 in run
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 1038 in _bootstrap_inner
File "/work/tools/users/zeyer/linuxbrew/opt/python@3.11/lib/python3.11/threading.py", line 995 in _bootstrap
Thread 0x00007f544efcb000 (most recent call first):
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/distributed_c10d.py", line 2055 in all_reduce
File "/work/tools/users/zeyer/py-envs/py3.11-torch2.1/lib/python3.11/site-packages/torch/distributed/c10d_logger.py", line 47 in wrapper
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 160 in _sync_params_avg
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/distributed.py", line 99 in step_after_param_update
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 389 in train_epoch
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/torch/engine.py", line 239 in train
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 465 in execute_main_task
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/returnn/__main__.py", line 659 in main
File "/u/zeyer/setups/combined/2021-05-31/tools/returnn/rnn.py", line 11 in <module>
Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._c
...
In this case, as you see, all the workers crash in the same way.
Again, this is very deterministic: when I restart, I get exactly the same crash at the same point, also on other nodes.
I realized that this run uses "torch_distributed": {"reduce_type": "param", "param_sync_step": 100}, and it had not yet printed the log output for the current step, which is step 99, so this is exactly the first step where it performs the param sync.
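For reference, the relevant config part looks roughly like this (just a minimal sketch of that one option, everything else in the config unchanged):

```python
# Minimal sketch of the relevant RETURNN config entry.
# With reduce_type "param", gradients are not all-reduced every step; instead the
# parameters themselves are synchronized (averaged) across the workers every
# param_sync_step steps, which here is exactly what happens after step 99.
torch_distributed = {
    "reduce_type": "param",
    "param_sync_step": 100,
}
```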
One workaround is the newly introduced torch_distributed option sync_on_cpu=True, which first moves all params to CPU, then does the sync there (which would use Gloo on CPU), then moves them back to GPU.
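Conceptually, as I understand it, the CPU-side sync does something like the following (a simplified sketch, not the actual RETURNN code; the gloo_group handle here is hypothetical):

```python
import torch
import torch.distributed as dist


def sync_params_avg_on_cpu(model: torch.nn.Module, gloo_group) -> None:
    """Average all parameters across workers via a CPU/Gloo all_reduce (sketch)."""
    world_size = dist.get_world_size(group=gloo_group)
    with torch.no_grad():
        for param in model.parameters():
            cpu_tensor = param.detach().cpu()  # move to host memory
            dist.all_reduce(cpu_tensor, op=dist.ReduceOp.SUM, group=gloo_group)
            cpu_tensor /= world_size  # average
            param.copy_(cpu_tensor.to(param.device))  # copy back to the GPU
```

So in this sketch, there is never more than one temporary extra copy of a parameter on the GPU at a time.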
But why does this work? What does NCCL/Gloo do differently when the param is on GPU? This is a GeForce GTX 1080, so there is no NVLink. So I was assuming it would internally move the tensor to CPU anyway, do the allreduce on CPU, and then move it back to GPU. But probably not? Maybe it copies all params to CPU, sends them over the network to all workers, then copies each worker's copy of the param to GPU, so it holds num_workers copies of the param in memory, and then does the reduce (AVG or SUM) on GPU? This might explain it. But I was assuming that all_reduce is somewhat more clever, maybe doing it hierarchically or so, i.e. not using this naive logic, which is not the most efficient and takes that much memory.
Note, the 1080 has 10.9GB of memory, and the parameters alone take only 615.9MB.
The all_reduce is in blocking mode (just the default), and we do this separately for each parameter. The biggest parameter is probably the embedding (512 x 100025), although that is not where it crashes. In any case, even if we had 4 copies of such a big parameter in memory, there should still be way more than enough memory available, so this does not really explain it.
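For scale, a quick back-of-the-envelope calculation (assuming float32 parameters; only the 615.9MB and 10.9GB figures are taken from above):

```python
# Back-of-the-envelope memory estimate, assuming float32 parameters.
bytes_per_param = 4
emb_params = 512 * 100025                    # the biggest single parameter (embedding)
emb_gb = emb_params * bytes_per_param / 1e9  # ~0.20 GB
print(f"embedding: {emb_gb:.2f} GB, 4 copies of it: {4 * emb_gb:.2f} GB")
# All parameters together: ~0.62 GB (615.9 MB), on a GPU with 10.9 GB total.
```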
I also asked in the forums: https://discuss.pytorch.org/t/cuda-oom-in-distributed-training-without-nvlink/194704
Note, in https://github.com/pytorch/pytorch/issues/116177 (and https://github.com/NVIDIA/nccl/issues/1197), there was the hint to use NCCL_NVLS_ENABLE=0 as another workaround for this. (I have not tried this yet.)
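If trying that, note that NCCL reads the variable when it initializes its communicator, so it has to be set before that happens (with PyTorch's lazy NCCL init, before the first collective), or simply exported in the shell before launching the job:

```python
import os

# Must be set before NCCL initializes its communicator
# (with lazy init, that is at the first collective call).
os.environ["NCCL_NVLS_ENABLE"] = "0"
```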
With more NCCL debug info:
...
ep 13 train, step 97, ctc_4 4.683, ctc_8 4.594, ctc 4.674, aed_ce 5.543, aed_fer 0.817, num_seqs 9, max_size:time 241561, max_size:out-spatial 63, mem_usage:cuda:3 8.8GB, 0.888 sec/step
ep 13 train, step 98, ctc_4 4.578, ctc_8 4.435, ctc 4.499, aed_ce 5.339, aed_fer 0.823, num_seqs 12, max_size:time 156520, max_size:out-spatial 53, mem_usage:cuda:0 8.9GB, 0.812 sec/step
ep 13 train, step 98, ctc_4 4.089, ctc_8 3.892, ctc 3.930, aed_ce 5.001, aed_fer 0.776, num_seqs 10, max_size:time 220720, max_size:out-spatial 55, mem_usage:cuda:1 9.0GB, 0.882 sec/step
ep 13 train, step 98, ctc_4 4.404, ctc_8 4.170, ctc 4.206, aed_ce 5.589, aed_fer 0.831, num_seqs 9, max_size:time 243673, max_size:out-spatial 47, mem_usage:cuda:2 9.0GB, 0.872 sec/step
ep 13 train, step 98, ctc_4 3.731, ctc_8 3.489, ctc 3.528, aed_ce 5.102, aed_fer 0.806, num_seqs 9, max_size:time 241561, max_size:out-spatial 53, mem_usage:cuda:3 8.8GB, 0.872 sec/step
cn-241:3260080:3260080 [0] NCCL INFO Bootstrap : Using enp5s0:10.6.9.41<0>
cn-241:3260080:3260080 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
cn-241:3260080:3260080 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
cn-241:3260080:3260080 [0] NCCL INFO cudaDriverVersion 12010
NCCL version 2.18.1+cuda12.1
cn-241:3260080:3262931 [0] NCCL INFO NET/IB : No device found.
cn-241:3260080:3262931 [0] NCCL INFO NET/Socket : Using [0]enp5s0:10.6.9.41<0>
cn-241:3260080:3262931 [0] NCCL INFO Using network Socket
DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:219, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.18.1
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'out of memory'
I think we can handle that a bit better. I think NCCL can be initialized such that it reserves the needed memory in advance. From https://github.com/pytorch/pytorch/issues/116177#issuecomment-2343822534:
One thing I recommend is to eagerly initialize nccl and then check the free GPU memory before doing a collective. To eagerly initialize nccl, simply pass device_id=torch.device("cuda:0") or whatever device index you want, to torch.distributed.init_process_group(). When doing this, nccl initialization will happen during that API call, and then NCCL should not consume additional memory on the first allreduce call.
But passing device_id is only possible in newer PyTorch versions.
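A sketch of how that eager init plus memory check could look (assuming a PyTorch version that supports device_id, and a torchrun-style launch that sets LOCAL_RANK):

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # assumption: torchrun-style launch
device = torch.device("cuda", local_rank)
torch.cuda.set_device(device)

# Passing device_id triggers eager NCCL initialization inside init_process_group,
# so the NCCL workspace is allocated here, not lazily at the first all_reduce.
dist.init_process_group(backend="nccl", device_id=device)

# Check free GPU memory before doing any collective.
free_b, total_b = torch.cuda.mem_get_info(device)
print(f"free GPU memory: {free_b / 1e9:.2f} GB of {total_b / 1e9:.2f} GB")
```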
I also read that this is dependent on the NCCL version. Newer NCCL versions might require less memory: https://github.com/NVIDIA/nccl/issues/1197#issuecomment-1980391319:
NCCL 2.21 will reduce the NVLS memory usage significantly as we've found that NVLS memory usage was a problem for codes which were already close to using all memory. It will still use more memory than with NVLS disabled though; we're working on reducing memory usage even further in NCCL 2.22.
We might need to redesign the way we handle distributed computing in Torch a bit. Currently we do a single dist.init_process_group(backend=None) call to initialize a global process group. I think we maybe want to create explicit process groups for CPU and CUDA.
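For illustration, that could look roughly like this (a sketch of the idea, not existing RETURNN code):

```python
import torch
import torch.distributed as dist

# Global default group; backend=None lets PyTorch pick Gloo for CPU and NCCL for CUDA.
dist.init_process_group(backend=None)

# Explicit groups over all ranks, one per backend, so we can decide per collective
# whether it runs via Gloo on CPU or via NCCL on GPU.
cpu_group = dist.new_group(backend="gloo")
cuda_group = dist.new_group(backend="nccl")

# Example: average a (hypothetical) CUDA tensor on CPU via the Gloo group.
param = torch.zeros(4, device="cuda")
cpu_tensor = param.detach().cpu()
dist.all_reduce(cpu_tensor, group=cpu_group)
param.copy_((cpu_tensor / dist.get_world_size()).to(param.device))
```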
Note that RuntimeError: CUDA error: out of memory is not the usual OutOfMemoryError exception (which also provides some stats on reserved memory etc.), but comes from torch distributed and unfortunately lacks further stats. It's a bit strange because, looking at the training log before the OOM, it uses around 7.4GB (allocated, so a bit more reserved), and from the initial log, all the device memory seems to be available?
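To get at least some numbers despite that, one could log both the allocator stats and the driver-level free memory right before the param sync (a sketch; where exactly to put it, e.g. in _sync_params_avg, is left open):

```python
import torch


def log_cuda_mem(device: torch.device) -> None:
    """Log allocator stats plus driver-level free memory, e.g. right before a collective."""
    free_b, total_b = torch.cuda.mem_get_info(device)
    print(
        f"allocated: {torch.cuda.memory_allocated(device) / 1e9:.2f} GB, "
        f"reserved: {torch.cuda.memory_reserved(device) / 1e9:.2f} GB, "
        f"free (driver): {free_b / 1e9:.2f} GB of {total_b / 1e9:.2f} GB"
    )
```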