Barrier tries to allocate 8 MB that my run can't afford:
File "/home/ubuntu/carlos/torchtitan/train.py", line 451, in main
utils.set_pg_timeouts(
File "/home/ubuntu/carlos/torchtitan/torchtitan/utils.py", line 75, in set_pg_timeouts
torch.distributed.barrier(device_ids=[torch.cuda.current_device()])
File "/home/ubuntu/.pyenv/versions/titan/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 83, in wrapper
return func(*args, **kwargs)
File "/home/ubuntu/.pyenv/versions/titan/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 4376, in barrier
work = group.barrier(opts=opts)
torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/NCCLUtils.hpp:340, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.21.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Failed to CUDA calloc 8388608 bytes
Since the script only runs this code after the first step, it should not be too expensive to call `torch.cuda.empty_cache()` right before the barrier.

My run already sets `export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"`.
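For concreteness, a minimal sketch of what I mean (the wrapper name and placement are my own; the actual change would go right before the existing barrier call in `set_pg_timeouts` in torchtitan/utils.py):

```python
import torch
import torch.distributed as dist


def barrier_with_cache_flush() -> None:
    # Return cached-but-unused allocator segments to the driver so the small
    # CUDA allocation the NCCL barrier needs has headroom.
    # Sketch only: in torchtitan this would sit just before the existing
    # torch.distributed.barrier call in set_pg_timeouts.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    dist.barrier(device_ids=[torch.cuda.current_device()])
```

`empty_cache()` only releases segments the allocator is no longer using; it does not free live tensors, so calling it once after the first step should be cheap.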
This seems ok, but I am worried there might be some other underlying issue. I am surprised that you cannot afford 8 MB. Is that in line with what you expect / see in your GPU memory profiling?
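If it helps to answer that, here is a quick way to compare what the caching allocator holds against what the driver reports right before the barrier (the helper name and output format are just a suggestion, not existing torchtitan code):

```python
import torch


def log_cuda_memory(tag: str) -> None:
    dev = torch.cuda.current_device()
    free_b, total_b = torch.cuda.mem_get_info(dev)  # free/total as seen by the driver
    allocated = torch.cuda.memory_allocated(dev)    # bytes held by live tensors
    reserved = torch.cuda.memory_reserved(dev)      # bytes held by the caching allocator
    print(
        f"[{tag}] driver free {free_b / 2**20:.1f} / {total_b / 2**20:.1f} MiB | "
        f"allocated {allocated / 2**20:.1f} MiB | reserved {reserved / 2**20:.1f} MiB"
    )
```

A large gap between reserved and allocated would point at allocator caching/fragmentation rather than the model genuinely using all of the device memory.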