
RuntimeError: Backend nccl does not support allgather_into_tensor_coalesced #137505

Open · zjost opened this issue 2 weeks ago

zjost commented 2 weeks ago

🐛 Describe the bug

I am using torchtune and hit the error in the title whenever it goes to save the model. I created an issue in their repo (https://github.com/pytorch/torchtune/issues/1762), but it looks to me like a PyTorch issue. I've seen this with both 2.4.1+cu124 and the nightly version:

python -c "import torch; print(torch.__version__); print(torch.cuda.nccl.version())"    17:44
2.6.0.dev20241008+cu124
(2, 21, 5)

The following is the command I'm running and the traceback:

TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL NCCL_DEBUG=INFO NCCL_DEBUG_SUBSYS=INIT,ENV,GRAPH,COLL NCCL_SOCKET_IFNAME="eth0,en,eth,em,bond" tune run --nnodes 1 --nproc_per_node 3 lora_finetune_distributed --config ./recipes/mm_llama2_7B_lora.yaml
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/recipes/lora_finetune_distributed.py", line 862, in <module>
[rank0]:     sys.exit(recipe_main())
[rank0]:              ^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torchtune/config/_parse.py", line 99, in wrapper
[rank0]:     sys.exit(recipe_main(conf))
[rank0]:              ^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/recipes/lora_finetune_distributed.py", line 857, in recipe_main
[rank0]:     recipe.train()
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/recipes/lora_finetune_distributed.py", line 823, in train
[rank0]:     self.save_checkpoint(epoch=curr_epoch)
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/recipes/lora_finetune_distributed.py", line 618, in save_checkpoint
[rank0]:     cpu_state_dict = training.get_full_model_state_dict(
[rank0]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torchtune/training/_distributed.py", line 424, in get_full_model_state_dict
[rank0]:     full_param = sharded_param.full_tensor()
[rank0]:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/_tensor/api.py", line 511, in full_tensor
[rank0]:     redist_res = self.redistribute(
[rank0]:                  ^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/_tensor/api.py", line 483, in redistribute
[rank0]:     return Redistribute.apply(self, device_mesh, placements, async_op)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torch/autograd/function.py", line 574, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/_tensor/_redistribute.py", line 282, in forward
[rank0]:     output = redistribute_local_tensor(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/_tensor/_redistribute.py", line 188, in redistribute_local_tensor
[rank0]:     new_local_tensor = current_placement._to_replicate_tensor(
[rank0]:                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/_tensor/placement_types.py", line 234, in _to_replicate_tensor
[rank0]:     result = funcol.all_gather_tensor(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/_functional_collectives.py", line 203, in all_gather_tensor
[rank0]:     tensor = torch.ops._c10d_functional.all_gather_into_tensor(
[rank0]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/zak_jost/lib/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
[rank0]:     return self_._op(*args, **(kwargs or {}))
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: RuntimeError: Backend nccl does not support allgather_into_tensor_coalesced
1|2|Loss: 1.743589997291565: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [13:55<00:00, 417.70s/it]
[I1007 22:53:00.305572040 TCPStoreLibUvBackend.cpp:115] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
[I1007 22:53:00.335351548 TCPStoreLibUvBackend.cpp:115] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
[I1007 22:53:00.370470289 TCPStoreLibUvBackend.cpp:115] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
W1007 22:53:00.790000 139972489500480 torch/distributed/elastic/multiprocessing/api.py:858] Sending process 8908 closing signal SIGTERM
E1007 22:53:00.904000 139972489500480 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 8907) of binary: /home/zak_jost/bin/python3.11
Traceback (most recent call last):
  File "/home/zak_jost/bin/tune", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/zak_jost/lib/python3.11/site-packages/torchtune/_cli/tune.py", line 49, in main
    parser.run(args)
  File "/home/zak_jost/lib/python3.11/site-packages/torchtune/_cli/tune.py", line 43, in run
    args.func(args)
  File "/home/zak_jost/lib/python3.11/site-packages/torchtune/_cli/run.py", line 194, in _run_cmd
    self._run_distributed(args, is_builtin=is_builtin)
  File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/zak_jost/lib/python3.11/site-packages/torchtune/_cli/run.py", line 95, in _run_distributed
    run(args)
  File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/zak_jost/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
/home/zak_jost/lib/python3.11/site-packages/recipes/lora_finetune_distributed.py FAILED
------------------------------------------------------------
Failures:
[1]:
  time      : 2024-10-07_22:53:00
  host      : zak-jost-ray-training
  rank      : 2 (local_rank: 2)
  exitcode  : 1 (pid: 8909)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-10-07_22:53:00
  host      : zak-jost-ray-training
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 8907)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
[I1007 22:53:00.811083325 TCPStoreLibUvBackend.cpp:115] [c10d - debug] Read callback failed. code:-4095 name:EOF desc:end of file
[I1007 22:53:00.811169886 TCPStoreLibUvBackend.cpp:1002] [c10d - debug] Store exit requested

[I1007 22:53:00.811187377 TCPStoreLibUvBackend.cpp:1070] [c10d - debug] UV main loop done: res:1
[I1007 22:53:00.811200287 TCPStoreLibUvBackend.cpp:1076] [c10d - debug] Walking live handles prior to closing clients
[I1007 22:53:00.811211947 TCPStoreLibUvBackend.cpp:1059] [c10d - debug] UV live handle type 12 active:1 is-closing:0
[I1007 22:53:00.811221437 TCPStoreLibUvBackend.cpp:1086] [c10d - debug] Walking live handles after closing clients
[I1007 22:53:00.811232467 TCPStoreLibUvBackend.cpp:1059] [c10d - debug] UV live handle type 12 active:0 is-closing:1
[I1007 22:53:00.811243977 TCPStoreLibUvBackend.cpp:1095] [c10d] uv_loop_close failed with:-16 errn:EBUSY desc:resource busy or locked
[I1007 22:53:00.811275138 TCPStoreLibUvBackend.cpp:1105] [c10d] uv_loop cleanup finished.

Versions

Collecting environment information...
PyTorch version: 2.6.0.dev20241008+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (conda-forge gcc 13.3.0-1) 13.3.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.11.10 | packaged by conda-forge | (main, Sep 30 2024, 18:08:57) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.15.149-99.162.amzn2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.6.68
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA A10G
GPU 1: NVIDIA A10G

Nvidia driver version: 550.54.15
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             48
On-line CPU(s) list:                0-47
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 7R32
CPU family:                         23
Model:                              49
Thread(s) per core:                 2
Core(s) per socket:                 24
Socket(s):                          1
Stepping:                           0
BogoMIPS:                           5599.99
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch topoext ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr rdpru wbnoinvd arat npt nrip_save rdpid
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          768 KiB (24 instances)
L1i cache:                          768 KiB (24 instances)
L2 cache:                           12 MiB (24 instances)
L3 cache:                           96 MiB (6 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-47
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT enabled with STIBP protection
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] galore-torch==1.0
[pip3] numpy==1.26.4
[pip3] pytorch-triton==3.1.0+cf34004b8a
[pip3] torch==2.6.0.dev20241008+cu124
[pip3] torchao==0.5.0
[pip3] torchaudio==2.5.0.dev20241008+cu124
[pip3] torchmetrics==1.4.2
[pip3] torchtune==0.4.0.dev20241008+cpu
[pip3] torchvision==0.20.0.dev20241008+cu124
[pip3] triton==3.0.0
[conda] galore-torch              1.0                pyhd8ed1ab_1    conda-forge
[conda] libmagma                  2.7.2                h173bb3b_2    conda-forge
[conda] libmagma_sparse           2.7.2                h173bb3b_3    conda-forge
[conda] libopenvino-pytorch-frontend 2024.4.0             h5888daf_0    conda-forge
[conda] libtorch                  2.3.1           cuda120_h2b0da52_300    conda-forge
[conda] mkl                       2023.2.0         h84fe81f_50496    conda-forge
[conda] numpy                     1.26.4          py311h64a7726_0    conda-forge
[conda] pytorch-triton            3.1.0+cf34004b8a          pypi_0    pypi
[conda] torch                     2.6.0.dev20241008+cu124          pypi_0    pypi
[conda] torchao                   0.5.0                    pypi_0    pypi
[conda] torchaudio                2.5.0.dev20241008+cu124          pypi_0    pypi
[conda] torchmetrics              1.4.2              pyhd8ed1ab_0    conda-forge
[conda] torchtune                 0.4.0.dev20241008+cpu          pypi_0    pypi
[conda] torchvision               0.20.0.dev20241008+cu124          pypi_0    pypi
[conda] triton                    3.0.0                    pypi_0    pypi

cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o

zjost commented 2 weeks ago

Update: it seems the error goes away if I drop the debugging variables from the launch command and use the nightly versions of both torchtune and PyTorch.

Specifically, this works:

TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config ./recipes/mm_phi3_lora.yaml

And this fails:

TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL NCCL_DEBUG=INFO tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config ./recipes/mm_phi3_lora.yaml

The only difference is that the working command omits TORCH_DISTRIBUTED_DEBUG=DETAIL.

I suppose this means the original error was actually something else, and setting this variable introduces a separate problem.

I'll track the model saving issue in the torchtune issue, but the PyTorch team might be interested in the problems caused by TORCH_DISTRIBUTED_DEBUG=DETAIL, so I'll leave this open.
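
For reference, a minimal standalone sketch of the failing path (the script name, tensor shape, and mesh setup here are illustrative, not taken from torchtune): it shards a tensor as a DTensor and calls full_tensor(), which goes through the same funcol all-gather as the traceback above. If TORCH_DISTRIBUTED_DEBUG=DETAIL really is the trigger, launching this with that variable set should hit the same error:

# repro_full_tensor.py (hypothetical); launch with e.g.:
#   TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node 2 repro_full_tensor.py
import os

import torch
import torch.distributed as dist
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh


def main():
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])
    torch.cuda.set_device(local_rank)

    # 1D device mesh over all ranks; this also sets up the default (NCCL) process group.
    mesh = init_device_mesh("cuda", (world_size,))

    # Shard a parameter-like tensor across ranks on dim 0, as FSDP2/torchtune does.
    param = distribute_tensor(torch.randn(16, 16, device="cuda"), mesh, [Shard(0)])

    # full_tensor() redistributes Shard -> Replicate, which issues the
    # funcol.all_gather_tensor() call seen in the traceback above.
    full = param.full_tensor()

    if dist.get_rank() == 0:
        print("gathered:", full.shape)
    dist.destroy_process_group()


if __name__ == "__main__":
    main()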

awgu commented 2 weeks ago

cc: @yifuwang @H-Huang @kwen2501

Do you know how the funcol all-gather can end up raising this: https://github.com/pytorch/pytorch/blob/b41fc1407258299f7869cbc22ce586e41bea9a39/torch/csrc/distributed/c10d/Backend.hpp#L152-L166

zjost commented 2 weeks ago

Maybe https://github.com/pytorch/pytorch/issues/75011 is related?

awgu commented 2 weeks ago

Oh, good point. I guess using DETAIL enables the PG wrapper, which runs the collectives first through the gloo backend or something, so the error message might be misleading.
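
A quick way to confirm whether DETAIL (and therefore the wrapper behavior) is active in a given process, sketched with torch.distributed's debug-level helpers (get_debug_level / set_debug_level_from_env); this is only a convenience check, not part of the repro:

import torch.distributed as dist

# Apply whatever TORCH_DISTRIBUTED_DEBUG is currently set to in the environment.
dist.set_debug_level_from_env()

# Prints DebugLevel.DETAIL when TORCH_DISTRIBUTED_DEBUG=DETAIL was picked up.
print(dist.get_debug_level())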

awgu commented 1 week ago

Repro:

TORCH_DISTRIBUTED_DEBUG=DETAIL pytest test/distributed/test_c10d_functional_native.py -k test_all_gather_into_tensor_coalesced