zjost opened this issue 1 month ago.
Update: It seems the error goes away if I drop the debugging variables from the launch command and use the nightly versions of both torchtune and PyTorch.
Specifically, this works:
TORCH_CPP_LOG_LEVEL=INFO NCCL_DEBUG=INFO tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config ./recipes/mm_phi3_lora.yaml
And this fails:
TORCH_CPP_LOG_LEVEL=INFO TORCH_DISTRIBUTED_DEBUG=DETAIL NCCL_DEBUG=INFO tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config ./recipes/mm_phi3_lora.yaml
The only difference is that the working command removes TORCH_DISTRIBUTED_DEBUG=DETAIL.
I suppose this means that the real error was something different, and adding this variable caused a different problem.
I'll track the model-saving issue in the torchtune issue, but the PyTorch team might be interested in the problems caused by TORCH_DISTRIBUTED_DEBUG=DETAIL, so I'll leave this open.
cc: @yifuwang @H-Huang @kwen2501
Do you know how a funcol all-gather can end up raising this error: https://github.com/pytorch/pytorch/blob/b41fc1407258299f7869cbc22ce586e41bea9a39/torch/csrc/distributed/c10d/Backend.hpp#L152-L166
Maybe https://github.com/pytorch/pytorch/issues/75011 is related?
Oh, good point. I guess using DETAIL uses the PG wrapper, which runs the collectives first using the gloo backend or something, so the error message might be misleading.
Repro:
TORCH_DISTRIBUTED_DEBUG=DETAIL pytest test/distributed/test_c10d_functional_native.py -k test_all_gather_into_tensor_coalesced
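For comparison, here is a minimal standalone sketch (my own, not taken from the issue or the test suite) that runs a functional-collectives all-gather under torchrun so it can be launched with and without TORCH_DISTRIBUTED_DEBUG=DETAIL. The script name and the use of funcol.all_gather_tensor, rather than the coalesced op the failing test exercises, are assumptions for illustration.

```python
# Minimal standalone sketch (not from the issue) of a functional-collectives
# all-gather that can be launched with and without TORCH_DISTRIBUTED_DEBUG=DETAIL.
#
# Example launch (hypothetical filename):
#   TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nproc_per_node 2 funcol_allgather_repro.py
import os

import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol


def main():
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # With TORCH_DISTRIBUTED_DEBUG=DETAIL, PyTorch wraps the process group in a
    # debug wrapper that cross-checks collectives using a gloo helper group,
    # which is why the reported backend/error can differ from the real collective.
    backend = "nccl" if torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend)

    if torch.cuda.is_available():
        torch.cuda.set_device(rank)
        device = torch.device("cuda", rank)
    else:
        device = torch.device("cpu")

    # Each rank contributes a small tensor filled with its rank id.
    shard = torch.full((4,), float(rank), device=device)

    # Functional (traceable) all-gather over the default world group.
    gathered = funcol.all_gather_tensor(shard, gather_dim=0, group=dist.group.WORLD)

    # Using the result forces the collective to complete.
    print(f"rank {rank}/{world_size}: {gathered.tolist()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```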
🐛 Describe the bug
I am using torchtune and receive the error in the title whenever it goes to save the model. I created an issue in their repo (https://github.com/pytorch/torchtune/issues/1762), but it seems to me to be a PyTorch issue. I've seen this with both 2.4.1+cu124 and the nightly version:
The following is the command I'm running and the traceback:
Versions
cc @XilunWu @H-Huang @awgu @kwen2501 @wanchaol @fegin @fduwjj @wz337 @wconstab @d4l3k @c-p-i-o