lucidrains / vector-quantize-pytorch

Vector (and Scalar) Quantization, in Pytorch

DDP hang when using vector-quantize-pytorch within torchrun #105

Closed · aluminumbox closed this issue 4 months ago

aluminumbox commented 5 months ago

Hi, I've noticed that there is some distributed.all_reduce code in vector-quantize-pytorch. If I use this package inside DDP-wrapped code, for example by running torchrun train.py where train.py uses vector-quantize-pytorch, DDP hangs forever.

The same problem was observed in issue #30.

I wonder, in torchrun mode, can I simply remove all the distributed.all_reduce calls by setting self.kmeans_all_reduce_fn = noop and self.all_reduce_fn = noop?

This is currently my workaround, and with it the model keeps training under torchrun, but I worry that it might lead to incorrect results. I'd like to know the correct way to make this package work with torchrun. Concretely, the workaround looks roughly like the sketch below.
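(A minimal sketch of the workaround; the _codebook attribute and hook names are taken from the library internals referenced in this issue and may differ across versions.)

```python
from vector_quantize_pytorch import VectorQuantize

def noop(*args, **kwargs):
    pass

vq = VectorQuantize(dim = 256, codebook_size = 512)  # placeholder arguments

# replace the distributed all-reduce hooks with no-ops so DDP never
# blocks on them (attribute names may vary across library versions)
vq._codebook.kmeans_all_reduce_fn = noop
vq._codebook.all_reduce_fn = noop
```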

Thanks!

lucidrains commented 4 months ago

@aluminumbox hi Xiang

you can actually force ddp all reduce to be turned off by setting sync_codebook = False
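for example, a minimal sketch (the other constructor arguments here are just placeholders):

```python
from vector_quantize_pytorch import VectorQuantize

vq = VectorQuantize(
    dim = 256,             # placeholder
    codebook_size = 512,   # placeholder
    sync_codebook = False  # disable the distributed all-reduce of codebook statistics
)
```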

that's interesting it hangs though, is fairseq being used, as in the other issue? i'm not familiar with torchrun

aluminumbox commented 4 months ago

Thanks for the advice. I tried sync_codebook = False and everything is okay now. Yes, when I run torchrun train.py with code that uses this package, DDP hangs for 30 minutes and then times out. I am using another toolkit named wenet, running in torchrun multi-node mode. It may be helpful to add this tip to the documentation for others who might also get stuck with torchrun.

lucidrains commented 4 months ago

@aluminumbox are you using gloo or nccl?

aluminumbox commented 4 months ago

I am using nccl in torchrun multi-node mode, and it hangs for 30 minutes. If I use gloo, it gives the following error:

terminate called after throwing an instance of 'gloo::EnforceNotMet' what(): [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:446] op.preamble.length <= op.nbytes. 971396 vs 256
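(For context, the backend is the one passed when the process group is initialized; with torchrun a typical setup is roughly the sketch below, using env:// because torchrun exports the rendezvous environment variables. This is a sketch, not the exact wenet code.)

```python
import torch.distributed as dist

# torchrun sets RANK, WORLD_SIZE, MASTER_ADDR and MASTER_PORT, so the
# env:// init method picks them up automatically
dist.init_process_group(backend = "nccl", init_method = "env://")  # or backend = "gloo"
```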

Should I close this issue, since it can be solved by setting sync_codebook = False?

lucidrains commented 4 months ago

@aluminumbox what happens if you set sync_codebook = True with nccl?

aluminumbox commented 4 months ago

Sorry for the late reply. If I use nccl, it hangs for about 30 minutes and then times out, with no detailed error log. If I use gloo, it gives the error above: op.preamble.length <= op.nbytes. 971396 vs 256.