@aluminumbox hi Xiang
you can actually force ddp all reduce to be turned off by setting `sync_codebook = False`
that's interesting it hangs though, is fairseq being used, as in the other issue? i'm not familiar with torchrun
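(For anyone landing here later, a minimal sketch of what that looks like, assuming `sync_codebook` is accepted directly by `VectorQuantize`; the other arguments and tensor shapes are only illustrative.)

```python
import torch
from vector_quantize_pytorch import VectorQuantize

# sync_codebook = False disables the distributed all_reduce inside the codebook,
# so the module no longer blocks on cross-rank communication under torchrun.
vq = VectorQuantize(
    dim = 256,            # illustrative
    codebook_size = 512,  # illustrative
    sync_codebook = False
)

x = torch.randn(1, 1024, 256)
quantized, indices, commit_loss = vq(x)
```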
Thanks for the advice. I tried `sync_codebook = False` and everything is okay now. Yes, when I run torchrun train.py containing this package, DDP hangs for 30 minutes and then times out. I am using another toolkit named wenet, and running the code in torchrun multi-node mode. It may be helpful to add this tip as a comment to help others who might also be stuck with torchrun.
@aluminumbox are you using gloo or nccl?
I am using `nccl` in torchrun multi-node mode and it hangs for 30 minutes. If I use `gloo`, it gives this error:
terminate called after throwing an instance of 'gloo::EnforceNotMet' what(): [enforce fail at ../third_party/gloo/gloo/transport/tcp/pair.cc:446] op.preamble.length <= op.nbytes. 971396 vs 256
Should I close this issue, since it can be solved by setting `sync_codebook = False`?
@aluminumbox what happens if you set `sync_codebook = True` with `nccl`?
Sorry for the late reply. If I use `nccl`, it hangs for about 30 minutes and then times out, with no detailed error log. If I use `gloo`, it gives the error above: op.preamble.length <= op.nbytes. 971396 vs 256
Hi, I've noticed that there is some distributed.all_reduce code in vector-quantize-pytorch. If I use this package inside DDP-wrapped code, for example running torchrun train.py where train.py contains vector-quantize-pytorch code, DDP will hang forever.
The same problem has been observed in issue #30.
I wonder if, in torchrun mode, I can simply remove all the distributed.all_reduce calls and set self.kmeans_all_reduce_fn = noop and self.all_reduce_fn = noop.
Currently this is my solution and the model can continue training under torchrun. But I worry that this might lead to incorrect results. I hope to learn the correct way to make this package work with torchrun.
Thanks!
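For completeness, a sketch of the noop workaround described above. The attribute names come from the issue text; whether they live on `vq._codebook` depends on the installed version, and per the reply above, `sync_codebook = False` is the cleaner option.

```python
from vector_quantize_pytorch import VectorQuantize

def noop(*args, **kwargs):
    # does nothing in place of distributed.all_reduce
    pass

vq = VectorQuantize(dim = 256, codebook_size = 512)  # illustrative arguments

# Override the codebook's all-reduce hooks so no cross-rank sync is attempted.
# _codebook is assumed to be the underlying codebook module instance.
vq._codebook.all_reduce_fn = noop
vq._codebook.kmeans_all_reduce_fn = noop
```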