dwromero opened this issue 6 days ago
@dwromero hey David! want to try setting this to False for now and see if that resolves your issue?
Hi @lucidrains , thank you for your fast reply. I'll try it out now.
David
@lucidrains, it works now! I do not know whether this is a full solution to the problem, though. Please let me know if you think it is, and I can close the issue.
@dwromero nice! at least it doesn't block you from your research now!
if you'd like to help me get to the bottom of this, you could turn it back to True in 1.14.40 and share with me the stack trace once it errors again
@dwromero the other thing that would be helpful (if you have the time), is to run it with only one quantizer and see if it still errors 🙏
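For what it's worth, a minimal sketch of that single-quantizer check, assuming the standard `ResidualVQ` constructor from vector-quantize-pytorch; the dimensions and codebook size below are placeholders, not the configuration from the original report:

```python
import torch
from vector_quantize_pytorch import ResidualVQ

# Placeholder sizes -- the only point of this check is num_quantizers = 1.
residual_vq = ResidualVQ(
    dim = 256,
    codebook_size = 1024,
    num_quantizers = 1,   # single quantizer, to see whether the error still occurs
)

x = torch.randn(1, 1024, 256)                     # (batch, seq, dim)
quantized, indices, commit_loss = residual_vq(x)
```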
@dwromero hey David, realized just now the local sampling won't work, as the codes will no longer be synced
could you try again on the latest?
@dwromero hey David again
so I think your error may be related to an issue with quantize dropout in a distributed environment, which would also make the above solution not work. I put in a potential fix, if you are still running experiments
another way to avoid this issue is to offer a way to delay the expiration of the codes until all the quantizers have been invoked
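For reference, a minimal sketch of turning quantize dropout off while debugging, assuming it is exposed as the `quantize_dropout` keyword on `ResidualVQ`; everything else below is a placeholder:

```python
import torch
from vector_quantize_pytorch import ResidualVQ

# Placeholder configuration -- the relevant knob here is quantize_dropout.
residual_vq = ResidualVQ(
    dim = 256,
    codebook_size = 1024,
    num_quantizers = 8,
    quantize_dropout = False,  # rule quantize dropout out as the cause in the distributed run
)

x = torch.randn(1, 1024, 256)
quantized, indices, commit_loss = residual_vq(x)
```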
Hi @lucidrains,
So, just to clarify: I should be able to run with the same configuration as in the original post of this thread, and it should work now? Would you like me to check that?
Hi all,
I noticed that using ResidualVQ as:

Leads to the following error:
This happens randomly during training (in a multinode setting). Any idea what the cause could be?
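The configuration and stack trace referenced above were not captured here; as context only, a minimal sketch of a typical `ResidualVQ` forward pass (placeholder sizes, not the reported setup), which each rank would execute identically in a multinode run:

```python
import torch
from vector_quantize_pytorch import ResidualVQ

# Illustrative only -- not the configuration from this report.
residual_vq = ResidualVQ(
    dim = 256,
    codebook_size = 1024,
    num_quantizers = 8,
)

x = torch.randn(1, 1024, 256)                     # (batch, seq, dim)
quantized, indices, commit_loss = residual_vq(x)  # quantized has the same shape as x
```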