Closed: jihnenglin closed this issue 10 months ago
Yes, this is the correct way to fix this issue. I've tested it, and gradients are synchronized when `nccl:all_reduce` is called in `accelerator.backward(loss)`.
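For reference, here is a minimal sketch of the kind of check I mean (the toy model, file name, and variable names are illustrative, not the actual sd-scripts code): seed each rank differently so the local batches differ, call `accelerator.backward(loss)`, and compare a gradient fingerprint across ranks; identical values indicate the all-reduce happened.

```python
# Minimal sanity check: verify DDP gradients are synchronized after
# accelerator.backward(loss). Run with: accelerate launch --num_processes 2 check_sync.py
# The toy model and data below are illustrative, not sd-scripts code.
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = accelerator.prepare(torch.nn.Linear(8, 1))  # DDP-wrapped when multi-process

# Seed each rank differently so per-rank batches (and local gradients) differ.
torch.manual_seed(accelerator.process_index)
x = torch.randn(4, 8, device=accelerator.device)

loss = model(x).pow(2).mean()
accelerator.backward(loss)  # under DDP this triggers nccl:all_reduce

# Fingerprint the gradients and gather one value per rank.
grad_norm = torch.stack([p.grad.norm() for p in model.parameters()]).sum()
all_norms = accelerator.gather(grad_norm.unsqueeze(0))
if accelerator.is_main_process:
    print(all_norms)  # identical entries => gradients were all-reduced
    assert torch.allclose(all_norms, all_norms[0].expand_as(all_norms))
```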
Alright, so after merging #1000 and fixing issue #1002, I believe the gradient asynchrony issue has been solved. Here's what I found for my run, compared with the previous run with the same settings:
Great job, @Isotr0py!
Similar to issue #994, but this happened while attempting to generate sample images in `sdxl_train.py`. The error message:
```
AttributeError: 'DistributedDataParallel' object has no attribute 'text_projection'
```
The issue can be fixed by unwrapping the model, i.e., changing the direct attribute access on the DDP-wrapped text encoder so it goes through the unwrapped model instead (see the sketch below).
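Since the exact before/after snippets aren't shown above, here is a hedged, self-contained sketch of the failure mode and the fix; `TinyTextEncoder`, `enc_out`, and `pooled` are hypothetical stand-ins, and `accelerator.unwrap_model` is the Accelerate helper that strips the `DistributedDataParallel` wrapper:

```python
# Illustrative reproduction and fix (hypothetical names, not the sdxl_train.py diff).
import torch
from accelerate import Accelerator

class TinyTextEncoder(torch.nn.Module):
    """Stand-in for a text encoder that exposes a `text_projection` layer."""
    def __init__(self):
        super().__init__()
        self.text_projection = torch.nn.Linear(16, 16, bias=False)

    def forward(self, h):
        return self.text_projection(h)

accelerator = Accelerator()
text_encoder2 = accelerator.prepare(TinyTextEncoder())  # DDP-wrapped when multi-process

enc_out = torch.randn(2, 16, device=accelerator.device)

# Fails under DDP: the wrapper itself has no `text_projection` attribute.
# pooled = text_encoder2.text_projection(enc_out)

# Fix: unwrap first (equivalently, text_encoder2.module.text_projection under DDP).
pooled = accelerator.unwrap_model(text_encoder2).text_projection(enc_out)
print(pooled.shape)
```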
But I'm not sure whether this workaround could reintroduce the gradient synchronization problem mentioned in #994. @Isotr0py, I'd be very grateful if you could run a sanity check on this.