kohya-ss / sd-scripts

[Bug] Gradients not synchronized #924

[Open] mephisto28 opened this issue 10 months ago

mephisto28 commented 10 months ago

https://github.com/kohya-ss/sd-scripts/blob/2a23713f71628b2d1b88a51035b3e4ee2b5dbe46/fine_tune.py#L247

I have no idea what this line is used for, but it unwraps the DDP module, so the training process becomes unsynchronized, i.e. there is no gradient communication in multi-GPU training and each node trains independently on its share of the data.
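For context, here is a minimal sketch (not sd-scripts code; the model is illustrative and the process group is assumed to be initialized, e.g. via torchrun) of why a forward pass on the unwrapped module bypasses gradient synchronization: DDP's gradient all-reduce is triggered only by forward passes that go through the DDP wrapper itself.

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

model = DDP(torch.nn.Linear(8, 8).cuda())  # wrapper registers the reduce hooks

def synchronized_step(batch):
    loss = model(batch).mean()   # forward goes through the wrapper
    loss.backward()              # gradients are all-reduced across ranks

def unsynchronized_step(batch):
    inner = model.module         # same effect as unwrapping the DDP model
    loss = inner(batch).mean()   # bypasses the wrapper's forward
    loss.backward()              # no all-reduce: each rank drifts independently
```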

I verified this by adding a sleep in one of the workers and finding that the main training process did not hang (with working synchronization it should block on the gradient all-reduce). After deleting this line, the job was properly synchronized.
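Concretely, the check looks like the sketch below (RANK is set by the launcher such as torchrun; the step function is hypothetical, not sd-scripts code). With working gradient communication, every rank must stall at `backward()` waiting for the delayed worker, so "no hang" means the ranks are not talking to each other.

```python
import os
import time

def train_step(model, batch, optimizer):
    if int(os.environ.get("RANK", "0")) == 1:
        time.sleep(60)           # artificially delay one worker
    loss = model(batch).mean()
    loss.backward()              # synchronized ranks block here on the all-reduce
    optimizer.step()
    optimizer.zero_grad()
```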

mephisto28 commented 10 months ago

With the above-mentioned line not deleted: (image)

With the above-mentioned line deleted: (image)
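A complementary way to confirm what the screenshots show is to compare a cheap parameter checksum across ranks after a few optimizer steps: with working gradient sync the values stay identical, without it they drift apart. A sketch (function name is mine; assumes torch.distributed is already initialized with a CUDA backend):

```python
import torch
import torch.distributed as dist

def print_param_checksums(model):
    # Cheap fingerprint of the weights on this rank.
    local = torch.zeros(1, dtype=torch.float64, device="cuda")
    for p in model.parameters():
        local += p.detach().double().sum()
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local)
    if dist.get_rank() == 0:
        print("per-rank checksums:", [t.item() for t in gathered])
```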

kohya-ss commented 10 months ago

Thank you for opening the issue.

I have not directly developed the multi-GPU training code myself; it was contributed through PRs.

Would these PRs be a useful reference? #165 and #448
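For reference, the usual accelerate pattern (a generic sketch with placeholder model and data, not the exact sd-scripts flow) keeps the prepared, DDP-wrapped model for the entire training loop and unwraps only where the raw module is needed, for example when saving:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(8, 8)                      # placeholder model
optimizer = torch.optim.AdamW(model.parameters())
dataloader = DataLoader(TensorDataset(torch.randn(64, 8)), batch_size=4)

model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for (batch,) in dataloader:
    loss = model(batch).pow(2).mean()  # dummy loss; forward goes through the wrapper
    accelerator.backward(loss)         # gradients are all-reduced here
    optimizer.step()
    optimizer.zero_grad()

# Unwrap only where the raw module is needed, e.g. for checkpointing.
accelerator.save(accelerator.unwrap_model(model).state_dict(), "model.pt")
```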