mephisto28 opened 10 months ago
With the above-mentioned line not deleted:
With the above-mentioned line deleted:
Thank you for opening the issue.
I did not develop the multi-GPU training code myself; it came in through PRs.
These PRs may serve as a reference: #165 and #448
https://github.com/kohya-ss/sd-scripts/blob/2a23713f71628b2d1b88a51035b3e4ee2b5dbe46/fine_tune.py#L247
https://github.com/kohya-ss/sd-scripts/blob/2a23713f71628b2d1b88a51035b3e4ee2b5dbe46/fine_tune.py#L247
I have no idea what this line is for, but it unwraps the DDP module, so the training process becomes unsynchronized: there is no gradient communication in multi-GPU training, and each node trains independently on its own part of the data.
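To illustrate why skipping DDP's gradient all-reduce makes replicas diverge, here is a toy pure-Python simulation (no torch, no actual DDP; the `train`/`grad` helpers are hypothetical, made up for this sketch). With per-step gradient averaging, which is what the DDP wrapper's forward hooks provide, the weight replicas stay identical; without it, each replica drifts toward its own data shard:

```python
# Toy simulation of two data-parallel "workers" doing SGD on a 1-D
# least-squares problem. sync=True averages gradients each step (the
# all-reduce DDP normally performs); sync=False mimics calling the
# unwrapped module, where each replica updates independently.

def grad(w, x, y):
    # gradient of 0.5 * (w*x - y)^2 with respect to w
    return (w * x - y) * x

def train(shards, sync, lr=0.1, steps=20):
    ws = [0.0] * len(shards)  # one weight replica per worker
    for step in range(steps):
        gs = [grad(ws[i], *shards[i][step % len(shards[i])])
              for i in range(len(ws))]
        if sync:
            g = sum(gs) / len(gs)   # gradient all-reduce (mean)
            gs = [g] * len(gs)
        ws = [w - lr * g for w, g in zip(ws, gs)]
    return ws

shard_a = [(1.0, 2.0)]  # worker 0 only ever sees target y=2
shard_b = [(1.0, 4.0)]  # worker 1 only ever sees target y=4

print(train([shard_a, shard_b], sync=True))   # replicas stay identical
print(train([shard_a, shard_b], sync=False))  # replicas diverge
```

With synchronization both replicas track the same averaged gradient and converge toward the mean target; without it, worker 0 fits y=2 and worker 1 fits y=4, which is exactly the "each node trains independently" behavior described above.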
I verified this by adding a sleep in one of the workers and observing that the main training process never hangs waiting for it. After deleting this line, the job becomes properly synchronized.