kohya-ss / sd-scripts

Apache License 2.0
5.15k stars 859 forks source link

Dev branch `sample_images` crashes during multi-GPU training #1002

Closed jihnenglin closed 10 months ago

jihnenglin commented 10 months ago

Similar to issue #994, but happened while attempting to generate sample images in sdxl_train.py.

The error message: AttributeError: 'DistributedDataParallel' object has no attribute 'text_projection'

The issue can be fixed by unwrapping the model, i.e., change

sdxl_train_util.sample_images(
    ...,
    [text_encoder1, text_encoder2],
    unet,
)

to

sdxl_train_util.sample_images(
    ...,
    [accelerator.unwrap_model(text_encoder1), accelerator.unwrap_model(text_encoder2)],
    accelerator.unwrap_model(unet),
)

But I'm not sure if this workaround would potentially break the gradient synchronization issue mentioned in #994. @Isotr0py I'd be very grateful if you could run a sanity check on this.

Isotr0py commented 10 months ago

Yes, this is the correct way to fix this issue.

I've tested this, and gradients are synchronized when nccl:all_reduce is called in accelerator.backward(loss).

jihnenglin commented 10 months ago

Alright, so after merging #1000 and fixing issue #1002, I believe the gradient asynchrony issue has been solved. Here's what I found for my run, compared with the previous run with the same settings:

  1. Lower loss for every step
  2. The output sample images got way better. The sample images in the previous run look broken now to me. The details were cranky and the texture was crude.

Great job, @Isotr0py!