kohya-ss / sd-scripts


Deepspeed checkpoints are completely undertrained #1217

Open · storuky opened this issue 4 months ago

storuky commented 4 months ago

Hey @kohya-ss and @BootsofLagrangian, thank you for your hard work on the DeepSpeed integration! I see this feature is in the dev branch.

I tried using DeepSpeed in my pipelines but ran into an issue: the resulting checkpoints look undertrained when DeepSpeed is enabled. Could this be related to gradients not being synchronized? I'm currently on the latest version of the dev branch. Here is one image from the dataset: 228-image-227. And here are the results of the same settings with and without DeepSpeed: xyz_grid-0004-3934512569

The only difference between the two runs is three flags:

`--deepspeed --zero_stage=2 --offload_optimizer_device="cpu"`

plus reconfiguring accelerate, roughly as sketched below.
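
For reference, here is a minimal sketch of what that accelerate reconfiguration might look like programmatically, assuming the two extra flags map straight onto accelerate's `DeepSpeedPlugin`. In sd-scripts this is normally driven by `accelerate config` and the CLI flags, so treat this as illustrative only:

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

# Illustrative only: ZeRO stage 2 with the optimizer state offloaded to CPU,
# mirroring --zero_stage=2 and --offload_optimizer_device="cpu".
deepspeed_plugin = DeepSpeedPlugin(
    zero_stage=2,
    offload_optimizer_device="cpu",
)
accelerator = Accelerator(deepspeed_plugin=deepspeed_plugin)
```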

I'm running these experiments on 3x H100 GPUs with the Lion optimizer. To my understanding, the result looks as if the model was effectively trained on a single GPU (even though all three GPUs were utilized at 100%).
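
If it helps narrow this down, here is a rough diagnostic sketch (not code from the repo) I could run to check whether the ranks actually end up with the same weights. Under ZeRO stage 2 the parameters themselves are replicated, so after an optimizer step every rank should report the same checksum:

```python
import torch
import torch.distributed as dist

# Rough diagnostic sketch (not part of sd-scripts): under ZeRO stage 2 only
# optimizer states and gradients are partitioned, so the model weights should
# be identical on every rank after an optimizer step. Comparing a per-rank
# checksum of one parameter makes a missing gradient sync easy to spot.
def check_param_sync(model):
    device = torch.device("cuda", torch.cuda.current_device())  # NCCL needs CUDA tensors
    probe = next(p for p in model.parameters() if p.requires_grad)
    local_sum = probe.detach().float().sum().to(device)
    gathered = [torch.zeros_like(local_sum) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local_sum)
    if dist.get_rank() == 0:
        print("per-rank parameter checksums:", [t.item() for t in gathered])
```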

BootsofLagrangian commented 4 months ago

First, thanks for reporting the issue. Do you see the same phenomenon on other datasets?

storuky commented 4 months ago

@BootsofLagrangian Yes, on all datasets when using the Lion optimizer. I'm not sure; maybe Lion just isn't expected to work as well with DeepSpeed as the Adam family of optimizers... But it doesn't break the training process, so I'd expect it to work...

BootsofLagrangian commented 4 months ago

> @BootsofLagrangian Yes, on all datasets when using the Lion optimizer. I'm not sure; maybe Lion just isn't expected to work as well with DeepSpeed as the Adam family of optimizers... But it doesn't break the training process, so I'd expect it to work...

Yes, DeepSpeed officially supports only Adam, AdamW, and FusedAdam. The behavior of other optimizers is unpredictable.

Can you try DeepSpeed with Adam or AdamW? Not the 8-bit or bitsandbytes variants.
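
For what it's worth, a minimal sketch of the A/B test I have in mind, assuming the optimizer is built in user code and handed to accelerate/DeepSpeed (in sd-scripts the choice is normally made via the optimizer-type option rather than in code, and the `lion_pytorch` import is a third-party package, not something DeepSpeed ships):

```python
import torch

# Sketch of the suggested comparison, not sd-scripts code: build the optimizer
# that gets passed to accelerate/DeepSpeed. AdamW is in DeepSpeed's officially
# supported set; Lion is the configuration under suspicion here.
def build_optimizer(params, name="adamw", lr=1e-4):
    if name == "adamw":
        return torch.optim.AdamW(params, lr=lr, betas=(0.9, 0.999), weight_decay=1e-2)
    if name == "lion":
        from lion_pytorch import Lion  # third-party optimizer, not part of DeepSpeed
        return Lion(params, lr=lr, betas=(0.9, 0.99), weight_decay=1e-2)
    raise ValueError(f"unknown optimizer: {name}")
```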