huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Mysterious weights when training UNET #4931

Closed yeonsikch closed 11 months ago

yeonsikch commented 1 year ago

I was training the SDXL UNet base model, which was going great until around step 210k, when the weights suddenly reverted to their original values and stayed that way. I also tried the EMA version, which didn't change at all. I looked at the tensor weight values directly as well, which confirmed my suspicion. Is this a bug, or more likely a mistake on my part? Has anyone experienced something similar with any other SD model?
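
Roughly, the direct check I mean looks like this (a minimal sketch; the checkpoint path is a placeholder and assumes the checkpoint directory contains a unet/ subfolder):

import torch
from diffusers import UNet2DConditionModel

# Load the original base UNet and a saved checkpoint (placeholder path).
base = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="unet"
)
trained = UNet2DConditionModel.from_pretrained(
    "/app/outputs/checkpoint-210000", subfolder="unet"
)

# Largest absolute difference across all parameters; ~0.0 would mean the
# checkpoint is effectively still the original weights.
with torch.no_grad():
    max_diff = max(
        (p - q).abs().max().item()
        for p, q in zip(base.parameters(), trained.parameters())
    )
print(f"max abs weight difference vs. base: {max_diff}")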

Code: diffusers/examples/text_to_image/train_text_to_image_sdxl.py
Command:

accelerate launch --mixed_precision="fp16" train_text_to_image_sdxl.py \
    --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
    --pretrained_vae_model_name_or_path="madebyollin/sdxl-vae-fp16-fix" \
    --seed=21 \
    --image_column="image" \
    --caption_column="caption" \
    --dataset_name="{{huggingface_dataset_path}}" \
    --validation_prompt="1girl, bangs, black_hair, blunt_bangs, japanese_clothes, kimono, long_sleeves, looking_at_viewer, obi, red_kimono, sash, short_hair, simple_background, smile, solo, upper_body, white_background" \
    --use_ema \
    --random_flip \
    --train_batch_size=2 \
    --max_train_steps=800000 \
    --learning_rate=5e-5 \
    --max_grad_norm=1 \
    --lr_scheduler="linear" \
    --lr_warmup_steps=50000 \
    --output_dir="/app/outputs" \
    --huggingface_repo="{{huggingface_save_path}}" \
    --report_to wandb \
    --push_to_hub \
    --checkpointing_steps=10000 \
    --validation_epochs=1 \
    --dataloader_num_workers=8 \
    --snr_gamma=5 \
    --force_snr_gamma \
    --enable_xformers_memory_efficient_attention

Example plot (same seed): [image: sdxl_weight_issue]

sayakpaul commented 1 year ago

That is indeed very weird. Could you confirm if this happens for SD under the same settings?

Does this also happen when you train for a smaller number of steps?

yeonsikch commented 1 year ago

That is indeed very weird. Could you confirm if this happens for SD under the same settings?

Does this also happen when you train for a smaller number of steps?

I haven't trained for a smaller number of steps yet, but I'm training SD 1.5 now. I'll reply after it reaches 210k steps.

Thanks!

bram-w commented 1 year ago

EDIT: The below is probably unrelated to the above issue, but I'll keep it up in case anyone runs into a similar situation and finds this. It turns out it was because I added another model (a frozen copy of the original), which caused this line of code https://github.com/huggingface/diffusers/blob/v0.20.0-release/examples/text_to_image/train_text_to_image.py#L635 to overwrite the actual trained model immediately after saving.
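
For context, the save hook in that script loops over every model registered with Accelerate and writes each one into the same unet/ subfolder, so an extra (frozen) model saved last silently overwrites the trained weights. Paraphrased sketch, not the exact source:

import os

def save_model_hook(models, weights, output_dir):
    # Every model accelerate has prepared/registered shows up here, and each
    # one is written into the same "unet" folder, so the last one wins.
    for model in models:
        model.save_pretrained(os.path.join(output_dir, "unet"))
        # pop the weights so accelerate doesn't serialize the state dict again
        weights.pop()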

What happens if you check the md5sum of the UNet checkpoints? I'm running into a similar issue: I adapted the training script to a specialized objective, and the loss is improving (implying the UNet parameters are changing), but all of my checkpoints are identical (even their md5sums match).
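
If it helps, a quick sketch of that check (the output directory and file pattern are placeholders for whatever your --output_dir and checkpointing produced):

import hashlib
from pathlib import Path

def md5sum(path, chunk_size=1 << 20):
    # Hash the file in chunks so large checkpoints don't need to fit in memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk_size):
            digest.update(block)
    return digest.hexdigest()

# Identical digests across checkpoints mean the serialized UNet weights
# never changed between saves.
for weights_file in sorted(Path("outputs").glob("checkpoint-*/unet/diffusion_pytorch_model.*")):
    print(weights_file, md5sum(weights_file))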

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.