huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[SD3] vae.config.shift_factor missing in dreambooth training examples #8708

Open bendanzzc opened 6 days ago

bendanzzc commented 6 days ago

Describe the bug

shift_factor is missing in the training code: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sd3.py#L1617, but it is used in the inference code: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L893. Is it reasonable that when training SD3 we do not need to normalize the latents using vae.config.shift_factor and scaling_factor?
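For reference, this is roughly the mismatch (a sketch only; see the two linked lines for the exact code, and variable names here are illustrative):

```python
# Inference (pipeline_stable_diffusion_3.py): latents are un-scaled and the
# shift is added back before decoding.
latents = (latents / vae.config.scaling_factor) + vae.config.shift_factor
image = vae.decode(latents, return_dict=False)[0]

# Training (train_dreambooth_lora_sd3.py): latents are only scaled;
# vae.config.shift_factor is never subtracted.
model_input = vae.encode(pixel_values).latent_dist.sample()
model_input = model_input * vae.config.scaling_factor
```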

Thanks.

Reproduction

None

Logs

No response

System Info

None

Who can help?

@sayakpaul

sayakpaul commented 6 days ago

Good observation! Thank you for bringing this up!

Yeah, ideally, a reversal of the following would be needed: https://github.com/huggingface/diffusers/blob/0f0b531827900d805f8d2d0a42c1040a1e34bf07/src/diffusers/pipelines/stable_diffusion_3/pipeline_stable_diffusion_3.py#L893
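Roughly, a sketch of what that reversal could look like on the training side (variable names are assumptions and may differ from the example script):

```python
# Encode pixels to latents, then apply the inverse of the decode-time transform:
# subtract vae.config.shift_factor before multiplying by vae.config.scaling_factor,
# so that decoding (divide by scaling_factor, add shift_factor) recovers the latents.
model_input = vae.encode(pixel_values).latent_dist.sample()
model_input = (model_input - vae.config.shift_factor) * vae.config.scaling_factor
```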

Do you want to give it a try and open a PR, perhaps? Happy to help you through the process.

bendanzzc commented 6 days ago

Thanks, I'd like to try.

sayakpaul commented 6 days ago

Lovely. Thanks so much.

CodeExplode commented 3 days ago

I've just implemented this in my own training code, which is largely based on the diffusers example, and it does seem to noticeably improve image crispness in some with/without tests on the same training data (though with non-deterministic choices of seed, image ordering, prompt shuffling, etc.).

sayakpaul commented 3 days ago

Do you wanna show some comparisons?

CodeExplode commented 3 days ago

I only kept one image, sorry. I tried training with the character Ahsoka as the toughest example in my dataset.

The left is training without handling the shift, the right is with handling the shift, on approximately the same prompt (with some shuffling) at about the same number of steps. Without applying the shift, all samples looked blurry like the left one (after a few epochs), whereas with the shift there was a mix of blurry and crisp previews, so it seemed to be helping. The samples were always generated with the shift applied from the start, as they use different code.

[Image: SD3_AhsokaTrainingExample]

This was full finetuning, rather than LoRA training.