hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

regarding setting scaling_factor=0.18215 instead of 0.13025 in stage 1 vae training #493

Closed erliding closed 3 months ago

erliding commented 3 months ago

Dear open-sora,

I see that vae_2d is initialized from "PixArt-alpha/pixart_sigma_sdxlvae_T5_diffusers", which should have scaling_factor=0.13025. In the code, however, SD v1.5's scaling_factor=0.18215 is hard-coded everywhere, for example: https://github.com/hpcaitech/Open-Sora/blob/9c4444207f18e6cf851e8cbac689f32bef762075/opensora/models/vae/vae.py#L35. This value also seems to be used for stage 1 training of VAE_Temporal. Is this on purpose, or is it a bug? It could leave the input std for vae_temporal less well normalized to 1 than it would be with 0.13025, but it doesn't seem to have any other obvious impact, since an additional scale and shift are applied to the 3D latent at the end.
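To put a rough number on the std concern, here is a minimal sketch. It assumes the usual convention that a VAE's scaling factor is approximately the reciprocal of the raw latent std; the latents are synthetic Gaussian samples, not real Open-Sora data, and `scaled_std` is a hypothetical helper, not a function from the repository.

```python
import random
import statistics

SDXL_VAE_SCALE = 0.13025  # factor expected for the SDXL VAE checkpoint
SD15_VAE_SCALE = 0.18215  # factor hard-coded in the code


def scaled_std(raw_latents, scaling_factor):
    """Population std of the latents after multiplying by the scaling factor."""
    return statistics.pstdev([x * scaling_factor for x in raw_latents])


# Assumption: raw latent std is about 1 / 0.13025, so the SDXL factor maps it to ~1.
rng = random.Random(0)
raw = [rng.gauss(0.0, 1.0 / SDXL_VAE_SCALE) for _ in range(20_000)]

std_sdxl = scaled_std(raw, SDXL_VAE_SCALE)
std_sd15 = scaled_std(raw, SD15_VAE_SCALE)
ratio = std_sd15 / std_sdxl  # about 1.398 (= 0.18215 / 0.13025)

print(f"std with 0.13025: {std_sdxl:.3f}")
print(f"std with 0.18215: {std_sd15:.3f}")
print(f"ratio: {ratio:.3f}")
```

Under that assumption, the hard-coded value inflates the input std of the temporal VAE by a constant factor of roughly 1.4 rather than changing the shape of the distribution.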

zhengzangw commented 3 months ago

Thank you for your question. We checked and found that we made a mistake here. Here is what happened:

  1. In Open-Sora 1.0 and 1.1 we used the VAE from SD 1.5 and therefore hard-coded 0.18215; we then forgot to change it to the new scale.
  2. However, training still works, and for diffusion training we normalize the output channel-wise.

So if you want to use our VAE, keep the value at 0.18215. If you want to train the VAE from scratch using our code, I suggest changing the scale.
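The point about channel-wise normalization can be illustrated with a small sketch. It is not the Open-Sora implementation: `channelwise_normalize` and `scale_all` are hypothetical helpers, the latents are synthetic, and the normalization is assumed to be a per-channel shift and scale computed on the same scaled latents.

```python
import random
import statistics


def channelwise_normalize(channels):
    """Standardize each channel to zero mean and unit std (hypothetical helper)."""
    out = []
    for values in channels:
        shift = statistics.fmean(values)
        scale = statistics.pstdev(values)
        out.append([(v - shift) / scale for v in values])
    return out


def scale_all(channels, factor):
    """Multiply every latent value by a global scaling factor."""
    return [[v * factor for v in values] for values in channels]


# Synthetic raw latents: 4 channels with different means and spreads (assumed values).
rng = random.Random(0)
raw = [
    [rng.gauss(mu, sigma) for _ in range(1000)]
    for mu, sigma in [(0.5, 6.0), (-1.0, 8.0), (0.0, 7.5), (2.0, 9.0)]
]

norm_sd15 = channelwise_normalize(scale_all(raw, 0.18215))
norm_sdxl = channelwise_normalize(scale_all(raw, 0.13025))

# A global positive factor cancels out in (v - mean) / std, so both results agree.
max_diff = max(
    abs(a - b)
    for ch_a, ch_b in zip(norm_sd15, norm_sdxl)
    for a, b in zip(ch_a, ch_b)
)
print(f"max difference after normalization: {max_diff:.2e}")
```

The cancellation only holds when the shift and scale statistics were computed with the same factor that is applied later, which is consistent with the advice to keep 0.18215 when using the released VAE.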