jy0205 / LaVIT

LaVIT: Empower the Large Language Model to Understand and Generate Visual Content

Training detail of codebook #29

Open Faded1022 opened 1 month ago

Faded1022 commented 1 month ago

Thanks for your work! I'm interested in it and tried to reproduce the Dynamic Visual Tokenizer, but my reconstruction loss plateaus around 0.3. Could you give me some suggestions for training? Thanks.

jy0205 commented 1 month ago

Hi, thanks for your attention! Here are some tricks we used during training:

  1. Reduce the codebook dimension (32 is enough) and use k-means initialization for the codebook.
  2. Update the codebook with an EMA update instead of direct gradient training (a minimal sketch of such a quantizer follows this list).
  3. Train the selector and merger first, i.e. without quantization.
  4. After the selector and merger are trained, enable vector quantization and update all modules end-to-end so the codebook is learned (see the two-stage sketch below).
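A minimal sketch of points 1 and 2, not the LaVIT implementation: an EMA-updated vector quantizer with a small codebook dimension and k-means initialization from the first batch of encoder features. The class name `EMAVectorQuantizer`, the codebook size of 16384, and the decay value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EMAVectorQuantizer(nn.Module):
    """Vector quantizer with EMA codebook updates and k-means initialization (sketch)."""

    def __init__(self, num_codes=16384, code_dim=32, decay=0.99, eps=1e-5):
        super().__init__()
        self.code_dim = code_dim
        self.decay = decay
        self.eps = eps
        embed = torch.randn(num_codes, code_dim)
        # Codebook is kept as buffers: it is updated by EMA, not by gradients.
        self.register_buffer("embed", embed)
        self.register_buffer("embed_avg", embed.clone())
        self.register_buffer("cluster_size", torch.zeros(num_codes))
        self.register_buffer("initialized", torch.tensor(False))

    @torch.no_grad()
    def _kmeans_init(self, z, iters=10):
        # Initialize codes by running k-means on the first batch of features.
        # Assumes the batch provides at least `num_codes` vectors; for large
        # codebooks you would pool features over many images first.
        n = self.embed.shape[0]
        centers = z[torch.randperm(z.shape[0])[:n]]
        for _ in range(iters):
            assign = torch.cdist(z, centers).argmin(dim=1)
            counts = torch.bincount(assign, minlength=n)
            sums = torch.zeros_like(centers).index_add_(0, assign, z)
            nonempty = counts > 0
            centers[nonempty] = sums[nonempty] / counts[nonempty].unsqueeze(1).to(z.dtype)
        self.embed.copy_(centers)
        self.embed_avg.copy_(centers)
        self.initialized.fill_(True)

    def forward(self, z):
        # z: (..., code_dim) encoder features already projected to the codebook dim.
        flat = z.reshape(-1, self.code_dim)
        if self.training and not self.initialized:
            self._kmeans_init(flat.detach())
        indices = torch.cdist(flat, self.embed).argmin(dim=1)
        quantized = self.embed[indices].view_as(z)
        if self.training:
            # EMA codebook update; the codebook receives no gradient.
            onehot = F.one_hot(indices, self.embed.shape[0]).type(flat.dtype)
            self.cluster_size.mul_(self.decay).add_(onehot.sum(0), alpha=1 - self.decay)
            self.embed_avg.mul_(self.decay).add_(onehot.t() @ flat.detach(), alpha=1 - self.decay)
            n = self.cluster_size.sum()
            size = (self.cluster_size + self.eps) / (n + self.embed.shape[0] * self.eps) * n
            self.embed.copy_(self.embed_avg / size.unsqueeze(1))
        # Commitment loss pulls encoder outputs toward their assigned codes.
        commit_loss = F.mse_loss(z, quantized.detach())
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = z + (quantized - z).detach()
        return quantized, indices.view(z.shape[:-1]), commit_loss
```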
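And a hypothetical two-stage schedule for points 3 and 4, not LaVIT's actual training code: in stage 1 the quantizer is bypassed so only the selector/merger and decoder learn; in stage 2 quantization is switched on and everything trains end-to-end. `selector_merger`, `decoder`, `target`, and the 0.25 commitment weight are illustrative assumptions.

```python
import torch.nn.functional as F


def training_step(selector_merger, quantizer, decoder, features, target, use_quantization):
    tokens = selector_merger(features)              # select + merge visual tokens
    if use_quantization:
        tokens, _, commit_loss = quantizer(tokens)  # stage 2: quantize the tokens
    else:
        commit_loss = tokens.new_zeros(())          # stage 1: skip quantization
    recon = decoder(tokens)
    return F.mse_loss(recon, target) + 0.25 * commit_loss
```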