jy0205 / LaVIT

LaVIT: Empower the Large Language Model to Understand and Generate Visual Content

question about the DYNAMIC VISUAL TOKENIZER #10

Closed · bpwl0121 closed this issue 8 months ago

bpwl0121 commented 8 months ago

hi,

thanks for your awesome work. I have a question regarding the loss function of the dynamic visual tokenizer: the codebook produces discrete tokens V_q, so how do you train the dynamic visual tokenizer with the cosine-similarity loss and the mask loss when part of the pipeline is discrete?

thx

jy0205 commented 8 months ago

Thanks for your attention~ We use the Gumbel-Softmax trick to maintain gradient flow through the discrete selection during training.
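
For readers unfamiliar with the trick, here is a minimal PyTorch sketch (shapes and names are illustrative, not LaVIT's actual code) showing how a Gumbel-Softmax sample with `hard=True` keeps a discrete patch keep/drop decision differentiable via the straight-through estimator:

```python
import torch
import torch.nn.functional as F

def select_tokens(keep_logits: torch.Tensor, features: torch.Tensor, tau: float = 1.0):
    """keep_logits: (B, N, 2) per-patch logits for [drop, keep];
    features: (B, N, D) patch features."""
    # hard=True returns one-hot decisions in the forward pass, while gradients
    # flow through the underlying soft probabilities (straight-through).
    decisions = F.gumbel_softmax(keep_logits, tau=tau, hard=True, dim=-1)
    keep_mask = decisions[..., 1:]          # (B, N, 1), 1.0 where a patch is kept
    return features * keep_mask, keep_mask  # masked features stay differentiable
```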

bpwl0121 commented 8 months ago

> Thanks for your attention~ We use the Gumbel-Softmax trick to maintain gradient flow through the discrete selection during training.

The Gumbel-Softmax trick does its job for sure, but I think it only applies to the mask loss. How is the codebook itself trained? You don't seem to use a VQ-VAE-style loss, the common objective for codebook training. [attached image: the standard VQ-VAE codebook loss]
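
(For reference, the standard VQ-VAE codebook/commitment objective that the attached image presumably showed, with $\operatorname{sg}[\cdot]$ the stop-gradient operator, $z_e(x)$ the encoder output, $e$ the nearest codebook entry, and $\beta$ the commitment weight:)

$$\mathcal{L}_{\mathrm{VQ}} = \big\lVert \operatorname{sg}[z_e(x)] - e \big\rVert_2^2 + \beta\,\big\lVert z_e(x) - \operatorname{sg}[e] \big\rVert_2^2$$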

jy0205 commented 8 months ago

Sorry for the late reply~ In our implementation, we first train the token predictor while freezing the other modules (without feature quantization), which aims to obtain a good predictor that selects the most informative patches. Then we freeze the token predictor and conduct the joint VQ training with the objective of Eq. 4 in the paper to update the other modules.
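
A minimal sketch of that two-stage schedule (the submodule names `token_predictor`, `encoder`, `codebook`, and `decoder` are illustrative placeholders, not LaVIT's actual attribute names):

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for p in module.parameters():
        p.requires_grad_(trainable)

def configure_stage(tokenizer: nn.Module, stage: int) -> None:
    """Stage 1: train only the token predictor (no feature quantization).
    Stage 2: freeze the predictor and jointly train the remaining modules
    with the VQ objective (Eq. 4 in the paper)."""
    predictor_only = (stage == 1)
    set_trainable(tokenizer.token_predictor, predictor_only)
    for name in ("encoder", "codebook", "decoder"):
        set_trainable(getattr(tokenizer, name), not predictor_only)
```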

bpwl0121 commented 8 months ago

> Sorry for the late reply~ In our implementation, we first train the token predictor while freezing the other modules (without feature quantization), which aims to obtain a good predictor that selects the most informative patches. Then we freeze the token predictor and conduct the joint VQ training with the objective of Eq. 4 in the paper to update the other modules.

ah, I see. Thanks for your reply 👍