jy0205 / LaVIT

LaVIT: Empower the Large Language Model to Understand and Generate Visual Content

Difference between the visual tokenizer for Generation task and Understanding tasks #9

Closed: tedfeng424 closed this issue 6 months ago

tedfeng424 commented 6 months ago

Hello,

Really awesome work!

I noticed in the code that the visual tokenizers are loaded differently for generation and understanding tasks. What are the differences between them? Is it that the tokenizer for understanding lacks the quantization step? Are they trained differently?

Thanks

jy0205 commented 6 months ago

Hi, thanks for your interest in our work. For the understanding task, we use the continuous visual embeddings from the token merger (as explained in the paper), so it needs to load the token merger's updated weights from Stage-2 pre-training. For the generation task, we directly use the tokenizer trained in Stage-1 to tokenize the image into discrete tokens for auto-regressive generation in the LLM.
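To make the two paths concrete, here is a minimal, self-contained sketch. All names and shapes below (ToyVisualTokenizer, the single-token linear encoder, the codebook size) are hypothetical stand-ins for illustration, not LaVIT's actual API:

```python
import torch
import torch.nn as nn

class ToyVisualTokenizer(nn.Module):
    """Hypothetical stand-in showing the two paths, not LaVIT's real code."""

    def __init__(self, feat_dim=256, codebook_size=16384):
        super().__init__()
        self.encoder = nn.Linear(3 * 224 * 224, feat_dim)     # stand-in for the ViT backbone
        self.token_merger = nn.Linear(feat_dim, feat_dim)     # stand-in for the token merger
        self.codebook = nn.Embedding(codebook_size, feat_dim) # Stage-1 VQ codebook

    def forward_understanding(self, image):
        # Understanding: return the continuous token-merger embeddings
        # (the merger here would carry the Stage-2-updated weights).
        return self.token_merger(self.encoder(image.flatten(1)))

    def forward_generation(self, image):
        # Generation: quantize the merged features against the Stage-1
        # codebook and keep only the discrete ids for the LLM.
        feats = self.token_merger(self.encoder(image.flatten(1)))
        dists = torch.cdist(feats, self.codebook.weight)  # (B, codebook_size)
        return dists.argmin(dim=-1)                       # discrete token ids, (B,)

img = torch.randn(2, 3, 224, 224)
tok = ToyVisualTokenizer()
cont = tok.forward_understanding(img)  # continuous embeddings, shape (2, 256)
ids = tok.forward_generation(img)      # discrete token ids, shape (2,)
```

The only structural difference is the final codebook lookup: understanding stops at the continuous merger output, while generation snaps each feature to its nearest codebook entry and keeps the index.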

tedfeng424 commented 6 months ago

Thank you for clarifying. To follow up on this: which modules are optimized in Stage-2 pre-training? The language model and the token merger, and anything else?

jy0205 commented 6 months ago

Yes. During Stage-2 pre-training, only the language model, the token merger, and a projector (mapping the visual features to the LLM's 4096-dimensional input space) are optimized; everything else, including the Stage-1 visual tokenizer, stays frozen.
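A hedged sketch of that trainable/frozen split (module names here are hypothetical stand-ins, not LaVIT's actual attributes):

```python
import torch.nn as nn

class ToyLaVIT(nn.Module):
    """Hypothetical stand-in model; attribute names are illustrative only."""

    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.visual_encoder = nn.Linear(vis_dim, vis_dim)  # frozen in Stage-2
        self.token_merger = nn.Linear(vis_dim, vis_dim)    # trained in Stage-2
        self.projector = nn.Linear(vis_dim, llm_dim)       # trained: maps to the 4096-dim LLM space
        self.llm = nn.Linear(llm_dim, llm_dim)             # trained (stand-in for the language model)

def set_stage2_trainable(model: ToyLaVIT):
    # Freeze everything first, then unfreeze only the Stage-2 modules.
    for p in model.parameters():
        p.requires_grad = False
    for m in (model.token_merger, model.projector, model.llm):
        for p in m.parameters():
            p.requires_grad = True

model = ToyLaVIT()
set_stage2_trainable(model)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)  # only token_merger.*, projector.*, and llm.* parameters
```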

tedfeng424 commented 6 months ago

Thank you for the clarification!