tedfeng424 closed this issue 6 months ago
Hi, thanks for your interest in our work. For the understanding task, we use the continuous visual embeddings from the token merger (as explained in the paper), so it needs to load the updated token-merger weights from the Stage-2 pre-training checkpoint. For the generation task, we directly use the tokenizer trained in Stage-1 to tokenize the image into discrete tokens for auto-regressive generation in the LLM.
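To make the distinction concrete, here is a minimal sketch (the module names like `token_merger` and `codebook`, the dimensions, and the placeholder layers are all hypothetical, not the repository's actual classes) of how the two paths could differ: the understanding path returns the continuous token-merger embeddings, while the generation path quantizes them into discrete token ids for the LLM.

```python
# Toy sketch only: hypothetical names and shapes, not the repo's real API.
import torch
import torch.nn as nn


class VisualTokenizerSketch(nn.Module):
    """Stand-in: encoder -> token merger -> (optional) vector quantizer."""

    def __init__(self, dim=1024, codebook_size=16384):
        super().__init__()
        self.encoder = nn.Linear(3 * 64 * 64, dim)        # placeholder visual encoder
        self.token_merger = nn.Linear(dim, dim)            # placeholder token merger
        self.codebook = nn.Embedding(codebook_size, dim)   # VQ codebook (Stage-1)

    def forward_understanding(self, pixels):
        # Understanding: keep the CONTINUOUS embeddings from the token merger
        # (token-merger weights would come from the Stage-2 checkpoint).
        feats = self.encoder(pixels.flatten(1))
        return self.token_merger(feats)

    def forward_generation(self, pixels):
        # Generation: quantize to DISCRETE token ids with the Stage-1 tokenizer,
        # which the LLM then predicts auto-regressively.
        feats = self.encoder(pixels.flatten(1))
        merged = self.token_merger(feats)
        dists = torch.cdist(merged, self.codebook.weight)   # distance to each code
        return dists.argmin(dim=-1)                          # nearest-code ids


if __name__ == "__main__":
    tok = VisualTokenizerSketch()
    img = torch.randn(2, 3, 64, 64)
    print(tok.forward_understanding(img).shape)  # continuous embeddings: [2, 1024]
    print(tok.forward_generation(img).shape)     # discrete token ids:    [2]
```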
Thank you for clarifying. Just to follow up on this: which modules are optimized during Stage-2 pre-training? The language model, the token merger, and anything else?
Yes, during Stage-2 pre-training, only the language model, the token merger, and a projector (mapping the visual features to a 4096-dim space) are optimized.
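For illustration, a rough sketch (with assumed module names, not the repository's actual code) of how Stage-2 training could mark only the language model, token merger, and projector as trainable while keeping the rest of the visual tokenizer frozen:

```python
# Sketch with hypothetical module names; only illustrates the freezing pattern.
import torch.nn as nn


def set_stage2_trainable(model: nn.Module) -> None:
    # Only the LLM, token merger, and projector receive gradients in Stage 2.
    trainable_prefixes = ("language_model.", "token_merger.", "projector.")
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefixes)


class Stage2ModelSketch(nn.Module):
    def __init__(self, visual_dim=1024, llm_dim=4096):
        super().__init__()
        self.visual_encoder = nn.Linear(visual_dim, visual_dim)   # frozen
        self.token_merger = nn.Linear(visual_dim, visual_dim)     # trained
        self.projector = nn.Linear(visual_dim, llm_dim)           # trained (maps to 4096)
        self.language_model = nn.Linear(llm_dim, llm_dim)         # trained (stand-in LLM)


model = Stage2ModelSketch()
set_stage2_trainable(model)
print([n for n, p in model.named_parameters() if p.requires_grad])
```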
Thank you for the clarification!
Hello,
Really awesome work!
I noticed in the code that the visual tokenizers are loaded differently for generation and understanding tasks. What are the differences between them? Is it that the tokenizer for understanding lacks the quantization step? Are they trained differently?
Thanks