Closed — haibo-qiu closed this issue 4 months ago
Yes, we use continuous features only for understanding tasks (text generation). For all image generation, the prompt (including both image and text) is tokenized into discrete tokens. You can refer to the code at https://github.com/jy0205/LaVIT/blob/228e39196bbbc2ee7eab57c545cdb9a455d21768/LaVIT/models/lavit_for_generation.py#L518 for more details.
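To make the routing described above concrete, here is a minimal sketch (not LaVIT's actual API — the function name, shapes, and nearest-neighbor quantizer are all illustrative assumptions): understanding tasks keep the continuous visual features, while generation tasks quantize them into discrete token ids that can be concatenated with text token ids.

```python
import numpy as np


def encode_image(feats: np.ndarray, codebook: np.ndarray, task: str):
    """Hypothetical helper: route visual features by task type.

    `feats` stands in for continuous patch features, shape (num_patches, dim);
    `codebook` is a (codebook_size, dim) array of quantizer embeddings.
    """
    if task == "understanding":
        # Text generation conditions on the continuous features directly.
        return feats
    if task == "generation":
        # Image generation: nearest-neighbor quantize each patch feature
        # against the codebook and keep only the discrete token ids.
        dists = ((feats[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
        return dists.argmin(axis=1)
    raise ValueError(f"unknown task: {task}")
```

For generation, the resulting ids would then be concatenated with the text token ids into a single discrete prompt sequence, which matches the answer above that the whole multi-modal prompt is discrete in that case.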
I really appreciate your response — it was a great help to me.
Hi @jy0205,
Thank you for your excellent work! I would appreciate it if you could provide more details about image generation with multi-modal prompts. I noticed that you mentioned in Appendix A:
So if the prompt includes both image and text, do you tokenize them all into discrete tokens? I am a bit confused because in Section 3.2 you stated:
Therefore, when an image is included in the prompt, do you use the continuous features?
Or do you choose between continuous and discrete features depending on the task — for instance, continuous features for understanding (text generation) and discrete tokens for image generation?
Thanks again!