Open tulvgengenr opened 1 month ago
Hi, we did not try that.
Hello, something about the multimodal sequence input in MMU seems strange to me. For embedding input, the sequence is [system embedding, image embedding, question embedding]. However, for token input, the sequence is [question token, image token]. Does the input order not matter?
Hi, for continuous CLIP-ViT features, we follow LLaVA's processing. In our experiments, the order does not seem to matter much.
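For readers following along, here is a minimal sketch (not the repo's actual code; all names are illustrative placeholders) contrasting the two orderings described in the question, i.e. the continuous-embedding path versus the discrete-token path:

```python
# Illustrative only: shows the two sequence layouts discussed above.
# Real implementations concatenate tensors, not Python lists.

def build_embedding_input(system_emb, image_emb, question_emb):
    # Continuous-feature path (LLaVA-style): [system, image, question]
    return system_emb + image_emb + question_emb

def build_token_input(question_tokens, image_tokens):
    # Discrete-token path: [question, image]
    return question_tokens + image_tokens

seq_emb = build_embedding_input(["<sys>"], ["<img>", "<img>"], ["<q>", "<q>"])
seq_tok = build_token_input([101, 102, 103], [5001, 5002])
print(seq_emb)  # ['<sys>', '<img>', '<img>', '<q>', '<q>']
print(seq_tok)  # [101, 102, 103, 5001, 5002]
```

The point of the authors' reply is that, empirically, swapping the relative positions of the image and question segments did not change results much.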
This result is quite interesting. I'd like to know which input order was used during training.
Hello, I am very interested in your great work. I see in the code that the image-generation input sequence is basically text tokens followed by image tokens. What happens if the order is reversed when generating the image?