showlab / Show-o

Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.
https://arxiv.org/abs/2408.12528
Apache License 2.0
1.03k stars · 44 forks

About multimodal sequence input #38

Open tulvgengenr opened 1 month ago

tulvgengenr commented 1 month ago

Hello, I am very interested in your great work. I see in the code that the image-generation input sequence basically places text tokens before image tokens. What about reversing the order when generating the image?
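For concreteness, the two layouts being compared can be sketched as below. This is only an illustrative sketch, not Show-o's actual implementation: the token ids, the `TEXT`/`IMAGE` names, and the mask id `100` are all placeholders.

```python
# Hedged sketch of the text-to-image input ordering discussed above.
# All token ids are placeholders, not Show-o's real vocabulary.
TEXT = [5, 6, 7]    # prompt text token ids (placeholder)
IMAGE = [100] * 4   # masked image token ids to be predicted (placeholder)

# Ordering used in the code: text tokens before image tokens.
default_order = TEXT + IMAGE

# The reversed ordering this question asks about.
reversed_order = IMAGE + TEXT

print(default_order)   # [5, 6, 7, 100, 100, 100, 100]
print(reversed_order)  # [100, 100, 100, 100, 5, 6, 7]
```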

Sierkinhane commented 1 month ago

Hi, we did not try that.

zc1023 commented 1 month ago

Hello, something about the multimodal sequence input in MMU seems strange to me. For embedding input, the sequence is [system embedding, image embedding, question embedding]. However, for token input, the sequence is [question token, image token]. Does the input order not matter?
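The two MMU orderings described here can be sketched as follows. This is a hypothetical illustration only: the ids and the `SYS`/`IMG`/`Q` names are placeholders standing in for the system prompt, image, and question segments, not Show-o's actual API.

```python
# Placeholder segment ids for the two MMU input paths discussed above.
SYS = [1, 2]         # system prompt ids (placeholder)
IMG = [10, 11, 12]   # image ids (placeholder)
Q = [20, 21]         # question ids (placeholder)

# Continuous CLIP-ViT embedding path (LLaVA-style processing):
# [system embedding, image embedding, question embedding]
embedding_order = SYS + IMG + Q

# Discrete token path: [question token, image token]
token_order = Q + IMG

print(embedding_order)  # [1, 2, 10, 11, 12, 20, 21]
print(token_order)      # [20, 21, 10, 11, 12]
```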

Sierkinhane commented 1 month ago

Hi, for continuous CLIP-ViT features, we follow LLaVA's processing. In our experiments, it seems that the order does not matter much.

zc1023 commented 1 month ago

> Hi, for continuous CLIP-ViT features, we follow LLaVA's processing. In our experiments, it seems that the order does not matter much.

This result is quite interesting. I'd like to know which input order is used during training.