Generation inference with interleaved input

showlab / Show-o

Repository for Show-o, One Single Transformer to Unify Multimodal Understanding and Generation.

https://arxiv.org/abs/2408.12528

Apache License 2.0


Generation inference with interleaved input #35

Open ys-zong opened 2 weeks ago

ys-zong commented 2 weeks ago

Hi, thanks for the nice work! Does Show-o support inference with interleaved multimodal inputs, e.g., [text 1] [image 1] [text 2] [image 2] [text 3] -> generate a new image? If so, could you provide a code snippet for this? As far as I can tell, the current inference code only accepts a single image or a single image-text pair as input. Many thanks!

KebinWu commented 2 weeks ago

I'm not sure whether the code supports this, but even if it does, I wouldn't expect the model to perform well on such tasks, since interleaved samples were not used in training.

Sierkinhane commented 2 weeks ago

Hi, mixed-modality generation will be released in the future, but the timeline has not been determined yet.