Open ys-zong opened 2 weeks ago

Hi, thanks for the nice work! I wonder if Show-o supports inference with interleaved multimodal inputs, e.g., [text 1] [image 1] [text 2] [image 2] [text 3] -> generate a new image. If so, could you provide a code snippet for this? The current inference code seems to accept only a single image or a single image-text pair. Many thanks!

I'm not sure whether the code supports this, but I wouldn't expect the model to perform well on such tasks, since interleaved samples are not used during training.

Hi, mixed-modality generation will be released in the future, but the timeline is still undetermined.
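For reference, here is a minimal sketch of how an interleaved prompt like the one above is typically flattened into a single token sequence before generation. This is not Show-o's actual API; all token IDs and helper names below are hypothetical, purely to illustrate the idea of wrapping image-token spans in sentinel tokens.

```python
# Hypothetical sketch (NOT Show-o's actual API): flatten an interleaved
# multimodal prompt into one token sequence. All token IDs and names
# below are made up for illustration.

BOI, EOI = 50001, 50002  # hypothetical begin/end-of-image sentinel tokens

def build_interleaved_sequence(segments):
    """Flatten ('text', ids) / ('image', ids) segments into one list,
    wrapping each image-token span in the sentinel tokens."""
    sequence = []
    for kind, token_ids in segments:
        if kind == "image":
            sequence.append(BOI)
            sequence.extend(token_ids)
            sequence.append(EOI)
        elif kind == "text":
            sequence.extend(token_ids)
        else:
            raise ValueError(f"unknown segment kind: {kind}")
    return sequence

# [text 1] [image 1] [text 2] -> one flat sequence
seq = build_interleaved_sequence([
    ("text", [10, 11]),
    ("image", [901, 902, 903]),
    ("text", [12]),
])
# seq == [10, 11, 50001, 901, 902, 903, 50002, 12]
```

The flattened sequence would then be fed to the model's decoding loop; as noted in the replies, the released model was not trained on such interleaved samples, so results may be poor even if the plumbing works.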