[Feature Request] Support for input with interleaved text and images

ghrua commented 1 year ago

[x] I have searched to see if a similar issue already exists.

Is your feature request related to a problem? Please describe.

In the past two months, many new multimodal models have been released [1-4]. An important feature of these works is that they allow users to freely interleave images and text in the input. However, Gradio currently requires users to put each image into a fixed image box, which is very inflexible. For example:

- Users may include one, two, or many images in the input. Gradio currently requires a fixed number of input boxes to be laid out on the page in advance, which leads to a lot of redundancy in the page layout.
- The position of each image within the user's input is important information, but Gradio currently loses it.
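
To make the second point concrete, here is a minimal sketch of what "keeping the position" could look like on the data side. This is not an existing Gradio API: `TextPart`, `ImagePart`, `to_prompt`, the `<image>` placeholder, and the file names are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    path: str  # hypothetical path or URL of an uploaded image


Part = Union[TextPart, ImagePart]


def to_prompt(parts: List[Part], image_token: str = "<image>") -> str:
    """Flatten an interleaved message into one string, putting a
    placeholder token at the position where each image appeared."""
    pieces = []
    for part in parts:
        if isinstance(part, ImagePart):
            pieces.append(image_token)
        else:
            pieces.append(part.text)
    return " ".join(pieces)


def image_paths(parts: List[Part]) -> List[str]:
    """Return the image paths in the order they appeared in the input."""
    return [p.path for p in parts if isinstance(p, ImagePart)]


# One user turn with two images embedded in the text (assumed file names).
message: List[Part] = [
    TextPart("Compare this photo"),
    ImagePart("cat.png"),
    TextPart("with this one"),
    ImagePart("dog.png"),
    TextPart("and describe the differences."),
]

print(to_prompt(message))
print(image_paths(message))
```

With an ordered list of parts like this, the number of images is not fixed in advance, and a model that expects interleaved input can still see where each image was placed.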

Describe the solution you'd like

In our TextBind work, we have implemented a chat tool for image-text interaction that feels more like a natural conversation: https://ailabnlp.tencent.com/research_demos/textbind/. Since we are not professional web developers, this demo is not very robust, but some examples of its use already show the flexibility that natural interaction brings: https://textbind.github.io/

We think that multimodal LLMs that accept inputs with interleaved text and images will become a standard in the future. Therefore, we sincerely hope that the Gradio team can consider this feature.

Additional context

Below is a list of related works (I may have missed some, so please feel free to add your work to this thread):

[1] TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild, https://arxiv.org/abs/2309.08637
[2] NExT-GPT: Any-to-Any Multimodal LLM, https://arxiv.org/abs/2309.05519
[3] DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention, https://arxiv.org/abs/2309.14327
[4] MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens, https://arxiv.org/abs/2310.02239