gradio-app / gradio

Build and share delightful machine learning apps, all in Python. 🌟 Star to support our work!
http://www.gradio.app
Apache License 2.0

[Feature Request] Support for input with interleaved text and images #6028

Closed ghrua closed 1 year ago

ghrua commented 1 year ago

Is your feature request related to a problem? Please describe.

In the past two months, many new works have emerged in the multimodal direction [1-4]. A key feature of these works is that they let users freely interleave images and text in their input. However, Gradio's current interaction model asks users to put each image into a fixed image box, which is very inflexible. For example:

  1. Users may include one, two, or more images in their input. Gradio currently requires a fixed number of image boxes to be laid out on the page in advance, which makes the page layout highly redundant.
  2. The position of each image within the user's input carries important information, but current Gradio components discard it.
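
The second point can be made concrete with a minimal sketch, assuming a backend that receives one user turn as an ordered list of segments rather than a fixed set of boxes. The `Segment` type, the `to_prompt` helper, and the `<image>` placeholder token are hypothetical names chosen for illustration; they are not part of Gradio or of any specific model's API.

```python
from dataclasses import dataclass
from typing import List, Literal, Tuple


@dataclass(frozen=True)
class Segment:
    """One piece of a user turn (hypothetical type): either text or an image path."""
    kind: Literal["text", "image"]
    value: str  # the text itself, or the path of an uploaded image


def to_prompt(segments: List[Segment], image_token: str = "<image>") -> Tuple[str, List[str]]:
    """Flatten ordered segments into a prompt string plus image paths in order.

    Each image is replaced by `image_token` at the exact spot where it appeared,
    so the i-th token in the prompt corresponds to the i-th returned path.
    """
    parts: List[str] = []
    images: List[str] = []
    for seg in segments:
        if seg.kind == "text":
            parts.append(seg.value)
        elif seg.kind == "image":
            parts.append(image_token)
            images.append(seg.value)
        else:
            # Literal is not enforced at runtime, so reject unknown kinds explicitly.
            raise ValueError(f"unknown segment kind: {seg.kind!r}")
    return " ".join(parts), images


# Usage: one turn with two images placed mid-sentence.
turn = [
    Segment("text", "What changed between"),
    Segment("image", "before.png"),
    Segment("text", "and"),
    Segment("image", "after.png"),
    Segment("text", "?"),
]
prompt, images = to_prompt(turn)
# prompt → "What changed between <image> and <image> ?"
# images → ["before.png", "after.png"]
```

Because the segments keep their order, each placeholder in the prompt lines up with its image. A layout with two fixed image boxes and one textbox would deliver the same two files, but with no record of where they sat in the sentence.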

Describe the solution you'd like

In our TextBind work, we implemented an image-text chat tool whose interaction feels more like a natural conversation: https://ailabnlp.tencent.com/research_demos/textbind/. Since we are not professional web developers, the demo is not very robust, but some examples of its use already show the flexibility that natural interaction brings: https://textbind.github.io/

We believe that multimodal LLMs accepting inputs with interleaved text and images will become standard in the future. We therefore sincerely hope the Gradio team will consider this feature.

Additional context

Below is a list of related works (I may have missed some, so please feel free to add your own work to this thread):
 
[1] TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild, https://arxiv.org/abs/2309.08637
[2] NExT-GPT: Any-to-Any Multimodal LLM, https://arxiv.org/abs/2309.05519
[3] DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention, https://arxiv.org/abs/2309.14327
[4] MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens, https://arxiv.org/abs/2310.02239

abidlabs commented 1 year ago

Very nice component request @ghrua! We're working on making it possible for Gradio users to create their own custom components -- meaning that you'll be able to take an existing Gradio component and clone it, modify the backend or the frontend, and use it in your Gradio apps. What you're describing would be a great candidate for a custom component. If you're interested, I can share with you the current instructions for making a custom component and you could give it a shot. What do you think?

ghrua commented 1 year ago

Hi @abidlabs, thanks for your prompt reply! Your suggestion sounds very nice to me and I'd like to try to make a custom component! 😊

abidlabs commented 1 year ago

OK, I'll share something in the next couple of days!

abidlabs commented 1 year ago

Here are the instructions we put together: https://github.com/gradio-app/gradio/wiki/%F0%9F%8E%A8-How-to-Make-a-Gradio-Custom-Component

Please let me know if you have any questions, happy to help!

ghrua commented 1 year ago

@abidlabs Thanks for the information! Let me give it a try :)

abidlabs commented 1 year ago

Hi @ghrua! We've now made it official: all Gradio users can create their own custom components. This means you can write some Python and JavaScript (Svelte) and publish it as a Gradio component. You can use it in your own Gradio apps, or share it so that anyone can use it in theirs. Here are some examples of custom Gradio components:

You can see the source code for those components by clicking the "Files" icon and then clicking "src". The complete source code for the backend and frontend is visible. We've put together an improved version of the Guide I sent before: https://www.gradio.app/guides/five-minute-guide, and we're happy to help. Let us know if you have any questions!
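For the interleaved-input use case, the Python side of such a component would likely spend most of its effort translating between what the frontend sends or renders and what the user's function consumes or returns. A minimal, hypothetical sketch of one such step: turning a model reply that carries inline image tags into an ordered list of text and image segments that a frontend could render in place. The `<img>...</img>` markup and the `split_interleaved` name are assumptions for illustration, not a Gradio API or any model's actual output format; only the standard library is used.

```python
import re
from typing import List, Tuple

# Hypothetical markup: image paths are wrapped in <img>...</img> inside the reply text.
IMG_TAG = re.compile(r"<img>(.*?)</img>", re.DOTALL)


def split_interleaved(output: str) -> List[Tuple[str, str]]:
    """Split a string with inline image tags into ordered ("text" | "image", value) pairs.

    Text between tags is stripped of surrounding whitespace; empty text runs are dropped,
    so the result preserves exactly where each image sat relative to the text.
    """
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in IMG_TAG.finditer(output):
        text = output[pos:match.start()].strip()
        if text:
            segments.append(("text", text))
        segments.append(("image", match.group(1).strip()))
        pos = match.end()
    tail = output[pos:].strip()  # any text after the last image
    if tail:
        segments.append(("text", tail))
    return segments


# Usage:
# split_interleaved("A cat: <img>cat.png</img> and a dog: <img>dog.png</img>")
# → [("text", "A cat:"), ("image", "cat.png"), ("text", "and a dog:"), ("image", "dog.png")]
```

Returning a flat, ordered list keeps the frontend simple: it only has to walk the list and render a text span or an image for each entry.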

abidlabs commented 1 year ago

I'll go ahead and close this issue, since we are not planning to include this feature in the core Gradio library. But we're happy to help if you're interested in making it a custom Gradio component (feel free to ask questions in this issue).