Very nice component request @ghrua! We're working on making it possible for Gradio users to create their own custom components -- meaning that you'll be able to take an existing Gradio component and clone it, modify the backend or the frontend, and use it in your Gradio apps. What you're describing would be a great candidate for a custom component. If you're interested, I can share with you the current instructions for making a custom component and you could give it a shot. What do you think?
Hi @abidlabs, thanks for your prompt reply! Your suggestion sounds very nice to me and I'd like to try to make a custom component! 😊
Ok I'll share something in the next couple of days!
Here are the instructions we put together: https://github.com/gradio-app/gradio/wiki/%F0%9F%8E%A8-How-to-Make-a-Gradio-Custom-Component
Please let me know if you have any questions, happy to help!
@abidlabs Thanks for your kind information! Let me have a try :)
Hi @ghrua! We've now made it official: all Gradio users can create their own custom components -- meaning that you can write some Python and JavaScript (Svelte) and publish it as a Gradio component. You can use it in your own Gradio apps, or share it so that anyone can use it in their Gradio apps. Here are some examples of custom Gradio components:
You can see the source code for those components by clicking the "Files" icon and then clicking "src". The complete source code for the backend and frontend is visible. We've put together an improved version of the Guide I sent before: https://www.gradio.app/guides/five-minute-guide, and we're happy to help. Let us know if you have any questions!
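To give a concrete picture: the backend half of a custom component is essentially a Python class with preprocess/postprocess hooks. The sketch below is illustrative only -- the `InterleavedChat` name and the segment format are placeholders, and the scaffolding described in the guide generates more than this (including the Svelte frontend):

```python
# Illustrative backend sketch for a custom component.
# The class name and segment format are placeholders; the real scaffolding
# also generates a Svelte frontend, event definitions, and other methods.
from gradio.components.base import Component

class InterleavedChat(Component):
    """A chat input whose value is an ordered list of segments, e.g.
    [{"type": "text", "text": "..."}, {"type": "image", "path": "..."}]."""

    def preprocess(self, payload):
        # Turn the frontend payload into the value your function receives.
        return payload

    def postprocess(self, value):
        # Turn your function's return value into what the frontend renders.
        return value
```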
I'll go ahead and close this issue since we are not planning to include this in the core Gradio library. But happy to help if you are interested in making this a custom Gradio component (feel free to ask questions in this issue).
Is your feature request related to a problem? Please describe.
In the past two months, many new works have emerged in the multimodal direction [1-4]. An important feature of these works is that they allow users to freely interleave images and text in the input. However, Gradio's current interaction method asks users to upload each image into a fixed image box, which is very inflexible. For example, a typical app looks like the sketch below (illustrative only):
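```python
# Illustrative sketch of the current fixed-slot interaction: the user must
# place exactly one image into a predetermined box, no matter how many
# images the conversation needs or where they belong in the text.
import gradio as gr

def chat(image, text):
    # The function only ever receives (one image, one text), in a fixed order.
    has_image = "an image" if image is not None else "no image"
    return f"Received text {text!r} and {has_image}."

demo = gr.Interface(
    fn=chat,
    inputs=[gr.Image(type="filepath"), gr.Textbox(label="Message")],
    outputs=gr.Textbox(label="Response"),
)

if __name__ == "__main__":
    demo.launch()
```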
Describe the solution you'd like
In our TextBind work, we have implemented a chat tool for image-text interaction that works more like a natural conversation: https://ailabnlp.tencent.com/research_demos/textbind/. Since we are not professional web developers, this demo is not very robust, but the examples at https://textbind.github.io/ already show the flexibility that natural interaction brings. A sketch of the message format we have in mind appears below.
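Purely as an illustration (the field names below are hypothetical, not an existing Gradio type), one possible data model for an interleaved user turn is an ordered list of text and image segments:

```python
# Hypothetical data model for one interleaved user turn: an ordered list of
# segments, each carrying either text or an image reference. Field names are
# illustrative only; this is not an existing Gradio type.
message = [
    {"type": "text",  "content": "Here are two photos from my trip."},
    {"type": "image", "content": "/path/to/beach.jpg"},
    {"type": "image", "content": "/path/to/mountain.jpg"},
    {"type": "text",  "content": "Which place should I visit again?"},
]

def render(message):
    """Walk the segments in order, preserving the interleaving."""
    for segment in message:
        if segment["type"] == "text":
            print(segment["content"])
        else:
            print(f"[image: {segment['content']}]")

render(message)
```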
We believe that multimodal LLMs that accept interleaved text-and-image inputs will become the standard in the future. Therefore, we sincerely hope that the Gradio team will consider this feature.

Additional context
Below is a list of related works (I may have missed some, so please feel free to add your work in this thread):

[1] TextBind: Multi-turn Interleaved Multimodal Instruction-following in the Wild. https://arxiv.org/abs/2309.08637
[2] NExT-GPT: Any-to-Any Multimodal LLM. https://arxiv.org/abs/2309.05519
[3] DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention. https://arxiv.org/abs/2309.14327
[4] MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens. https://arxiv.org/abs/2310.02239