tgaddair opened 1 year ago
Hi @tgaddair, I would like to contribute to this. Can you help me understand the problem and get started?
Hi @tgaddair, I would also like to work on this issue and contribute to the project!
Thanks @vishnu2411 and @yashlondhe90960 for helping out with this!
After looking at the HuggingFace implementation of IdeficsVisionTextToText, I think the changes to Ludwig shouldn't be too bad. The main things it looks like we'll need to do are breaking out some of the functionality of the IdeficsProcessor so that the text inputs and image inputs can be prepared independently during Ludwig's preprocessing phase, and then making sure we properly feed both text and image inputs into the model during training/inference.
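As a rough sketch of that split (assuming the standard HF ProcessorMixin layout, where the combined processor exposes its `tokenizer` and `image_processor` as attributes; the prompt text and image URL are arbitrary examples):

```python
from transformers import AutoProcessor
from PIL import Image
import requests

# The combined processor bundles a tokenizer and an image processor;
# Ludwig could use each half independently during preprocessing.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/tiny-random-idefics")
tokenizer = processor.tokenizer              # text-only half
image_processor = processor.image_processor  # image-only half

# Text can be tokenized on its own during Ludwig's preprocessing phase...
text_inputs = tokenizer("User: What is in this image?", return_tensors="pt")

# ...and images can be converted to pixel values separately.
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = image_processor([image])
```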
Specifically:
The same AutoTokenizer we use for any HF model should work for this one as well, so hopefully nothing needs to change for text preprocessing. However, it does look like the IdeficsProcessor inserts some special tokens into the prompt as placeholders for the images here. So the one bit of manipulation we may need to do is something similar during Ludwig's format_data_with_prompt function if we see that the config has one or more image input features (a sketch of this follows below).

For testing purposes, there is https://huggingface.co/HuggingFaceM4/tiny-random-idefics, which should allow us to run everything end-to-end on CPU without a lot of overhead.
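Here is that sketch: a hypothetical helper, not Ludwig's actual implementation, and the placeholder token strings are my reading of the IDEFICS code, so they should be verified against IdeficsProcessor:

```python
# Assumed IDEFICS placeholder tokens; verify against IdeficsProcessor before relying on them.
IMAGE_TOKEN = "<image>"
FAKE_TOKEN = "<fake_token_around_image>"


def format_prompt_with_image_placeholders(prompt: str, num_images: int) -> str:
    """Prepend one image placeholder per image input feature to the text prompt.

    Hypothetical helper sketching what format_data_with_prompt might do when
    the config has image input features.
    """
    placeholders = "".join(FAKE_TOKEN + IMAGE_TOKEN for _ in range(num_images))
    if num_images > 0:
        placeholders += FAKE_TOKEN  # close the final image span
    return placeholders + prompt


print(format_prompt_with_image_placeholders("User: What is in this image?", 1))
# <fake_token_around_image><image><fake_token_around_image>User: What is in this image?
```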
That probably sounds like a lot, so I'm happy to start a Slack channel, have a Zoom call, or collaborate on a PR to help get things going!
That was some good information, @tgaddair. I think we can go with a Slack channel to discuss this.
Great @vishnu2411! I think we can start simple here and try adding support for using the IdeficsVisionTextToText model with just string (public) URLs for a first version.
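As a reference point for that URL-only version, the upstream IdeficsProcessor already accepts prompts that interleave text with plain URL strings and fetches the images itself. A minimal sketch (the image URL is just an arbitrary public example):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/tiny-random-idefics")

# Prompts interleave text and image URLs; the processor downloads the images.
prompts = [
    [
        "User: What is in this image?",
        "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg",
        "\nAssistant:",
    ]
]
inputs = processor(prompts, return_tensors="pt")
```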
I created a Slack channel for us to collaborate here: #p-visual-language-models.
See you there!
Unable to join the channel, as it is asking for an email account with the @predibase.com domain.
IDEFICS: https://huggingface.co/blog/idefics
Example config:
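A hypothetical sketch of what such a config could look like once this lands; an image input feature on an LLM model type is exactly what this issue proposes, so the image-related fields are assumptions modeled on Ludwig's existing LLM config conventions:

```python
from ludwig.api import LudwigModel

# Hypothetical config: will not run until Idefics support is implemented.
config = {
    "model_type": "llm",
    "base_model": "HuggingFaceM4/tiny-random-idefics",
    "input_features": [
        {"name": "prompt", "type": "text"},
        {"name": "image_url", "type": "image"},  # public string URLs for a first version
    ],
    "output_features": [
        {"name": "answer", "type": "text"},
    ],
}

model = LudwigModel(config)
```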