ludwig-ai / ludwig

Low-code framework for building custom LLMs, neural networks, and other AI models
http://ludwig.ai
Apache License 2.0

Add support for visual language model fine-tuning (IDEFICS, etc.) #3656

Open tgaddair opened 1 year ago

tgaddair commented 1 year ago

IDEFICS: https://huggingface.co/blog/idefics

Example config:

input_features:
  - name: prompt
    type: text
  - name: img
    type: image
output_features:
  - name: response
    type: text
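
For context, here's a rough sketch of how a config like this might be exercised through Ludwig's Python API once visual language model support lands. The dataframe columns mirror the feature names above; the image URL and text values are purely illustrative.

import pandas as pd
from ludwig.api import LudwigModel

# Illustrative dataset: one row with a text prompt, a public image URL,
# and a target response, matching the feature names in the example config.
df = pd.DataFrame({
    "prompt": ["What is shown in this image?"],
    "img": ["https://example.com/cat.png"],
    "response": ["A cat sitting on a couch."],
})

config = {
    "input_features": [
        {"name": "prompt", "type": "text"},
        {"name": "img", "type": "image"},
    ],
    "output_features": [{"name": "response", "type": "text"}],
}

model = LudwigModel(config)
model.train(dataset=df)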
vishnu2411 commented 1 year ago

Hi @tgaddair, I would like to contribute to this. Can you help me understand the problem and get started?

yashlondhe90960 commented 1 year ago

Hi @tgaddair, I would also like to work on this issue and contribute to the project!

tgaddair commented 1 year ago

Thanks @vishnu2411 and @yashlondhe90960 for helping out with this!

After looking at the HuggingFace implementation of IdeficsForVisionText2Text, I think the changes to Ludwig shouldn't be too bad. The main things we'll need to do are to break out some of the functionality of the IdeficsProcessor so that the text and image inputs can be prepared independently during Ludwig's preprocessing phase, and then to make sure we properly feed both text and image inputs into the model during training / inference.

Specifically:

  1. The existing AutoTokenizer we use for any HF model should work for this one as well, so hopefully nothing needs to change for text preprocessing. However, the IdeficsProcessor does insert some special tokens into the prompt as placeholders for the images. So the one bit of manipulation we may need is to do something similar in Ludwig's format_data_with_prompt function when the config has one or more image input features (see the first sketch after this list).
  2. We will likely need a new Image Encoder for Idefics that looks very similar to the VitEncoder in Ludwig. The purpose of the encoder would be to resize / crop the input image to the size expected by Idefics. The forward function itself can be a no-op, since we just want to feed the raw pixel data into the main Idefics model (see the second sketch after this list).
  3. We'll need to refactor the LLM class in Ludwig to handle multiple inputs / VisionTextToText models in addition to CausalLM models. This can be some simple if-else statements for now (e.g., look at the AutoConfig; if the model is IdeficsForVisionText2Text, use one implementation, else use the other). A sketch of this dispatch follows the list.
  4. We'll need to relax the constraint in the Ludwig schema that LLMs take only one text input feature, and also add support for image input features for LLMs in the schema.
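
To make point 1 concrete, here's a minimal sketch of a helper that format_data_with_prompt could call to prepend image placeholders. The helper is hypothetical (not existing Ludwig code), and the exact token strings are an assumption based on the IDEFICS-style placeholder tokens.

IMAGE_TOKEN = "<image>"  # assumption: IDEFICS image placeholder token
FAKE_AROUND_IMAGE_TOKEN = "<fake_token_around_image>"  # assumption

def add_image_placeholders(prompt: str, num_images: int) -> str:
    """Prefix the prompt with one placeholder per image input feature."""
    if num_images <= 0:
        return prompt
    placeholders = (
        (FAKE_AROUND_IMAGE_TOKEN + IMAGE_TOKEN) * num_images
        + FAKE_AROUND_IMAGE_TOKEN
    )
    return placeholders + prompt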
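For point 2, a rough sketch of what such an encoder might look like. The class and method names are illustrative rather than Ludwig's actual encoder API: preprocessing resizes to the expected input size (224 is an assumed default), and the forward pass is a no-op.

import torch
import torchvision.transforms.functional as F

class IdeficsImageEncoderSketch(torch.nn.Module):
    """Illustrative encoder: resize images to the spatial size Idefics
    expects and pass raw pixel values through unchanged."""

    def __init__(self, image_size: int = 224):
        super().__init__()
        self.image_size = image_size

    def preprocess(self, image: torch.Tensor) -> torch.Tensor:
        # Resize to the expected size; a real implementation would also
        # mirror the normalization used by IdeficsImageProcessor.
        return F.resize(image, [self.image_size, self.image_size], antialias=True)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # No-op: the raw pixel data is consumed by the main Idefics model.
        return pixel_values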
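And for point 3, the if-else dispatch could look roughly like this; the wrapper function is hypothetical, while the transformers classes are the real ones.

from transformers import AutoConfig, AutoModelForCausalLM, IdeficsForVisionText2Text

def load_base_model(model_name: str):
    # Dispatch on the model type reported by AutoConfig.
    config = AutoConfig.from_pretrained(model_name)
    if config.model_type == "idefics":
        # Vision-text-to-text path: the model consumes input_ids and pixel_values.
        return IdeficsForVisionText2Text.from_pretrained(model_name)
    # Default causal LM path (current Ludwig behavior).
    return AutoModelForCausalLM.from_pretrained(model_name)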

For testing purposes, there is https://huggingface.co/HuggingFaceM4/tiny-random-idefics, which should allow us to run everything end-to-end on CPU without a lot of overhead.
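
As a minimal CPU smoke test (just a sketch), loading the tiny checkpoint directly should be enough to confirm the config and weights resolve end-to-end:

from transformers import AutoConfig, IdeficsForVisionText2Text

checkpoint = "HuggingFaceM4/tiny-random-idefics"
config = AutoConfig.from_pretrained(checkpoint)
print(config.model_type)  # expected: "idefics"

model = IdeficsForVisionText2Text.from_pretrained(checkpoint)
print(sum(p.numel() for p in model.parameters()), "parameters")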

That probably sounds like a lot, so happy to help start a Slack channel, have a Zoom call, or collaborate on a PR to help get things started!

vishnu2411 commented 1 year ago

That was some good information, @tgaddair. I think we can go with a Slack channel to discuss this.

tgaddair commented 1 year ago

Great, @vishnu2411! I think we can start simple here and try adding support for using the IdeficsForVisionText2Text model with just string (public) URLs for a first version.

I created a Slack channel for us to collaborate here: #p-visual-language-models.

See you there!

vishnu2411 commented 1 year ago

Unable to join the channel, as it is asking for an email account with the @predibase.com domain.