huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
133.19k stars, 26.6k forks

Major VLM tracker (standardize the API) #33948

Open zucchini-nlp opened 6 days ago

zucchini-nlp commented 6 days ago

Feature request

This issue tracks general plans for VLMs and composite models so that we can align with work in TGI and other libraries. I already have some trackers, so in this one I'll lay out the bigger picture with links to the respective discussions/topics.

Motivation

We already have pretty good working standards when it comes to language models, and when adding a new model, a few "copy from" statements will usually do the work. We also cover most cases for LMs in our test suite. But for the wave of multimodal models we still lack any form of standardization or uniform API. Each new model added to the library introduces something new, which forces us to accept it as is until we figure out how to handle it later.

So we need to try to standardize those models, starting with VLMs. VLMs are the most commonly added models currently, but we may have more audio+text or pure multimodal ones in the future. For now we start off by working on VLMs and see how things fit into the general API.

Your contribution

The major changes we are working on and planning to work on are:

zucchini-nlp commented 6 days ago

cc @ArthurZucker, here is the general plan I have. Let me know if something is missing or not very clear 😄

Feedback/ideas are welcome :D