Major VLM tracker (standardize the API)

Feature request

This issue tracks general plans for VLMs and composite models so that we can align with work in TGI and other libraries. I already have some trackers, so in this one I'll lay out the bigger picture with links to the respective discussions/topics.

Motivation

We already have pretty good working standards when it comes to language models, and when adding a new model, usually a few "copy from" statements will do the work. We also cover most cases for LMs in our test suite. But for the wave of multimodal models we still lack any form of standardization and a uniform API. Each new model added to the library introduces something new, which forces us to accept it as is until we figure out how to handle it later.

So we need to standardize those models, starting with VLMs. VLMs are the most commonly added models currently, but we may have more audio+text or pure multimodal ones in the future. For now we start by working on VLMs and see how things fit into the general API.

Your contribution

The major changes we are working on or planning to work on are:

Standardization for Processors:
- We have ongoing work on uniform processor kwargs, which will help us enable pipelines for VLMs and thus have the correct automodel tag on the Hub. The work is in progress by @yonigozlan and @molbap
- In parallel, I will work on separating out video models under a new class (VideoProcessor) and handling the whole deprecation cycle for the processing config files. In the end we should have a separate file and a separate class for video processing, with its params saved in its own config file. That will be tracked in https://github.com/huggingface/transformers/issues/33504, and there are discussions with Amy in the issue linked under that
Standardization in terms of modeling code:
- One major thing was to get rid of the buggy merge_embeds method and cover VLMs with more generation-related tests, as we were getting many issues after small changes. Slow tests unfortunately don't cover everything and are not run every time a PR is merged. That is being tracked in https://github.com/huggingface/transformers/issues/33374
- Another major topic is setting the attention implementation for composite models (not only VLMs), which will fix the red CI and add uniformity to how we work with composite models in general. After that PR we should require each composite model to have a separate PreTrainedConfig for each model backbone in its architecture. Each sub-config should be part of one major ModelConfig, which may hold attributes specific to the composite model only (not its sub-backbones). See https://github.com/huggingface/transformers/pull/32238
- Separate out a get_image_features method for all VLMs so we get more modularity and probably much cleaner code. This was proposed by one of the community contributors, and I'll handle propagating the change to all models. See https://github.com/huggingface/transformers/pull/33696
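To make the sub-config idea from the list above more concrete, here is a rough sketch in plain Python. It does not use the library: `BackboneConfig`, `CompositeVLMConfig`, `set_attn_implementation` and the default values are all hypothetical names and numbers picked for illustration, and the real design may well differ.

```python
from dataclasses import dataclass, field


@dataclass
class BackboneConfig:
    """Hypothetical stand-in for a per-backbone config (not a library class)."""
    hidden_size: int
    attn_implementation: str = "eager"


@dataclass
class CompositeVLMConfig:
    """Hypothetical top-level config: one sub-config per backbone, plus
    attributes that belong to the composite model only."""
    vision_config: BackboneConfig = field(
        default_factory=lambda: BackboneConfig(hidden_size=1024)
    )
    text_config: BackboneConfig = field(
        default_factory=lambda: BackboneConfig(hidden_size=4096)
    )
    image_token_index: int = 32000  # composite-only attribute (made-up value)

    def set_attn_implementation(self, impl):
        """Accept one string for all backbones, or a dict keyed by sub-config name."""
        sub_configs = {
            "vision_config": self.vision_config,
            "text_config": self.text_config,
        }
        if isinstance(impl, str):
            impl = {name: impl for name in sub_configs}
        for name, value in impl.items():
            if name not in sub_configs:
                raise KeyError(f"unknown sub-config: {name}")
            sub_configs[name].attn_implementation = value


config = CompositeVLMConfig()
# Only the text backbone is switched; the vision backbone keeps its default.
config.set_attn_implementation({"text_config": "sdpa"})
```

The point of the sketch is only that each backbone owns its own attention setting, while the top-level config keeps the attributes that make sense for the composite model alone.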
Standardization for chat templates:
- We can support (tokenize=True, return_tensors="pt") kwargs in the processor's apply_chat_template, so that the method returns already vectorized outputs. Similar to tokenizers, the main point is to feed in a chat history and get tensor inputs ready for generation/training. The only difference is that users will have to explicitly add an image file/URL or ImageInput so we can process it internally and turn it into pixel_values. Below is the general design. No work has started yet; I am planning to make a PR some time in October
```
# Chat history where each turn's content is a list of typed items.
messages = [
    {
        "role": "user",
        "content": [
            # Image referenced by URL
            {"type": "image", "image": {"url": "https://...."}},
            {"type": "text", "text": "What do you see here?"},
        ],
    },
    {
        "role": "assistant",
        "content": [
            {"type": "text", "text": "Stop sign [...]"},
        ],
    },
    {
        "role": "user",
        "content": [
            # Image referenced by local file path
            {"type": "image", "image": {"path": "my_image.png"}},
            {"type": "text", "text": "What color is the cat?"},
        ],
    },
]
```
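As a rough illustration of the "process it internally" step, the sketch below walks a chat history in the format above and collects the image references in conversation order, which is what would then need to be loaded and turned into pixel_values. `collect_image_sources` is a hypothetical helper written for this example only; it is not part of the library, and it does not load any images.

```python
def collect_image_sources(messages):
    """Return image references in conversation order as (kind, location) tuples."""
    sources = []
    for message in messages:
        for item in message["content"]:
            if item["type"] != "image":
                continue
            image = item["image"]
            if "url" in image:
                sources.append(("url", image["url"]))
            elif "path" in image:
                sources.append(("path", image["path"]))
            else:
                raise ValueError(f"unsupported image reference: {image!r}")
    return sources


# Same shape as the design above; the URL is left as the elided placeholder.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"url": "https://...."}},
            {"type": "text", "text": "What do you see here?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "Stop sign [...]"}],
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": {"path": "my_image.png"}},
            {"type": "text", "text": "What color is the cat?"},
        ],
    },
]

image_sources = collect_image_sources(messages)
```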
Standardization for tokenizers:
- We can have new special tokens added to the tokenizers if they are loaded from a VLM model repo. Currently I plan to add at least 3 new special tokens (image, boi and eoi), but given the wave of new models I might expand that list. I had a PR previously, but that was a very basic design (https://github.com/huggingface/transformers/pull/31967). I am currently working on making SpecialTokenMixin more flexible so that we can simply change the class attribute SPECIAL_TOKENS_ATTRIBUTES and everything else will work out of the box. This seems to me the easiest way to expand special tokens for multimodal cases without flooding simple language model tokenizers.
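A toy sketch of what "simply change the class attribute" could look like. The classes below are stand-ins written for this example, not the library's mixin; the attribute names `image_token`, `boi_token`, `eoi_token` and the token strings are assumptions.

```python
class SpecialTokensSketch:
    """Toy stand-in for a special-tokens mixin (not the library class)."""

    SPECIAL_TOKENS_ATTRIBUTES = ["bos_token", "eos_token", "pad_token"]

    def __init__(self, **kwargs):
        # Everything is driven by the class attribute, so subclasses only
        # need to extend the list.
        for name in self.SPECIAL_TOKENS_ATTRIBUTES:
            setattr(self, name, kwargs.get(name))

    @property
    def special_tokens_map(self):
        """Map of the special tokens that are actually set."""
        return {
            name: getattr(self, name)
            for name in self.SPECIAL_TOKENS_ATTRIBUTES
            if getattr(self, name) is not None
        }


class MultimodalSpecialTokensSketch(SpecialTokensSketch):
    # Hypothetical multimodal additions; plain LM tokenizers are unaffected.
    SPECIAL_TOKENS_ATTRIBUTES = SpecialTokensSketch.SPECIAL_TOKENS_ATTRIBUTES + [
        "image_token",
        "boi_token",
        "eoi_token",
    ]


lm_tokens = SpecialTokensSketch(eos_token="</s>", image_token="<image>")
vlm_tokens = MultimodalSpecialTokensSketch(eos_token="</s>", image_token="<image>")
```

In this sketch the base class silently ignores `image_token`, while the multimodal subclass picks it up without any other code changes.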
