haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Add Llava to Hugging face Transformers package? #304

Open RajeshRadha opened 1 year ago

RajeshRadha commented 1 year ago

Question

What would be involved in adding LLaVA to the Hugging Face Transformers package?

I already see InstructBLIP in there -- https://huggingface.co/models?other=instructblip -- and here is the full list of supported models: https://huggingface.co/docs/transformers/index
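For context, this is roughly what the existing InstructBLIP integration already gives users, and presumably what a LLaVA integration would look like from the user's side (a minimal sketch using the public Salesforce checkpoint; `example.jpg` is a placeholder, and fp16 / GPU placement is omitted for brevity):

```python
# Minimal sketch of the existing InstructBLIP integration in transformers.
# Runs on CPU in fp32 as written; add torch_dtype / device placement for speed.
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")

image = Image.open("example.jpg").convert("RGB")  # any local image
inputs = processor(images=image, text="What is shown in this image?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```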

Since LLaVA uses image encoders such as CLIP, EVA, or ViT and foundation models such as LLaMA, T5, or Vicuna, this would make it easy to plug in different models and see how well the connector module design works with each of them. Any thoughts on this?
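To illustrate the plug-and-play idea, here is a minimal sketch of a LLaVA-style model assembled from stock Hugging Face components. This is not the official LLaVA code; the class name, the example checkpoints, and the single-linear connector are illustrative assumptions.

```python
# Minimal sketch, not the official LLaVA implementation: a LLaVA-style model
# built from standard Hugging Face pieces. The vision tower and LLM are plain
# HF models, so they can be swapped freely; the connector ("mm projector")
# only needs to map between their hidden sizes. Checkpoint names are examples.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class LlavaStyleModel(nn.Module):
    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14",
                 llm_name="lmsys/vicuna-7b-v1.5"):
        super().__init__()
        self.vision_tower = CLIPVisionModel.from_pretrained(vision_name)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Connector: projects vision features into the LLM embedding space.
        self.mm_projector = nn.Linear(self.vision_tower.config.hidden_size,
                                      self.llm.config.hidden_size)

    def encode_images(self, pixel_values):
        # Patch features without the CLS token, projected to the LLM width.
        feats = self.vision_tower(pixel_values).last_hidden_state[:, 1:]
        return self.mm_projector(feats)

    def forward(self, pixel_values, input_ids, attention_mask=None):
        image_embeds = self.encode_images(pixel_values)           # (B, P, H)
        text_embeds = self.llm.get_input_embeddings()(input_ids)  # (B, T, H)
        # Simplified: image tokens are prepended; the real model splices them
        # in at an <image> placeholder position inside the prompt.
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        image_mask = attention_mask.new_ones(image_embeds.shape[:2])
        attention_mask = torch.cat([image_mask, attention_mask], dim=1)
        return self.llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```

With this structure, swapping in a different vision tower or decoder-only LLM only changes the constructor arguments and the projector dimensions (an encoder-decoder LLM such as T5 would need a different Auto class).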

haotian-liu commented 1 year ago

Great suggestion! From the v0 to the v1.0.0 code base, I have already refactored in this direction to support different vision encoders and LLMs, and the next step would definitely be to support swapping in any LLM / vision encoder via standard HF structures.
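To make "standard HF structures" concrete, one possible shape is the nested-config pattern that InstructBLIP already uses in Transformers, where a top-level config composes a vision config and a text config. The following is a rough sketch only; `LlavaStyleConfig` and its fields are hypothetical names, not an existing API.

```python
# Rough sketch: a hypothetical nested config, mirroring how InstructBlipConfig
# composes sub-configs in transformers. Names here are illustrative.
from transformers import CLIPVisionConfig, LlamaConfig, PretrainedConfig

class LlavaStyleConfig(PretrainedConfig):
    model_type = "llava-style"  # hypothetical model type

    def __init__(self, vision_config=None, text_config=None,
                 mm_projector_type="linear", **kwargs):
        super().__init__(**kwargs)
        # The sub-configs decide which vision encoder / LLM gets instantiated,
        # so swapping components only means changing these fields.
        self.vision_config = vision_config if vision_config is not None else CLIPVisionConfig()
        self.text_config = text_config if text_config is not None else LlamaConfig()
        self.mm_projector_type = mm_projector_type

# Example: pair a CLIP ViT-L/14 vision tower with a LLaMA-family text config.
config = LlavaStyleConfig(
    vision_config=CLIPVisionConfig.from_pretrained("openai/clip-vit-large-patch14"),
    text_config=LlamaConfig(),
)
```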

Currently I may have limited bandwidth to work on that, but I would definitely be happy to support and collaborate during the integration process!

Happy to chat more about the details if you are interested in working on the integration. Thanks!