haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Add Llava to Hugging face Transformers package? #304

Open RajeshRadha opened 1 year ago

RajeshRadha commented 1 year ago

Question

What would be involved in adding LLaVA to the Hugging Face Transformers package?

I already see InstructBLIP in there -- https://huggingface.co/models?other=instructblip -- and here is the full list of supported models: https://huggingface.co/docs/transformers/index
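For context, this is roughly what the existing InstructBLIP integration already gives users, and presumably what a LLaVA integration would look like from the user's side (a minimal sketch using the public Salesforce checkpoint; `example.jpg` is a placeholder, and fp16 / GPU placement is omitted for brevity):

```python
# Minimal sketch of the existing InstructBLIP integration in transformers.
# Runs on CPU in fp32 as written; add torch_dtype / device placement for speed.
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

processor = InstructBlipProcessor.from_pretrained("Salesforce/instructblip-vicuna-7b")
model = InstructBlipForConditionalGeneration.from_pretrained("Salesforce/instructblip-vicuna-7b")

image = Image.open("example.jpg").convert("RGB")  # any local image
inputs = processor(images=image, text="What is shown in this image?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```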

Since LLaVA uses image encoders such as CLIP, EVA, or ViT and foundation models such as LLaMA, T5, or Vicuna, this would make it easy to plug in different models and see how well the connector module design works with each of them. Any thoughts on this?
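To illustrate the plug-and-play idea, here is a minimal sketch of a LLaVA-style model assembled from stock Hugging Face components. This is not the official LLaVA code; the class name, the example checkpoints, and the single-linear connector are illustrative assumptions.

```python
# Minimal sketch, not the official LLaVA implementation: a LLaVA-style model
# built from standard Hugging Face pieces. The vision tower and LLM are plain
# HF models, so they can be swapped freely; the connector ("mm projector")
# only needs to map between their hidden sizes. Checkpoint names are examples.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel

class LlavaStyleModel(nn.Module):
    def __init__(self,
                 vision_name="openai/clip-vit-large-patch14",
                 llm_name="lmsys/vicuna-7b-v1.5"):
        super().__init__()
        self.vision_tower = CLIPVisionModel.from_pretrained(vision_name)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        # Connector: projects vision features into the LLM embedding space.
        self.mm_projector = nn.Linear(self.vision_tower.config.hidden_size,
                                      self.llm.config.hidden_size)

    def encode_images(self, pixel_values):
        # Patch features without the CLS token, projected to the LLM width.
        feats = self.vision_tower(pixel_values).last_hidden_state[:, 1:]
        return self.mm_projector(feats)

    def forward(self, pixel_values, input_ids, attention_mask=None):
        image_embeds = self.encode_images(pixel_values)           # (B, P, H)
        text_embeds = self.llm.get_input_embeddings()(input_ids)  # (B, T, H)
        # Simplified: image tokens are prepended; the real model splices them
        # in at an <image> placeholder position inside the prompt.
        inputs_embeds = torch.cat([image_embeds, text_embeds], dim=1)
        if attention_mask is None:
            attention_mask = torch.ones_like(input_ids)
        image_mask = attention_mask.new_ones(image_embeds.shape[:2])
        attention_mask = torch.cat([image_mask, attention_mask], dim=1)
        return self.llm(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
```

With this structure, swapping in a different vision tower or decoder-only LLM only changes the constructor arguments and the projector dimensions (an encoder-decoder LLM such as T5 would need a different Auto class).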

haotian-liu commented 1 year ago

Great suggestion! From the v0 to the v1.0.0 code base, I have already refactored in this direction to support different vision encoders and LLMs, and the next step would definitely be to support swapping in any LLM / vision encoder via standard HF structures.
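To make "standard HF structures" concrete, one possible shape is the nested-config pattern that InstructBLIP already uses in Transformers, where a top-level config composes a vision config and a text config. The following is a rough sketch only; `LlavaStyleConfig` and its fields are hypothetical names, not an existing API.

```python
# Rough sketch: a hypothetical nested config, mirroring how InstructBlipConfig
# composes sub-configs in transformers. Names here are illustrative.
from transformers import CLIPVisionConfig, LlamaConfig, PretrainedConfig

class LlavaStyleConfig(PretrainedConfig):
    model_type = "llava-style"  # hypothetical model type

    def __init__(self, vision_config=None, text_config=None,
                 mm_projector_type="linear", **kwargs):
        super().__init__(**kwargs)
        # The sub-configs decide which vision encoder / LLM gets instantiated,
        # so swapping components only means changing these fields.
        self.vision_config = vision_config if vision_config is not None else CLIPVisionConfig()
        self.text_config = text_config if text_config is not None else LlamaConfig()
        self.mm_projector_type = mm_projector_type

# Example: pair a CLIP ViT-L/14 vision tower with a LLaMA-family text config.
config = LlavaStyleConfig(
    vision_config=CLIPVisionConfig.from_pretrained("openai/clip-vit-large-patch14"),
    text_config=LlamaConfig(),
)
```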

Currently I may have limited bandwidth to work on that, but I would definitely be happy to support and collaborate during the integration process!

Happy to chat more about the details if you are interested in working on the integration. Thanks!