RajeshRadha opened 1 year ago
Great suggestion! Going from the v0 to the v1.0.0 codebase, I refactored in that direction to support different vision encoders and LLMs, and the next step would definitely be supporting swapping in any LLM / vision encoder via standard HF abstractions.
I currently have limited bandwidth to work on that myself, but I would definitely be happy to support and collaborate during the integration process!
Happy to chat more about the details if you are interested in working on the integration. Thanks!
Question
What is the work involved in adding Llava to Hugging Face transformers package?
I already see InstructBlip in there -- https://huggingface.co/models?other=instructblip and here is the other list of supported models: https://huggingface.co/docs/transformers/index
Since LLaVA uses image encoders like CLIP / EVA / ViT and foundation models like LLaMA / T5 / Vicuna, it should be easy to plug and play different models and see how well the connector module design works with them. Any thoughts on this?
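To illustrate why the plug-and-play idea is plausible: the piece that couples a vision encoder to an LLM in LLaVA-style models is a small projector that maps image features into the LLM's embedding space, so swapping either side mostly reduces to changing two hidden sizes. Below is a minimal, hypothetical sketch of that connector in PyTorch; the model names, hidden sizes, and the `Projector` class are illustrative assumptions, not the actual LLaVA or transformers implementation.

```python
import torch
import torch.nn as nn

class Projector(nn.Module):
    """Hypothetical connector: maps vision-encoder features into the
    LLM embedding space (a two-layer MLP, in the style of LLaVA-1.5)."""
    def __init__(self, vision_hidden_size: int, llm_hidden_size: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_hidden_size, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patch_tokens, vision_hidden_size)
        return self.net(image_features)

# Illustrative hidden sizes (assumed values for these model families):
VISION_ENCODERS = {"clip-vit-l": 1024, "eva-g": 1408}
LLMS = {"vicuna-7b": 4096, "flan-t5-xl": 2048}

# Swapping components is then just picking a (vision, LLM) pair:
proj = Projector(VISION_ENCODERS["clip-vit-l"], LLMS["vicuna-7b"])
feats = torch.randn(1, 576, 1024)  # e.g. 576 patch tokens from a ViT
out = proj(feats)
print(out.shape)  # torch.Size([1, 576, 4096])
```

The projected tokens would then be concatenated with the LLM's text embeddings; the open question for an HF integration is exposing exactly this kind of config-driven pairing through the standard `AutoModel` machinery.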