huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LLaVA model in transformers #25060

Closed: RajeshRadha closed this issue 6 months ago

RajeshRadha commented 1 year ago

Feature request

Support for the LLaVA model in transformers? https://github.com/haotian-liu/LLaVA It is similar to InstructBLIP, with a connector module between the image embeddings and the LLM.

Motivation

LLaVA performs really well on MLLM-related tasks, and having it in Hugging Face would make it easier for folks to compare InstructBLIP and LLaVA, since both mostly use the same image-encoder embeddings (EVA, ViT, or CLIP) and foundation models (T5, Vicuna, or Llama-2). Code maintenance and integration would also be easy.
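The connector mentioned above can be sketched roughly as follows: a small projection that maps vision-encoder patch embeddings into the LLM's embedding space, so the projected patches can be spliced into the text token sequence. This is only an illustration of the idea; the shapes, the two-layer MLP (as used in LLaVA 1.5; the original LLaVA used a single linear layer), and all names here are assumptions, not the exact implementation:

```python
import numpy as np

def mlp_projector(image_feats, w1, b1, w2, b2):
    """Project vision-encoder patch embeddings into the LLM's
    embedding space (illustrative two-layer MLP with a tanh-based
    GELU approximation between the layers)."""
    h = image_feats @ w1 + b1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

# Hypothetical shapes: 576 CLIP patch tokens of dim 1024 -> LLM hidden size 4096
rng = np.random.default_rng(0)
image_feats = rng.standard_normal((576, 1024))
w1, b1 = rng.standard_normal((1024, 4096)) * 0.01, np.zeros(4096)
w2, b2 = rng.standard_normal((4096, 4096)) * 0.01, np.zeros(4096)

visual_tokens = mlp_projector(image_feats, w1, b1, w2, b2)
print(visual_tokens.shape)  # (576, 4096): one "visual token" per image patch
```

The point of the sketch is that the connector is tiny compared to the encoder and the LLM; only its weights (and optionally the LLM) need training, which is what makes this family of models cheap to adapt.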

Your contribution

I can definitely help with a PR or tag along with folks at Hugging Face to make it happen.

ydshieh commented 1 year ago

Hi @RajeshRadha, thank you for the feature request.

As @ArthurZucker mentioned to me, the repo has reached 4K stars and 300 forks, so it seems quite popular.

I'll leave it to our core maintainers @amyeroberts and @sgugger to decide whether this qualifies the model to be in transformers, or whether we'd still prefer to have it on the Hub first.

amyeroberts commented 12 months ago

Given the popularity and performance of the model, I think it'd be a good addition to transformers :)

@RajeshRadha if you'd like to add the model, feel free to open a PR and tag @ArthurZucker and myself for review.

ArthurZucker commented 11 months ago

Just for reference, before the model got so popular, #22848 and #23849 were opened!

ZeguanXiao commented 11 months ago

Any update on this model? https://github.com/huggingface/transformers/pull/23849 is closed and inactive.

ArthurZucker commented 11 months ago

cc @rafaelpadilla and @amyeroberts if one of you has the bandwidth

amyeroberts commented 11 months ago

I won't have time unfortunately before I'm off :( If @rafaelpadilla or anyone in the community would like to add this model - it would be a great addition!

ArthurZucker commented 7 months ago

PR will be merged the coming week 😉

github-actions[bot] commented 7 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ArthurZucker commented 6 months ago

#27662 closes this
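For reference, once that PR landed, usage looks roughly like this. This is a sketch based on the `llava-hf/llava-1.5-7b-hf` checkpoint on the Hub; it needs network access, a GPU, and the model weights, and the exact prompt format and generation kwargs should be verified against the model card:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# LLaVA 1.5 checkpoints expect an <image> placeholder in the prompt
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```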

RonanKMcGovern commented 5 months ago

This is a great integration. As a further step, it would be great to have an API for multi-modal models.

I think it's unlikely that TGI (see here) or vLLM would integrate multi-modal models, as they're too different.

There is a (closed) PR on the LLaVA project that allows for a simple single-call API. Possibly building on that is a good way to go.

A key feature I see as valuable is continuous batching; that's what really allows devs to spin up a multi-modal endpoint for production.
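To make the continuous-batching point concrete, here is a toy scheduler sketch (all names hypothetical; real servers like TGI or vLLM do this around actual decode kernels). The key property is that when a short request finishes, its batch slot is refilled immediately from the queue instead of waiting for the whole batch to drain:

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """Toy continuous-batching scheduler.

    Each request is (id, n_tokens_to_generate). Every step generates
    one token per active sequence; finished sequences leave the batch
    immediately and queued requests join mid-flight.
    Returns the order in which requests finish.
    """
    queue = deque(requests)
    active = {}          # id -> tokens still to generate
    finish_order = []
    while queue or active:
        # Admit queued requests into any free batch slots
        while queue and len(active) < max_batch:
            rid, n_tokens = queue.popleft()
            active[rid] = n_tokens
        # One decode step: every active sequence emits one token
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]
                finish_order.append(rid)
    return finish_order

# Short request "B" finishes first, freeing its slot for "C" while "A" is still running
print(continuous_batching([("A", 5), ("B", 1), ("C", 2)]))  # ['B', 'C', 'A']
```

With static batching, "C" would have to wait for both "A" and "B" to finish; continuous batching keeps the batch full, which is what makes a production endpoint cost-effective under mixed-length traffic.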


younesbelkada commented 5 months ago

Thanks @RonanKMcGovern for your feedback! I think TGI could support multi-modal models, as they did in the past with IDEFICS if I am not mistaken. cc @OlivierDehaene

RonanKMcGovern commented 5 months ago

Thanks @younesbelkada, that makes sense intuitively. IDEFICS (Flamingo-style models) uses a single tokenizer whether the input is image or text (if I'm not mistaken), so that makes it easier to plug and play with TGI.

I see that as a pretty significant advantage. Without a good inference endpoint, LLaVA just isn't as useful, because devs can't use it well in production.

I need to read more on why LLaVA 1.6 is stronger than IDEFICS. I guess IDEFICS has the drawback that it had to be trained entirely from scratch.

Makes me wonder whether it would have been better to take an IDEFICS approach in making LLaVA.