huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Support `hidden_states` Input Besides `input_ids` for Multi-Modality Models #847

Closed · leng-yue closed this issue 3 months ago

leng-yue commented 11 months ago

Feature request

Extend the current API to accept `hidden_states` as an optional input parameter in addition to `input_ids`. This would support integration with multi-modality models such as LLaVA, Video-LLaMA, BLIP-2, etc.
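For context, the `transformers` library already supports this pattern at the model level via the `inputs_embeds` argument. A minimal sketch of that mechanism (GPT-2 is used purely for illustration; TGI does not currently expose this over HTTP):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("Hello", return_tensors="pt").input_ids
# Look up the token embeddings ourselves instead of letting the model do it.
embeds = model.get_input_embeddings()(ids)  # (1, seq_len, hidden_size)

# transformers accepts precomputed embeddings in place of token ids; the
# request here is to expose an equivalent input through TGI's API.
out = model.generate(
    inputs_embeds=embeds,
    max_new_tokens=8,
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0], skip_special_tokens=True))
```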

Motivation

Many modern models are designed to process more than one type of data (e.g., text and images) and require passing `hidden_states` to the language model directly. Allowing `hidden_states` as input would enable seamless integration of these models, enhancing the overall utility of TGI. This change aligns with the growing need for versatile, multi-modal solutions in the AI community.

Your contribution

I believe this would be a valuable addition to TGI. The proposed change is relatively minor but has the potential to greatly improve functionality. I'm eager to contribute and would like to open a PR to implement this feature.

Narsil commented 10 months ago

We're already adding multi-modal models, without requiring users to send `hidden_states` directly:

https://github.com/huggingface/text-generation-inference/pull/842

The API may change at any time; this is a work in progress.

leng-yue commented 10 months ago

Since image preprocessing typically costs far less than LLM generation, being able to specify `hidden_states` directly would make it easy to experiment with different encoders and projections. TGI would not need to include the vision (or audio) parts at all: we could compute these embeddings elsewhere and simply feed them into the standard TGI API (for example, for LLaMA 2).
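To make the proposed workflow concrete, here is a minimal sketch of the LLaVA-style pipeline described above, run entirely client-side. The model names and the (untrained) `projection` layer are illustrative assumptions, and the final local `generate` call stands in for the TGI request this issue asks to make possible:

```python
import torch
from PIL import Image
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    CLIPImageProcessor,
    CLIPVisionModel,
)

llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Hypothetical projection from the vision hidden size (1024) to the LLM
# hidden size (4096); in a real system this would be a trained adapter.
projection = torch.nn.Linear(vision.config.hidden_size, llm.config.hidden_size)

pixels = proc(images=Image.open("cat.png"), return_tensors="pt").pixel_values
img_embeds = projection(vision(pixels).last_hidden_state)  # (1, 257, 4096)

text_ids = tok("Describe this image:", return_tensors="pt").input_ids
txt_embeds = llm.get_input_embeddings()(text_ids)  # (1, seq_len, 4096)

# Prepend the projected image embeddings to the text embeddings and
# generate; this local `generate` call is what a `hidden_states` input
# to the TGI API would replace.
inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)
out = llm.generate(inputs_embeds=inputs_embeds, max_new_tokens=32)
print(tok.decode(out[0], skip_special_tokens=True))
```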

Narsil commented 10 months ago

> experimenting with different encoders and projections.

This is not the purpose of TGI. TGI targets production workloads (we actively run our own on it). We might add such a layer if it really proves interesting; however, that is not the case at the moment.

"Interesting" means we can cut end-user latency in half, including the round-trip to whatever external service is involved. If that's not the case, we're not doing it. @OlivierDehaene is not back yet, and this will probably require further discussion, but adding surface area to TGI at the server level is quite a burden. We won't do that lightly (especially when the payload is not simple JSON).
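A rough back-of-envelope calculation (my numbers, not from the thread) shows why embedding payloads are not simple JSON: for a 512-token prompt at a LLaMA-2-7B-sized hidden dimension of 4096 in float32:

```python
tokens, hidden_size, bytes_per_float32 = 512, 4096, 4
raw_bytes = tokens * hidden_size * bytes_per_float32
print(f"raw tensor: {raw_bytes / 1e6:.1f} MB")  # ~8.4 MB

# Serialized as JSON text (roughly 12 characters per float, e.g.
# "-0.12345678,"), the request body approaches ~25 MB uncompressed,
# versus a few hundred bytes for an image URL.
json_bytes = tokens * hidden_size * 12
print(f"as JSON text: ~{json_bytes / 1e6:.0f} MB")
```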

Currently for idefics, sending a URL is extremely simple from a user's perspective: they don't have to maintain any other infra, and given the network bandwidth available on the machines we use in prod, sending anything other than URLs would be a latency hit (not even considering that sending tensors over JSON is cumbersome at best).
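For reference, the URL-based flow is roughly the following against TGI's `/generate` endpoint (the server address and the exact idefics prompt format are assumptions here; the image is referenced inline with markdown-style `![](url)` syntax, so the client never serializes pixels or embeddings):

```python
import requests

payload = {
    # The image is passed by URL inside the prompt text itself.
    "inputs": "User:![](https://example.com/cat.png)What is in this image?\nAssistant:",
    "parameters": {"max_new_tokens": 64},
}
resp = requests.post("http://localhost:8080/generate", json=payload, timeout=60)
print(resp.json()["generated_text"])
```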

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.