Closed leng-yue closed 3 months ago
We're already adding multimodal models, without requiring users to send hidden_states directly:
https://github.com/huggingface/text-generation-inference/pull/842
The API may change at any time; this is a work in progress.
Since image preprocessing is typically cheaper than LLM generation, allowing hidden_states to be specified directly would be beneficial for experimenting with different encoders and projections. We wouldn't have to include the vision (or audio) parts in TGI. Instead, we could compute these embeddings somewhere else and simply feed them into the standard TGI API (for example, for LLaMA 2).
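To make the idea concrete, here is a minimal sketch of what "computing the embeddings somewhere else" could look like. It is only an illustration under assumptions: `project` and `splice` are hypothetical helper names, the sizes are toy values, and the random numbers stand in for real vision-encoder output and real text token embeddings. Nothing here is TGI code.

```python
import random


def project(features, weight):
    """Map encoder features (n x d_enc) to the LLM hidden size (n x d_llm).

    `weight` is a d_enc x d_llm matrix, standing in for a learned projection.
    """
    return [
        [sum(f * w for f, w in zip(row, col)) for col in zip(*weight)]
        for row in features
    ]


def splice(text_embeds, image_embeds, position):
    """Insert the image embeddings into the text sequence at `position`."""
    return text_embeds[:position] + image_embeds + text_embeds[position:]


rng = random.Random(0)
d_enc, d_llm = 8, 16  # toy sizes, far smaller than any real model

# Stand-in for the output of an external vision encoder: 4 patch features.
image_features = [[rng.random() for _ in range(d_enc)] for _ in range(4)]
# Stand-in for a learned projection layer.
weight = [[rng.gauss(0.0, 0.02) for _ in range(d_llm)] for _ in range(d_enc)]
# Stand-in for the embeddings of 5 text tokens.
text_embeds = [[0.0] * d_llm for _ in range(5)]

image_embeds = project(image_features, weight)
hidden_states = splice(text_embeds, image_embeds, position=1)
print(len(hidden_states), len(hidden_states[0]))
```

The resulting `hidden_states` sequence is what the proposal would send to the server in place of (or alongside) token ids.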
experimenting with different encoders and projections.
This is not the purpose of TGI. We try to maintain production workloads (we actively maintain our own with it). We might add some layer if it really proves interesting; however, this is not the case at the moment.
Interesting means: we can cut end-user latency in half, including the round-trip to whatever external service is involved. If that's not the case, we're not doing it. @OlivierDehaene is not back yet, and it will probably require further discussion, but adding surface to TGI at the server level is quite a burden. We won't add that lightly (especially when the burden is not simple JSON).
Currently, for idefics, sending a URL is extremely simple from a user's perspective: they don't have to maintain any other infra. The network bandwidth available on the machines we use for prod also means that sending anything other than URLs would be a latency hit (not even considering that sending tensors over JSON is cumbersome at best).
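A rough back-of-the-envelope sketch of the payload argument above. The numbers are assumptions chosen for illustration only (576 image tokens, a hidden size of 4096, and a made-up example.com URL), and `payload_sizes` is a hypothetical helper, not part of TGI.

```python
import json
import random
import struct


def payload_sizes(num_tokens, hidden_size, url, seed=0):
    """Estimate request sizes in bytes for a URL versus an embedding tensor."""
    rng = random.Random(seed)
    # One row of embeddings is enough to estimate the per-token cost.
    row = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
    json_row_bytes = len(json.dumps(row).encode("utf-8"))
    # "e" is the struct format code for IEEE 754 half precision (2 bytes).
    fp16_row_bytes = len(struct.pack(f"<{hidden_size}e", *row))
    return {
        "url_bytes": len(json.dumps({"image": url}).encode("utf-8")),
        "json_tensor_bytes": json_row_bytes * num_tokens,
        "fp16_tensor_bytes": fp16_row_bytes * num_tokens,
    }


sizes = payload_sizes(
    num_tokens=576,      # assumed number of image tokens
    hidden_size=4096,    # assumed LLM hidden size
    url="https://example.com/cat.png",  # placeholder URL
)
for name, size in sizes.items():
    print(f"{name}: {size:,}")
```

Under these assumptions the URL request is a few dozen bytes, while the same image expressed as embeddings is several megabytes even in packed half precision, and larger still as JSON text.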
Feature request
Extend the current API to accept hidden_states as an optional input parameter in addition to input_ids. This would support integration with multi-modality models such as LLaVA, Video-LLaMA, BLIP-2, etc.
Motivation
Many modern models are designed to process more than one type of data (e.g., text and images) and require inputting hidden_states directly. Allowing hidden_states as input would enable seamless integration of these models, enhancing the overall utility of TGI. This change aligns with the growing need for versatile, multi-modal solutions in the AI community.
Your contribution
I believe that this is a valuable addition to TGI. The proposed change is relatively minor but has the potential to greatly improve functionality. I'm eager to contribute and would like to open a PR to implement this feature.
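For reference, the request described under "Feature request" might look roughly like the sketch below. This is purely hypothetical: the `hidden_states` field does not exist in the TGI API, and `build_request` and its `hidden_size` check are assumptions made for illustration. Only `inputs` and `parameters.max_new_tokens` mirror fields of the existing generate request.

```python
import json


def build_request(prompt, hidden_states=None, hidden_size=None, max_new_tokens=20):
    """Build a JSON body with an optional, hypothetical `hidden_states` field."""
    body = {"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}}
    if hidden_states is not None:
        # Every row must match the model's hidden size, if one is given.
        if hidden_size is not None and any(
            len(row) != hidden_size for row in hidden_states
        ):
            raise ValueError(f"every row must have {hidden_size} values")
        body["hidden_states"] = hidden_states  # hypothetical, not a real TGI field
    return json.dumps(body)


request_body = build_request(
    "Describe the image.",
    hidden_states=[[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]],
    hidden_size=4,
)
print(request_body)
```

Keeping the field optional means existing text-only requests would be unaffected.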