When querying transformers, an <image> placeholder is used in the prompt and the images are passed alongside it as a separate input argument. This doesn't appear to be the case with TGI, which only takes a prompt string.
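For context, this is roughly what the transformers flow looks like, sketched here with llava-hf/llava-v1.6-mistral-7b-hf (one of the models mentioned below); the exact prompt template varies per model:

# Sketch of the transformers-style flow: the prompt carries a literal
# <image> placeholder and the PIL image is passed to the processor separately.
import requests
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000219578.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "[INST] <image>\nTell me about this image [/INST]"
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=20)
print(processor.decode(output[0], skip_special_tokens=True))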
With TGI, something like this:
curl https://yd64jhjr8ylu54-8080.proxy.runpod.net/generate \
-X POST \
-d '{"inputs": "User: ![](http://images.cocodataset.org/val2017/000000219578.jpg)Tell me about this image<end_of_utterance>\\nAssistant:","parameters":{"max_new_tokens":20}}' \
-H 'Content-Type: application/json'
works, although it fails when trying to do two images (the model ignores the second image):
curl https://yd64jhjr8ylu54-8080.proxy.runpod.net/generate \
-X POST \
-d '{"inputs": "User: ![](http://images.cocodataset.org/val2017/000000219578.jpg)Tell me about this image, and also about this second image: ![](http://images.cocodataset.org/val2017/000000039769.jpg)<end_of_utterance>\\nAssistant:","parameters":{"max_new_tokens":50}}' \
-H 'Content-Type: application/json'
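For reference, here are essentially the same two requests from Python with requests (the RunPod endpoint is the temporary one from the curl commands above, so substitute your own TGI URL):

# Sketch of the same two queries against TGI's /generate route. Prompts and
# endpoint are copied from the curl commands above; the escaped "\n" is
# written here as a real newline.
import requests

TGI_URL = "https://yd64jhjr8ylu54-8080.proxy.runpod.net/generate"

def generate(prompt: str, max_new_tokens: int) -> str:
    resp = requests.post(
        TGI_URL,
        json={"inputs": prompt, "parameters": {"max_new_tokens": max_new_tokens}},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["generated_text"]

one_image = (
    "User: ![](http://images.cocodataset.org/val2017/000000219578.jpg)"
    "Tell me about this image<end_of_utterance>\nAssistant:"
)
two_images = (
    "User: ![](http://images.cocodataset.org/val2017/000000219578.jpg)"
    "Tell me about this image, and also about this second image: "
    "![](http://images.cocodataset.org/val2017/000000039769.jpg)"
    "<end_of_utterance>\nAssistant:"
)

print(generate(one_image, 20))   # describes the first image
print(generate(two_images, 50))  # the second image gets ignored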
System Info
NA
Information
Tasks
Reproduction
It is unclear how to query TGI for multi-modal models.
The links to LLaVA Next and IDEFICS2 give 404:
https://huggingface.co/docs/text-generation-inference/HuggingFaceM4/idefics-9b-instruct
https://huggingface.co/docs/text-generation-inference/llava-hf/llava-v1.6-mistral-7b-hf
@Narsil @VictorSanh
Expected behavior
TGI should document how to query multi-modal models (ideally something analogous to the <image> placeholder plus separate image inputs that transformers uses), and a prompt containing two image URLs should produce a completion that takes both images into account rather than silently ignoring the second one. The documentation links above should also resolve.
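If it helps, this is the kind of call I would expect to work for the two-image case; it uses huggingface_hub's InferenceClient, which as far as I understand hits the same /generate route as the curl commands above:

# Sketch: the two-image prompt sent through huggingface_hub's InferenceClient.
# The expectation is a completion that describes both images, not just the first.
from huggingface_hub import InferenceClient

client = InferenceClient("https://yd64jhjr8ylu54-8080.proxy.runpod.net")

prompt = (
    "User: ![](http://images.cocodataset.org/val2017/000000219578.jpg)"
    "Tell me about this image, and also about this second image: "
    "![](http://images.cocodataset.org/val2017/000000039769.jpg)"
    "<end_of_utterance>\nAssistant:"
)

print(client.text_generation(prompt, max_new_tokens=50))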