Closed stikkireddy closed 5 months ago
I believe the image input protocol has indeed not been implemented yet! This is more than a documentation issue.
PR #3042, which introduced the LLaVA feature, does not appear to include support for the OpenAI-compatible server. Based on the documentation, it's feasible to extend the existing OpenAI-compatible server (see the Image Input tab in the following link) to support this feature without developing a dedicated server specifically for image inputs. However, it's important to note the distinctions between GPT-4V and LLaVA, particularly that LLaVA currently supports neither multiple image inputs nor the 'detail' parameter.
According to the OpenAI documentation:
GPT-4 with vision is currently available to all developers who have access to GPT-4 via the gpt-4-vision-preview model and the Chat Completions API which has been updated to support image inputs.
Example of uploading base64 encoded images
import base64
import requests
# OpenAI API Key
api_key = "YOUR_OPENAI_API_KEY"
# Function to encode the image
def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')
# Path to your image
image_path = "path_to_your_image.jpg"
# Getting the base64 string
base64_image = encode_image(image_path)
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}"
}
payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}"
                    }
                }
            ]
        }
    ],
    "max_tokens": 300
}
response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
print(response.json())
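For comparison, here is a rough sketch of what the same request might look like against a locally hosted OpenAI-compatible vLLM server once image input is supported. The endpoint, API key, and model name are assumptions for illustration, and only a single image is sent (with no 'detail' parameter) given LLaVA's current limitations:

import base64
import requests

# Assumed local vLLM OpenAI-compatible endpoint and model; adjust to your deployment.
VLLM_BASE_URL = "http://localhost:8000/v1"
MODEL = "llava-hf/llava-1.5-7b-hf"  # model name discussed elsewhere in this thread

# Encode the image exactly as in the OpenAI example above.
with open("path_to_your_image.jpg", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode("utf-8")

payload = {
    "model": MODEL,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                # LLaVA 1.5 handles a single image; the 'detail' field is omitted.
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"},
                },
            ],
        }
    ],
    "max_tokens": 300,
}

# A local server typically does not need a real API key; a dummy value is common.
headers = {"Content-Type": "application/json", "Authorization": "Bearer EMPTY"}

response = requests.post(f"{VLLM_BASE_URL}/chat/completions", headers=headers, json=payload)
print(response.json())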
Please inform me if anyone is already working on implementing this feature. If not, I'm willing to take on the task and aim to complete it by the end of April. (Hopefully)
Based on examples/llava_example.py, I have recently forked vllm-rocm to support image input by refactoring OpenAIServingChat. I have already verified that the model generates useful output when given OpenAI's quick start example.
Note: This change adds pillow as a dependency since it is used to read the image from bytes.
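Purely for illustration, a minimal sketch (not the actual code from the fork) of how a server might use pillow to turn the base64 data URL from the request into an image:

import base64
import io

from PIL import Image


def load_image_from_data_url(image_url: str) -> Image.Image:
    """Decode a 'data:image/...;base64,...' URL into a PIL image (illustrative only)."""
    # Split off the 'data:image/jpeg;base64,' prefix to get the raw payload.
    _, base64_data = image_url.split(",", 1)
    image_bytes = base64.b64decode(base64_data)
    # Pillow reads the image from an in-memory bytes buffer.
    return Image.open(io.BytesIO(image_bytes)).convert("RGB")


# Example with the payload format used by the Chat Completions API:
# image = load_image_from_data_url(f"data:image/jpeg;base64,{base64_image}")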
However, there is more work to be done:
- Right now you can only use llava-hf/llava-1.5-7b-hf, since vLLM has existing support for its LlavaForConditionalGeneration architecture. Unfortunately, their config does not provide a chat template, so you have to provide it via the command line (--chat-template examples/template_llava.jinja), which is quite inconvenient.
- The LlavaLlamaForCausalLM architecture, which is adopted by the original author (liuhaotian/llava-v1.5-7b), is not yet supported.
UPDATE: I have created a new branch on my fork (openai-vision-api) that consolidates my changes so far. The original upstream branch is now directly synced with upstream/upstream (discarding my previous commits) to be in line with the usual naming conventions.
Thankfully I only need LLaVA! @DarkLight1337 do you plan on pushing this back to vLLM along with the chat template?
I'll create a PR once more testing has been done.
It would be great if we could compile a list of models that work/don't work with my implementation of this API. Currently, I assume that at most one image is provided since it appears that this is also the case for vLLM internals. How difficult would it be to support multiple images (possibly of different sizes)?
Do there exist models that support multiple image inputs?
GPT-4's API supports multiple images, so I guess their model can already handle such input.
Looking at open source, I found that MMICL explicitly supports multiple images per text prompt. They use <imagej> as the token to represent the j-th image. To accommodate this, we may need to add a config option to specify how to insert image tokens into the text prompt. Currently, we use <image> * image_feature_size to represent each image; it would be more convenient to follow the original models, which only use a single <image> token per image, regardless of feature size.
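To make the two conventions concrete, here is an illustrative sketch; the helper functions are hypothetical, not vLLM code, and only image_feature_size comes from the discussion above:

# Hypothetical helpers contrasting the two placeholder conventions discussed above.
IMAGE_TOKEN = "<image>"


def build_prompt_repeated(text: str, image_feature_size: int) -> str:
    """Current convention: repeat <image> once per image feature."""
    return IMAGE_TOKEN * image_feature_size + "\n" + text


def build_prompt_single(text: str) -> str:
    """Alternative convention: a single <image> placeholder per image,
    expanded to the full feature length internally."""
    return IMAGE_TOKEN + "\n" + text


if __name__ == "__main__":
    # For LLaVA-1.5, the image feature size is 576 (24 x 24 patches).
    print(build_prompt_repeated("What's in this image?", image_feature_size=576)[:40], "...")
    print(build_prompt_single("What's in this image?"))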
I have opened a PR to support single-image input, with a POC using llava-hf/llava-1.5-7b-hf. Hopefully, this is enough to get the ball rolling.
We can deal with multi-image input further down the line.
NOTE: If you have previously checked out the upstream branch based on this issue, please note that my changes have been moved to the openai-vision-api branch; the upstream branch is now directly synced with upstream/upstream (discarding my previous commits) to be in line with the usual naming conventions.
FYI - this is WIP and we plan to have it in the next major release. See our plan here https://github.com/vllm-project/vllm/issues/4194#issuecomment-2126436729
Closing this as we merged https://github.com/vllm-project/vllm/pull/5237
📚 The doc issue
Hey vLLM team, it looks like there is added support for LLaVA 1.5, but there are no docs or examples on how to use it via the API server. Are there any reference examples for using LLaVA via the OpenAI SDK?
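For reference, a minimal sketch of what such a call through the OpenAI Python SDK might look like against a vLLM OpenAI-compatible server. The base URL, API key, model name, and image URL are placeholder assumptions, and the server is presumed to be started with a LLaVA model and a suitable chat template (e.g. --chat-template examples/template_llava.jinja, as mentioned earlier in the thread):

from openai import OpenAI

# Placeholder endpoint/credentials; a local vLLM server usually accepts any API key value.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",  # assumed model name, as used earlier in this thread
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    # A publicly reachable URL or a base64 data URL, following the
                    # Chat Completions image_url format shown earlier in the thread.
                    "image_url": {"url": "https://example.com/some_image.jpg"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)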
Suggest a potential alternative/fix
No response