[Doc/Feature]: Llava 1.5 in OpenAI compatible server #3873

Closed stikkireddy closed 5 months ago

stikkireddy commented 7 months ago

πŸ“š The doc issue

Hey vLLM team, it looks like support for LLaVA 1.5 has been added, but there are no docs or examples on how to use it via the API server. Are there any reference examples for using LLaVA via the OpenAI SDK?

Suggest a potential alternative/fix

No response

simon-mo commented 7 months ago

Indeed, I believe the image input protocol has not been implemented yet! This is more than a documentation issue.

alsichcan commented 7 months ago

PR #3042, which introduced the LLaVA feature, does not appear to include support for the OpenAI-compatible server. Based on the documentation, it's feasible to extend the existing OpenAI-compatible server (see the Image Input tab in the link below) to support this feature, without needing to develop a dedicated server specifically for image inputs. However, it's important to note the distinctions between GPT-4V and LLaVA, in particular that LLaVA currently supports neither multiple image inputs nor the 'detail' parameter.

According to the OpenAI documentation:

GPT-4 with vision is currently available to all developers who have access to GPT-4 via the gpt-4-vision-preview model and the Chat Completions API which has been updated to support image inputs.

Example of uploading base64-encoded images:

import base64
import requests

# OpenAI API Key
api_key = "YOUR_OPENAI_API_KEY"

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Path to your image
image_path = "path_to_your_image.jpg"

# Getting the base64 string
base64_image = encode_image(image_path)

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {api_key}"
}

payload = {
  "model": "gpt-4-vision-preview",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What’s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
          }
        }
      ]
    }
  ],
  "max_tokens": 300
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

print(response.json())

Please inform me if anyone is already working on implementing this feature. If not, I'm willing to take on the task and aim to complete it by the end of April. (Hopefully)

DarkLight1337 commented 7 months ago

Based on examples/llava_example.py, I have recently forked vllm-rocm to support image input by refactoring OpenAIServingChat. I have already verified that the model generates useful output when given OpenAI's quick start example.

Note: This change adds pillow as a dependency since it is used to read the image from bytes.
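For illustration only, here is a minimal sketch of how an OpenAI-style image_url value (either a base64 data URL or a plain HTTP(S) URL) could be decoded into a PIL image on the server side. The helper name and data-URL handling below are assumptions for the example, not vLLM's actual code:

import base64
import io

import requests
from PIL import Image


def load_image_from_url(url: str) -> Image.Image:
    """Hypothetical helper: turn an OpenAI-style image_url value into a PIL image.

    Handles both base64 data URLs and plain HTTP(S) URLs. This is a sketch,
    not the actual implementation in the fork.
    """
    if url.startswith("data:"):
        # Data URL, e.g. "data:image/jpeg;base64,<payload>"
        _, b64_payload = url.split(",", 1)
        raw = base64.b64decode(b64_payload)
    else:
        # Plain HTTP(S) URL
        raw = requests.get(url, timeout=10).content
    return Image.open(io.BytesIO(raw)).convert("RGB")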

However, there is more work to be done.

UPDATE: I have created a new branch on my fork (openai-vision-api) that consolidates my changes so far. The original upstream branch is now directly synced with upstream/upstream (discarding my previous commits) to be in line with the usual naming conventions.

stikkireddy commented 7 months ago

thankfully i only need llava πŸ˜„! @DarkLight1337 do you plan on pushing this back to vllm along with the chat template?

DarkLight1337 commented 7 months ago

> thankfully i only need llava πŸ˜„! @DarkLight1337 do you plan on pushing this back to vllm along with the chat template?

I'll create a PR once more testing has been done.

It would be great if we could compile a list of models that work/don't work with my implementation of this API. Currently, I assume that at most one image is provided since it appears that this is also the case for vLLM internals. How difficult would it be to support multiple images (possibly of different sizes)?

simon-mo commented 7 months ago

Do there exist models that support multiple image inputs?

DarkLight1337 commented 7 months ago

GPT-4's API supports multiple images, so I guess their model can already handle such input.

Looking at open-source models, I found that MMICL explicitly supports multiple images per text prompt. They use <imagej> as the token to represent the jth image. To accommodate this, we may need to add a config option to specify how image tokens are inserted into the text prompt. Currently, we use <image> * image_feature_size to represent each image; it would be more convenient to follow the original models, which use only a single <image> token per image, regardless of feature size.
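As a rough sketch of the current scheme (the function name and the 576-feature figure for LLaVA-1.5 below are illustrative assumptions):

IMAGE_TOKEN = "<image>"


def expand_image_tokens(prompt: str, image_feature_size: int) -> str:
    # Replace each single <image> placeholder with image_feature_size copies,
    # matching the "<image> * image_feature_size" representation mentioned above.
    return prompt.replace(IMAGE_TOKEN, IMAGE_TOKEN * image_feature_size)


# Example: assuming 576 image features per image, one placeholder becomes 576 tokens.
expanded = expand_image_tokens("USER: <image>\nWhat's in this image? ASSISTANT:", 576)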

DarkLight1337 commented 7 months ago

I have opened a PR to support single-image input, with a POC using llava-hf/llava-1.5-7b-hf. Hopefully, this is enough to get the ball rolling.

We can deal with multi-image input further down the line.

NOTE: If you have previously checked out upstream branch based on this issue, please note that my changes have been moved to the openai-vision-api branch; the upstream branch is now directly synced with upstream/upstream (discarding my previous commits) to be in line with the usual naming conventions.
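To give a sense of the client side, here is a rough sketch of what a request against a vLLM OpenAI-compatible server could look like once this lands, using the OpenAI Python SDK. The base URL, API key, model name, and image URL below are placeholder assumptions, not the final documented usage:

from openai import OpenAI

# Point the OpenAI Python SDK at a locally running vLLM OpenAI-compatible server.
# Base URL, API key, and model name here are illustrative assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llava-hf/llava-1.5-7b-hf",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                # Placeholder image URL; a base64 data URL would follow the same shape.
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)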

ywang96 commented 5 months ago

FYI - this is WIP and we plan to have it in the next major release. See our plan here https://github.com/vllm-project/vllm/issues/4194#issuecomment-2126436729

ywang96 commented 5 months ago

Closing this as we merged https://github.com/vllm-project/vllm/pull/5237