vllm-project/vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: Online video support for VLMs #9842

Closed. litianjian closed this issue 2 weeks ago.

litianjian commented 3 weeks ago

🚀 The feature, motivation and pitch

Online video support for VLMs

vLLM already supports a large number of multimodal visual models, some of which accept both image and video input, such as Qwen2-VL and LLaVA-OneVision. Following the existing implementation for images, this proposal adds support for video.

Referring to the visual interfaces of OpenAI (vision and video) and Google Gemini, the video interface should ideally support input from both video URLs and base64-encoded data.

Image

chat_response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s in this image?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image}"}},
        ],
    }],
)

Video

chat_response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s in this video?"},
            {"type": "video_url", "video_url": {"url": video_url}},
        ],
    }],
)
chat_response = client.chat.completions.create(
    model="llava",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What’s in this video?"},
            {"type": "video_url", "video_url": {"url": f"data:video/png;base64,{video_base64}"}},
        ],
    }],
)

The interface design mentioned above is a prototype. I have implemented this interface in my pipeline.
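For reference, a minimal sketch of how the server side might resolve a video_url content part into raw bytes, mirroring the existing image_url handling (the helper name and the use of requests are illustrative assumptions, not the actual vLLM implementation):

import base64

import requests  # illustrative choice of HTTP client


def load_video_bytes(url: str) -> bytes:
    """Resolve a video_url content part into raw video bytes."""
    if url.startswith("data:"):
        # RFC 2397 data URL, e.g. "data:video/mp4;base64,<payload>"
        _, payload = url.split(",", 1)
        return base64.b64decode(payload)
    # Otherwise treat it as a remote HTTP(S) URL
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.content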

Alternatives

No response

Additional context

No response


DarkLight1337 commented 3 weeks ago

This sounds reasonable. By the way, OpenAI has a different way of passing audio files (see their Audio API) where the file type inside the base64 URL is stored in format.

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                { 
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)

print(completion.choices[0].message)
litianjian commented 3 weeks ago

This sounds reasonable. By the way, OpenAI has a different way of passing audio files (see their Audio API) where the file type inside the base64 URL is stored in format.

This design is indeed clearer. BTW, does the Vision API need the file type when using a base64 URL?

DarkLight1337 commented 3 weeks ago

This design is indeed clearer. BTW, does the Vision API need the file type when using a base64 URL?

Yes, the Vision API expects the full base64 data URL (with the media type embedded in the URL itself), whereas the Audio API expects the file type and the base64 content as separate fields.
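To illustrate the two conventions side by side (hypothetical helper functions, not part of either SDK):

import base64


def parse_vision_style(data_url: str) -> tuple[str, bytes]:
    # Vision API style: the media type travels inside the data URL,
    # e.g. "data:image/png;base64,<payload>"
    header, payload = data_url.split(",", 1)
    media_type = header.removeprefix("data:").removesuffix(";base64")
    return media_type, base64.b64decode(payload)


def parse_audio_style(part: dict) -> tuple[str, bytes]:
    # Audio API style: the format is a separate field next to the plain
    # base64 payload, e.g. {"data": "<base64>", "format": "wav"}
    return part["format"], base64.b64decode(part["data"])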

DarkLight1337 commented 3 weeks ago

Let's follow the Vision API for now. In the future, we can support both styles for all modalities.

alex-jw-brooks commented 3 weeks ago

I agree with the API proposed here. In your implementation, is the video parsing from a URL done in memory? AFAIK OpenCV doesn't have an interface for encoding/decoding video from memory, so I was wondering how it's done and whether it pulls in other dependencies.
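For what it's worth, one workaround if OpenCV stays the decoder is to spill the fetched bytes to a temporary file, since cv2.VideoCapture only accepts a path or URL; libraries like PyAV can open a file-like object directly instead. A rough sketch of the temp-file route (the helper name and frame sampling are just illustrative):

import tempfile

import cv2
import numpy as np


def decode_video_frames(video_bytes: bytes, num_frames: int = 8) -> list[np.ndarray]:
    """Decode raw video bytes into RGB frames via a temporary file."""
    # The suffix is an assumption; the real container type would come from
    # the MIME type of the data URL or the HTTP response.
    with tempfile.NamedTemporaryFile(suffix=".mp4") as tmp:
        tmp.write(video_bytes)
        tmp.flush()
        cap = cv2.VideoCapture(tmp.name)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        cap.release()
    # Uniformly subsample so the number of frames fed to the model stays bounded
    if len(frames) > num_frames:
        idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
        frames = [frames[i] for i in idx]
    return frames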

@DarkLight1337 @litianjian In addition to video_url, it would be cool to discuss whether we should allow collapsing multiple adjacent images into a video, or something similar, to prevent ambiguity with models that support both multi-image and video input. I had started to implement that as an experiment, but it's a pretty disjoint set of changes from what is needed to support video_url, so I think they could be considered as separate contributions without conflicting with each other.