This sounds reasonable. By the way, OpenAI has a different way of passing audio files (see their Audio API), where the file type is stored in a separate "format" field rather than inside the base64 URL:
import base64
from openai import OpenAI

client = OpenAI()

# "recording.wav" is a placeholder path; the raw bytes are base64-encoded
# and the file type is passed separately via the "format" field.
with open("recording.wav", "rb") as f:
    encoded_string = base64.b64encode(f.read()).decode("utf-8")

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What is in this recording?"
                },
                {
                    "type": "input_audio",
                    "input_audio": {
                        "data": encoded_string,
                        "format": "wav"
                    }
                }
            ]
        },
    ]
)
print(completion.choices[0].message)
This design is indeed clearer. BTW, does the Vision API need the file type when using base64 URL?
Yes, the Vision API expects the full base64 data URL (so the file type is embedded in the URL itself), whereas the Audio API expects the file type and the main base64 content as separate fields.
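For reference, a minimal sketch of the Vision-style content part (the image path here is only a placeholder for illustration):

import base64

# Placeholder image path, purely for illustration.
with open("example.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

# Vision-style content part: the media type travels inside the data URL itself.
image_content = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64_image}"},
}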
Let's follow the Vision API for now. In the future, we can support both styles for all modalities.
I agree with the API proposed here. In your implementation, is the video parsing from a URL done in memory? AFAIK OpenCV doesn't have an interface for encoding/decoding from memory, so I was wondering how it's done and whether it uses other dependencies.
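One possible way to decode a downloaded video entirely in memory is to hand the raw bytes to PyAV, which accepts file-like objects; this is just a sketch of that approach, not necessarily what the implementation here uses:

import io
import av  # PyAV wraps FFmpeg and can read from file-like objects, so no temp file is needed

def decode_video_frames(video_bytes: bytes):
    # Open the in-memory buffer as a container and decode the first video stream
    # into RGB NumPy arrays of shape (H, W, 3).
    container = av.open(io.BytesIO(video_bytes))
    frames = [frame.to_ndarray(format="rgb24") for frame in container.decode(video=0)]
    container.close()
    return frames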
@DarkLight1337 @litianjian In addition to video_url, it would be cool to discuss whether we should allow collapsing multiple adjacent images into a video, or something similar, to prevent ambiguity with models that support both multi-image and video input. I had started to implement that as an experiment, but it's a pretty disjoint set of changes from what is needed to support video_url, so I think they could be separate potential contributions that don't conflict with each other.
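A minimal sketch of what collapsing adjacent same-sized images into a single video clip might look like (a hypothetical helper for illustration, not part of vLLM):

import numpy as np

def collapse_images_to_video(images: list[np.ndarray]) -> np.ndarray:
    # Treat a run of adjacent, identically shaped (H, W, C) images as the frames
    # of one video clip with shape (num_frames, H, W, C).
    if len({img.shape for img in images}) != 1:
        raise ValueError("Adjacent images must share the same shape to form a video")
    return np.stack(images, axis=0)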
🚀 The feature, motivation and pitch
Online video support for VLMs
vLLM already supports a large number of multimodal visual models, some of which accept both image and video input, such as Qwen2-VL and LLaVA-OneVision. Following the existing image implementation, this proposal adds support for video.
Referring to the visual interfaces of OpenAI (vision and video) and Google Gemini, the video interface should ideally support input from both video URLs and base64-encoded data.
Image
Video
The interface design mentioned above is a prototype. I have implemented this interface in my pipeline.
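For illustration, here is a rough sketch of what a Vision-style video request to an OpenAI-compatible vLLM server could look like. The server URL, model name, and video URL are placeholders, and the video_url content type follows the convention discussed in the comments above rather than a finalized API:

from openai import OpenAI

# Placeholder endpoint and model name, for illustration only.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

chat_response = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this video."},
                # Mirrors the Vision API's image_url style; a base64 data URL
                # (e.g. "data:video/mp4;base64,...") could be supplied instead.
                {"type": "video_url", "video_url": {"url": "https://example.com/sample.mp4"}},
            ],
        }
    ],
)
print(chat_response.choices[0].message)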
Alternatives
No response
Additional context
No response