saket424 opened this issue 2 months ago
I tried MiniCPM-V-2_6 naively and got:
server-1 | INFO: 192.168.155.172:39070 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity
So I need @matatonic's assistance.
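For reference, here is the request shape I'd expect to work: a minimal sketch using the standard OpenAI vision message format. The base URL, port, and API key placeholder are assumptions about my local setup, not confirmed project defaults.

```python
import base64
from openai import OpenAI

# Point the client at the local openedai-vision server
# (base_url/port/api_key are assumptions about my setup, not project defaults).
client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")

# Encode a local image as a base64 data URL, as the OpenAI vision API expects.
with open("test.jpg", "rb") as f:
    image_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```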
Currently testing, but image only so far, no video.
I've updated a dev branch with the latest changes, including MiniCPM-V 2.6, microsoft/Phi-3.5-vision-instruct and fancyfeast/joy-caption-pre-alpha. I'm still testing and the :dev image is still building, so YMMV.
By video, they mean a collection of images (so not quite video).
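In practice, that means a request carries several image parts plus one text part in a single user message. A minimal sketch of that shape using the standard OpenAI vision format; the base URL, port, helper name, and frame filenames are illustrative assumptions, not something tested against this server yet:

```python
import base64
from openai import OpenAI

# base_url/port are assumptions about a local setup, not project defaults.
client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")

def to_data_url(path):
    # Base64-encode a sampled frame as a JPEG data URL (helper name is illustrative).
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# "Video" as a collection of frames: one image_url part per sampled frame.
frame_parts = [{"type": "image_url", "image_url": {"url": to_data_url(p)}}
               for p in ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]]

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",
    messages=[{"role": "user",
               "content": frame_parts + [{"type": "text", "text": "Describe the video"}]}],
)
print(response.choices[0].message.content)
```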
The dev build works, thanks.
CLI_COMMAND="python vision.py -m openbmb/MiniCPM-V-2_6 --use-flash-attn --device-map cuda:0 --load-in-4bit"
anand@dell4090:~/openedai-stuff/openedai-vision$ docker compose up
[+] Running 2/2
✔ Network openedai-vision_default Created 0.1s
✔ Container openedai-vision-server-1 Created 0.0s
Attaching to server-1
server-1 | 2024-08-25 21:35:27.061 | INFO | ...
server-1 | FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
server-1 | warnings.warn(
server-1 | INFO: 192.168.155.172:39888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
> By video, they mean a collection of images (so not quite video).

Yes, it's an image-sampler technique. But it's still not working for me; the sample code they provide fails to identify the video in my tests. Perhaps it's still my error, but it probably won't be fixed for this release.
There's another project I like called amblegpt (https://github.com/mhaowork/amblegpt) that has an ffmpeg frame sampler built in and is OpenAI-compatible.
We could use that to test this functionality.
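For testing, the kind of ffmpeg sampling amblegpt does can be approximated in a few lines. This is a rough sketch of the idea, not amblegpt's actual code; it assumes ffmpeg is installed and on PATH, and samples one frame per second into JPEGs:

```python
import subprocess
import tempfile
from pathlib import Path

def sample_frames(video_path, fps=1.0):
    """Sample frames from a video with ffmpeg (rough sketch, not amblegpt's code).

    Assumes ffmpeg is on PATH; writes JPEGs to a temp directory.
    """
    out_dir = Path(tempfile.mkdtemp(prefix="frames_"))
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",              # sample `fps` frames per second
            "-q:v", "2",                      # high JPEG quality
            str(out_dir / "frame_%04d.jpg"),  # numbered output frames
        ],
        check=True,
    )
    return sorted(out_dir.glob("frame_*.jpg"))

frames = sample_frames("video_test.mp4", fps=1.0)
print(f"sampled {len(frames)} frames")
# These JPEGs can then be base64-encoded and sent as multiple image parts,
# as in the request sketch above.
```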
Merged to main, 0.29.0 release. I will leave this ticket open until video is supported.
I tried this standalone Python code and it runs on my 4090 GPU:
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n evenly spaced indices, offset by half a gap so each sample
        # sits mid-interval (e.g. len(l)=300, n=64 -> gap~4.69, idxs 2, 7, 11, ...).
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # stride in frames for ~1 sample per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},  # frames and text mixed in one user turn
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # use 1 if CUDA OOM and video resolution > 448*448

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
https://github.com/user-attachments/assets/236ff243-d74a-4e2b-83b8-98580e56de36
python3 try.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 16.71it/s]
num frames: 15
/home/anand/2.6/venv2.6/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
The video begins with a news broadcast from FOX 11 at 5 PM, showing a cityscape during sunset or sunrise. It then transitions to footage of two individuals in what appears to be a school hallway near blue lockers and yellow caution lines on the floor. One individual is wearing a dark shirt and light-colored pants, while the other is in a white top and dark pants. The scene involves physical confrontation where one person is restrained by the other against the wall. The struggle continues as the person in the white top attempts to maintain control over the situation. Eventually, another individual enters, seemingly trying to mediate or intervene. The final frame features the FOX 11 logo with text "ONLY ON FOX 11," indicating exclusive content coverage.
@matatonic I managed to try this still-unfinished PR for llama.cpp, and it works:
https://github.com/ggerganov/llama.cpp/pull/9165#issuecomment-2312984641
MiniCPM-V 2.6 is billed as "A GPT-4V Level MLLM for Single Image, Multi Image and Video".
The claim is that it performs very well for an 8-billion-parameter model.
I am interested in learning what it takes to add support for 2.6 when 2.5 is already supported.
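As a rough, untested guess at the shape of it, assuming the project wraps each model family in a small chat adapter (the class and method names below are hypothetical, not openedai-vision's actual internals), 2.6 support might mostly reuse the model-card chat API from the standalone script above:

```python
# Hypothetical adapter sketch -- names and structure are guesses, not
# openedai-vision's actual internals. The load/chat calls mirror the
# openbmb/MiniCPM-V-2_6 model card (and the standalone script above).
import torch
from transformers import AutoModel, AutoTokenizer

class MiniCPMV26Adapter:
    def __init__(self, model_id='openbmb/MiniCPM-V-2_6'):
        self.model = AutoModel.from_pretrained(
            model_id, trust_remote_code=True,
            attn_implementation='sdpa', torch_dtype=torch.bfloat16,
        ).eval().cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    def chat(self, images, question):
        # 2.6 takes a list of PIL images (stills or sampled video frames) plus
        # the text in one user turn; image=None appears to distinguish this
        # from the 2.5 call style, which passed a single image argument.
        msgs = [{'role': 'user', 'content': list(images) + [question]}]
        return self.model.chat(
            image=None, msgs=msgs, tokenizer=self.tokenizer,
            use_image_id=False, max_slice_nums=2,
        )
```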
Thanks