saket424 opened this issue 2 months ago
I tried MiniCPM-V-2_6 naively and got:
server-1 | INFO: 192.168.155.172:39070 - "POST /v1/chat/completions HTTP/1.1" 422 Unprocessable Entity
So I need @matatonic's assistance.
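For reference, here is the request shape I'd expect to work: a minimal sketch using the standard OpenAI vision message format. The base URL, port, and API key placeholder are assumptions about my local setup, not confirmed project defaults.

```python
import base64
from openai import OpenAI

# Point the client at the local openedai-vision server
# (base_url/port/api_key are assumptions about my setup, not project defaults).
client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")

# Encode a local image as a base64 data URL, as the OpenAI vision API expects.
with open("test.jpg", "rb") as f:
    image_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": image_url}},
        ],
    }],
)
print(response.choices[0].message.content)
```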
Currently testing, but image only so far, no video.
I've updated a dev branch with the latest changes, including MiniCPM-V 2.6, microsoft/Phi-3.5-vision-instruct and fancyfeast/joy-caption-pre-alpha. I'm still testing and the :dev image is still building, so YMMV.
By video, they mean a collection of images (so not quite video).
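In practice, that means a request carries several image parts plus one text part in a single user message. A minimal sketch of that shape using the standard OpenAI vision format; the base URL, port, helper name, and frame filenames are illustrative assumptions, not something tested against this server yet:

```python
import base64
from openai import OpenAI

# base_url/port are assumptions about a local setup, not project defaults.
client = OpenAI(base_url="http://localhost:5006/v1", api_key="skip")

def to_data_url(path):
    # Base64-encode a sampled frame as a JPEG data URL (helper name is illustrative).
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# "Video" as a collection of frames: one image_url part per sampled frame.
frame_parts = [{"type": "image_url", "image_url": {"url": to_data_url(p)}}
               for p in ["frame_0001.jpg", "frame_0002.jpg", "frame_0003.jpg"]]

response = client.chat.completions.create(
    model="openbmb/MiniCPM-V-2_6",
    messages=[{"role": "user",
               "content": frame_parts + [{"type": "text", "text": "Describe the video"}]}],
)
print(response.choices[0].message.content)
```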
The dev build works, thanks.
CLI_COMMAND="python vision.py -m openbmb/MiniCPM-V-2_6 --use-flash-attn --device-map cuda:0 --load-in-4bit"
anand@dell4090:~/openedai-stuff/openedai-vision$ docker compose up
[+] Running 2/2
✔ Network openedai-vision_default Created 0.1s
✔ Container openedai-vision-server-1 Created 0.0s
Attaching to server-1
server-1 | 2024-08-25 21:35:27.061 | INFO | ...
server-1 | FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
server-1 | warnings.warn(
server-1 | INFO: 192.168.155.172:39888 - "POST /v1/chat/completions HTTP/1.1" 200 OK
> By video, they mean a collection of images (so not quite video).

Yes, it's an image-sampler technique. But it's still not working for me; the sample code they provide fails to identify the video in my tests. Perhaps it's still my error, but it probably won't be fixed for this release.
There's another project I like called amblegpt (https://github.com/mhaowork/amblegpt) that has an ffmpeg frame sampler built in and is OpenAI-compatible.
We could use that to test this functionality.
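For testing, the kind of ffmpeg sampling amblegpt does can be approximated in a few lines. This is a rough sketch of the idea, not amblegpt's actual code; it assumes ffmpeg is installed and on PATH, and samples one frame per second into JPEGs:

```python
import subprocess
import tempfile
from pathlib import Path

def sample_frames(video_path, fps=1.0):
    """Sample frames from a video with ffmpeg (rough sketch, not amblegpt's code).

    Assumes ffmpeg is on PATH; writes JPEGs to a temp directory.
    """
    out_dir = Path(tempfile.mkdtemp(prefix="frames_"))
    subprocess.run(
        [
            "ffmpeg", "-i", video_path,
            "-vf", f"fps={fps}",              # sample `fps` frames per second
            "-q:v", "2",                      # high JPEG quality
            str(out_dir / "frame_%04d.jpg"),  # numbered output frames
        ],
        check=True,
    )
    return sorted(out_dir.glob("frame_*.jpg"))

frames = sample_frames("video_test.mp4", fps=1.0)
print(f"sampled {len(frames)} frames")
# These JPEGs can then be base64-encoded and sent as multiple image parts,
# as in the request sketch above.
```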
Merged to main, 0.29.0 release. I will leave this ticket open until video is supported.
I tried this standalone Python code and it runs on my 4090 GPU:
```python
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
from decord import VideoReader, cpu  # pip install decord

model = AutoModel.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True,
    attn_implementation='sdpa', torch_dtype=torch.bfloat16)  # sdpa or flash_attention_2, no eager
model = model.eval().cuda()
tokenizer = AutoTokenizer.from_pretrained('openbmb/MiniCPM-V-2_6', trust_remote_code=True)

MAX_NUM_FRAMES = 64  # if CUDA OOM, set a smaller number

def encode_video(video_path):
    def uniform_sample(l, n):
        # Pick n evenly spaced indices, offset by half a gap so each sample
        # sits mid-interval (e.g. len(l)=300, n=64 -> gap~4.69, idxs 2, 7, 11, ...).
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # stride in frames for ~1 sample per second
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

video_path = "video_test.mp4"
frames = encode_video(video_path)
question = "Describe the video"
msgs = [
    {'role': 'user', 'content': frames + [question]},  # frames and text mixed in one user turn
]

# Set decode params for video
params = {}
params["use_image_id"] = False
params["max_slice_nums"] = 2  # use 1 if CUDA OOM and video resolution > 448*448

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)
print(answer)
```
https://github.com/user-attachments/assets/236ff243-d74a-4e2b-83b8-98580e56de36
python3 try.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 16.71it/s]
num frames: 15
/home/anand/2.6/venv2.6/lib/python3.10/site-packages/transformers/models/auto/image_processing_auto.py:513: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
warnings.warn(
The video begins with a news broadcast from FOX 11 at 5 PM, showing a cityscape during sunset or sunrise. It then transitions to footage of two individuals in what appears to be a school hallway near blue lockers and yellow caution lines on the floor. One individual is wearing a dark shirt and light-colored pants, while the other is in a white top and dark pants. The scene involves physical confrontation where one person is restrained by the other against the wall. The struggle continues as the person in the white top attempts to maintain control over the situation. Eventually, another individual enters, seemingly trying to mediate or intervene. The final frame features the FOX 11 logo with text "ONLY ON FOX 11," indicating exclusive content coverage.
@matatonic I managed to try this still-unfinished PR for llama.cpp, and it works:
https://github.com/ggerganov/llama.cpp/pull/9165#issuecomment-2312984641
MiniCPM-V 2.6 is billed as "A GPT-4V Level MLLM for Single Image, Multi Image and Video".
The claim is that it performs very well for an 8-billion-parameter model.
I am interested in learning what it takes to add support for 2.6 when 2.5 is already supported.
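As a rough, untested guess at the shape of it, assuming the project wraps each model family in a small chat adapter (the class and method names below are hypothetical, not openedai-vision's actual internals), 2.6 support might mostly reuse the model-card chat API from the standalone script above:

```python
# Hypothetical adapter sketch -- names and structure are guesses, not
# openedai-vision's actual internals. The load/chat calls mirror the
# openbmb/MiniCPM-V-2_6 model card (and the standalone script above).
import torch
from transformers import AutoModel, AutoTokenizer

class MiniCPMV26Adapter:
    def __init__(self, model_id='openbmb/MiniCPM-V-2_6'):
        self.model = AutoModel.from_pretrained(
            model_id, trust_remote_code=True,
            attn_implementation='sdpa', torch_dtype=torch.bfloat16,
        ).eval().cuda()
        self.tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

    def chat(self, images, question):
        # 2.6 takes a list of PIL images (stills or sampled video frames) plus
        # the text in one user turn; image=None appears to distinguish this
        # from the 2.5 call style, which passed a single image argument.
        msgs = [{'role': 'user', 'content': list(images) + [question]}]
        return self.model.chat(
            image=None, msgs=msgs, tokenizer=self.tokenizer,
            use_image_id=False, max_slice_nums=2,
        )
```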
Thanks