huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Video-LLaVA-7B-hf doesn't work (returns nonsense) #32655

Closed · royvelich closed this issue 2 months ago

royvelich commented 2 months ago

Who can help?

@amyeroberts

Reproduction

This used to work for me, but it recently stopped working without any change on my side. I am trying to run the example given at https://huggingface.co/LanguageBind/Video-LLaVA-7B-hf.

Specifically, I run the following piece of code:

import numpy as np
import av
from huggingface_hub import hf_hub_download
from transformers import VideoLlavaProcessor, VideoLlavaForConditionalGeneration

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.

    Args:
        container (av.container.input.InputContainer): PyAV container.
        indices (List[int]): List of frame indices to decode.

    Returns:
        np.ndarray: np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

model = VideoLlavaForConditionalGeneration.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")
processor = VideoLlavaProcessor.from_pretrained("LanguageBind/Video-LLaVA-7B-hf")

prompt = "USER: <video>Why is this video funny? ASSISTANT:"
video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)

inputs = processor(text=prompt, videos=clip, return_tensors="pt")

# Generate
generate_ids = model.generate(**inputs, max_length=80)
print(processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0])

I get the following as output:

USER: Why is this video funny? ASSISTANT: 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

Do you have any idea why? It happens on both Windows and Ubuntu. I suspect it is caused by a version upgrade of one of the related packages, since it used to work for me.
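
Since a package upgrade is the most likely culprit, here is a minimal diagnostic sketch (my addition, not part of the original report) to record the versions involved:

import platform
import av
import torch
import transformers

# Print the environment details that would normally go in "System Info"
print("platform:", platform.platform())
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("av:", av.__version__)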

Expected behavior

Returns meaningful text.

zucchini-nlp commented 2 months ago

Hey! Yes, the last release broke Video-LLaVA generation and there's a PR open to fix it: https://github.com/huggingface/transformers/pull/32417.
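
Until that fix ships, a possible workaround is to stay on a release from before the regression. The version check below is my own sketch, assuming the breakage arrived with v4.44.0 (inferred from the timeline, not stated explicitly in this thread):

import transformers
from packaging import version

# Assumption: the Video-LLaVA regression was introduced in transformers v4.44.0
# and is fixed by PR #32417; warn if the installed release is in the broken range.
installed = version.parse(transformers.__version__)
if installed >= version.parse("4.44.0"):
    print(f"transformers {installed} may produce degenerate Video-LLaVA output; "
          "consider pinning an earlier release (pip install 'transformers<4.44') "
          "or installing from source once the fix is merged.")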

royvelich commented 2 months ago

@zucchini-nlp will there be a new release once the PR is merged?

zucchini-nlp commented 2 months ago

This should be included in the next patch release, cc @ArthurZucker