huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LLava-Next example is broken #32273

Closed zjuerme closed 1 month ago

zjuerme commented 1 month ago

System Info

transformers==4.43.3

When I use the video inference example from https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf, it fails. Confusingly, image inference works fine.

The code is:

import av
import numpy as np
import torch
from huggingface_hub import hf_hub_download
from transformers import LlavaNextVideoProcessor, LlavaNextVideoForConditionalGeneration

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"

model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)

processor = LlavaNextVideoProcessor.from_pretrained(model_id)

def read_video_pyav(container, indices):
    '''
    Decode the video with PyAV decoder.
    Args:
        container (`av.container.input.InputContainer`): PyAV container.
        indices (`List[int]`): List of frame indices to decode.
    Returns:
        result (np.ndarray): np array of decoded frames of shape (num_frames, height, width, 3).
    '''
    frames = []
    container.seek(0)
    start_index = indices[0]
    end_index = indices[-1]
    for i, frame in enumerate(container.decode(video=0)):
        if i > end_index:
            break
        if i >= start_index and i in indices:
            frames.append(frame)
    return np.stack([x.to_ndarray(format="rgb24") for x in frames])

# define a chat history and use `apply_chat_template` to get correctly formatted prompt
# Each value in "content" has to be a list of dicts with types ("text", "image", "video") 
conversation = [
    {

        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
            ],
    },
]

prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

video_path = hf_hub_download(repo_id="raushan-testing-hf/videos-test", filename="sample_demo_1.mp4", repo_type="dataset")
container = av.open(video_path)

# sample uniformly 8 frames from the video, can sample more for longer videos
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)
clip = read_video_pyav(container, indices)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))

The error is:

ValueError: Number of image tokens in input_ids (0) different from num_images (1).

Who can help?

No response

Information

Tasks

Reproduction

Use the example in https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf

Expected behavior

none

zucchini-nlp commented 1 month ago

@zjuerme hey!

I tried the same code and it's working for me on transformers==4.43.3. Are you on MLX? It might be loosely related to https://github.com/huggingface/transformers/issues/30294.

If not, please share your env info; it would help me figure out what is happening.
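
A minimal sketch for gathering that env info, assuming only the package versions mentioned later in this thread are relevant:

import platform
import accelerate
import av
import torch
import transformers

# Print the versions most relevant to this issue
print("python", platform.python_version())
print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("accelerate", accelerate.__version__)
print("av", av.__version__)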

zjuerme commented 1 month ago

Thanks for your answer. I hit the problem in #30294 before and solved it by upgrading transformers (that was LLaVA-NeXT). The problem I am facing now is with LLaVA-NeXT-Video.

My device is an A6000 and the configuration is as follows:

Python 3.10.14
torch==2.4.0
transformers==4.43.3
accelerate==0.33.0
av==12.3.0

If you need a more detailed configuration, please let me know. Thank you for your help!

zucchini-nlp commented 1 month ago

Hmm, interesting. I cannot reproduce it yet. Can I ask you to try the following:

  1. Print out the input ids before generate and check whether they contain image/video tokens

    inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)
    print(inputs_video.input_ids, model.config.image_token_index in inputs_video.input_ids, model.config.video_token_index in inputs_video.input_ids)
  2. Find out whether the error comes from the decoding stage or the pre-fill stage by feeding the inputs to forward directly

    inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)
    output = model(**inputs_video)
  3. If step 2 fails with an error, run this and tell me the output.

    video_features = model._get_video_features(inputs_video.pixel_values_videos)
    video_features = [feature.flatten(0, 1) for feature in video_features]
    feature_lens = [feature.size(0) for feature in video_features]
    video_features = torch.cat(video_features, dim=0)
    feature_lens = torch.tensor(feature_lens, dtype=torch.long, device=video_features.device)
    print(video_features.shape, feature_lens.shape)
  4. If step 2 is successful, then it is related to past key values, so please make sure you are generating with use_cache=True (which should be there by default); see the sketch after this list.
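
A minimal sketch of that last check, assuming the same model, processor, and inputs_video defined in the reproduction above:

    # Generate with the KV cache explicitly enabled (the default behaviour)
    output = model.generate(**inputs_video, max_new_tokens=100, do_sample=False, use_cache=True)
    print(processor.decode(output[0], skip_special_tokens=True))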

zjuerme commented 1 month ago

Thank you for your enthusiastic answer! The problem has been solved. I finally found that it comes down to the token indices.

The problem lies in the config.json:

"image_token_index": 32001,
"video_token_index": 32000,

however, in my environment:

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]
print(self.model.config.video_token_index, self.model.config.image_token_index)
prompt = self.processor.apply_chat_template(conversation, add_generation_prompt=True)
cprint(prompt, 'cyan')

which prints:

32000 32000
USER: Why is this video funny? ASSISTANT:
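
A small sketch to cross-check this, assuming the llava-hf checkpoints register <image> and <video> as special tokens:

# Compare the token indices stored in the loaded config with the ids the
# tokenizer actually assigns to the special tokens
print(model.config.image_token_index, model.config.video_token_index)
print(processor.tokenizer.convert_tokens_to_ids(["<image>", "<video>"]))
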
zucchini-nlp commented 1 month ago

@zjuerme thanks for investigating! The configs on the hub are currently correct; probably you have to force a re-download of them with from_pretrained(model_id, force_download=True).
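
A minimal sketch of that fix, assuming the expected index values from the hub config (32001 for image, 32000 for video):

import torch
from transformers import LlavaNextVideoForConditionalGeneration

# Re-fetch the checkpoint files, ignoring anything stale in the local cache
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    force_download=True,
)
print(model.config.image_token_index)  # expected: 32001
print(model.config.video_token_index)  # expected: 32000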