@zjuerme hey!
I tried the same code and it's working for me in transformers==4.43.3. Are you on MLX? This might be loosely related to https://github.com/huggingface/transformers/issues/30294.
If not, please share your env info; it would help me figure out what is happening.
Thanks for your answer. I had seen issue #30294 and solved it by upgrading transformers (that was with LLaVA-NeXT). The problem I am facing now is with LLaVA-NeXT-Video.
My device is an A6000 and the configuration is as follows:
Python 3.10.14
torch==2.4.0
transformers==4.43.3
accelerate==0.33.0
av==12.3.0
If you need more detailed configuration info, please let me know! Thank you for your help.
Hmm, interesting. I cannot reproduce it yet; can I ask you to try the following?
1. Print out the input ids before generate and check whether they contain image/video tokens:
```python
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)
print(inputs_video.input_ids, model.config.image_token_index in inputs_video.input_ids, model.config.video_token_index in inputs_video.input_ids)
```
2. Find out whether the error comes from the decoding stage or the pre-fill stage by feeding the inputs to forward directly:
```python
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)
output = model(**inputs_video)
```
3. If step 2 fails with an error, run the following and tell me the output:
```python
import torch

video_features = model._get_video_features(inputs_video.pixel_values_videos)
video_features = [feature.flatten(0, 1) for feature in video_features]
feature_lens = [feature.size(0) for feature in video_features]
video_features = torch.cat(video_features, dim=0)
feature_lens = torch.tensor(feature_lens, dtype=torch.long, device=video_features.device)
print(video_features.shape, feature_lens.shape)
```
If step 2 is successful, then it is related to past key values, so please make sure you are generating with use_cache=True (which should be the default).
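For reference, a minimal sketch of what I mean, assuming the inputs_video from step 1 (use_cache=True is already the default, so passing it explicitly only rules out an override in your generation config):
```python
# Generate with the KV cache explicitly enabled; if this succeeds while the
# plain generate call fails, the problem is in the past-key-values path.
output_ids = model.generate(**inputs_video, max_new_tokens=60, use_cache=True)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```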
Thank you for your enthusiastic answer! The problem has been solved. I finally found where the problem lies: the config.json on the hub has
```json
"image_token_index": 32001,
"video_token_index": 32000,
```
however, my locally loaded config returns the same value for both:
```python
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]
print(self.model.config.video_token_index, self.model.config.image_token_index)
prompt = self.processor.apply_chat_template(conversation, add_generation_prompt=True)
cprint(prompt, 'cyan')  # cprint from termcolor, used only for colored output
```
The output is:
```
32000 32000
USER: Why is this video funny? ASSISTANT:
```
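In case it helps anyone else who hits this before refreshing their cache: a hypothetical in-memory workaround is to patch the stale value back to the hub one (32001 is taken from the config.json excerpt above):
```python
# Restore the hub value so the image and video special tokens no longer collide.
self.model.config.image_token_index = 32001
```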
@zjuerme thanks for investigating! The configs on the hub are currently correct; you probably have to force a redownload of the cached files with from_pretrained(model_id, force_download=True).
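A minimal sketch of that, assuming the checkpoint from this thread (note that force_download=True re-fetches the weights as well, so deleting just the cached config.json is a lighter alternative):
```python
from transformers import LlavaNextVideoForConditionalGeneration

# force_download=True replaces any stale cached files, including the
# config.json that holds the colliding image_token_index.
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    "llava-hf/LLaVA-NeXT-Video-7B-hf",
    force_download=True,
)
```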
System Info
transformers==4.43.3
When I use the example from https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf to run video inference, it fails. Confusingly, image inference works fine.
[code and error screenshots not captured in this transcript]
Who can help?
No response
Reproduction
Use the example in https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf.
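For convenience, a condensed sketch of that model-card example (my paraphrase, not the authoritative version; video.mp4 is a hypothetical local path):
```python
import av
import numpy as np
import torch
from transformers import LlavaNextVideoForConditionalGeneration, LlavaNextVideoProcessor

model_id = "llava-hf/LLaVA-NeXT-Video-7B-hf"
model = LlavaNextVideoForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
processor = LlavaNextVideoProcessor.from_pretrained(model_id)

def read_video_pyav(container, indices):
    # Decode and return the frames at `indices` as an (N, H, W, 3) uint8 array.
    frames = []
    container.seek(0)
    for i, frame in enumerate(container.decode(video=0)):
        if i in indices:
            frames.append(frame.to_ndarray(format="rgb24"))
    return np.stack(frames)

container = av.open("video.mp4")  # hypothetical local path
total_frames = container.streams.video[0].frames
indices = np.arange(0, total_frames, total_frames / 8).astype(int)  # sample 8 frames
clip = read_video_pyav(container, indices)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Why is this video funny?"},
            {"type": "video"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs_video = processor(text=prompt, videos=clip, padding=True, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs_video, max_new_tokens=60)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```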
Expected behavior
none