intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc

Encounter error when running Qwen2-VL in ipex-llm processing input video with large frame number #12469

Open zhangcong2019 opened 3 days ago

zhangcong2019 commented 3 days ago

I encounter an error when running Qwen2-VL in ipex-llm while processing an input video with a large frame count. The detailed error message and code are below; the video is attached as well.

Error information

  File "/home/lvm/qwenvl/reproduce.py", line 53, in query_video
    generated_ids = model.generate(**inputs, max_new_tokens=128)
  File "/home/lvm/miniforge3/envs/qwen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/lvm/miniforge3/envs/qwen/lib/python3.10/site-packages/ipex_llm/transformers/pipeline_parallel.py", line 283, in generate
    return original_generate(self,
  File "/home/lvm/miniforge3/envs/qwen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/lvm/miniforge3/envs/qwen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2215, in generate
    result = self._sample(
  File "/home/lvm/miniforge3/envs/qwen/lib/python3.10/site-packages/transformers/generation/utils.py", line 3249, in _sample
    next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0

Video: https://github.com/user-attachments/assets/fa970bd8-294b-44c3-b807-ffa3f85e1046

Code:

import os
#os.environ['CURL_CA_BUNDLE'] = ''
os.environ['HF_ENDPOINT']='https://hf-mirror.com'
# os.environ['CUDA_VISIBLE_DEVICES']='1'

from math import ceil
import torchvision
import transformers
import torch

print(transformers.__version__)

from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

def query_video(prompt, video_path=None):
    # Create messages structure for the entire video
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "video",
                    "video": f"file://{video_path}",
                    "max_pixels": 360 * 420,
                    "fps": 6,
                },
                {"type": "text", "text": prompt},
            ],
        }
    ]

    # Preparation for inference
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, video_inputs = process_vision_info(messages)
    # image_inputs = image_inputs.to('xpu')
    # video_inputs = video_inputs.to('xpu')
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )

    inputs = inputs.to("xpu")

    # Inference
    with torch.no_grad():  # Use no_grad to save memory during inference
        generated_ids = model.generate(**inputs, max_new_tokens=128)

    # Trim the generated output to remove the input prompt
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]

    # Decode the generated text
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )

    print(output_text)
    torch.xpu.empty_cache()

model_name = "Qwen/Qwen2-VL-2B-Instruct"

video_name = "[path to video]gymnast.mp4"

from ipex_llm import optimize_model
model = Qwen2VLForConditionalGeneration.from_pretrained(
                                                model_name,
                                                trust_remote_code=True,
                                                torch_dtype='auto',
                                                low_cpu_mem_usage=True,
                                                use_cache=True)
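# Apply ipex-llm low-bit (sym_int4) optimization to the language model;
# modules_to_not_convert=["visual"] keeps the vision tower in its original precision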
model = optimize_model(model, low_bit='sym_int4', modules_to_not_convert=["visual"])
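# Cast to fp16 and move the model to the Intel XPU device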
model = model.half().to("xpu")

# default processor
processor = AutoProcessor.from_pretrained(model_name)

query_video("describe the video in detail", video_path=video_name)

Environment:

torch                         2.1.0a0+cxx11.abi
torchaudio                    2.1.0a0+cxx11.abi
torchvision                   0.16.0a0+cxx11.abi
transformers                  4.46.3
intel-extension-for-pytorch   2.1.10+xpu
ipex-llm                      2.2.0b20241126
MeouSker77 commented 3 hours ago

Hi, this error is caused by an fp16 overflow; we'll fix it as soon as possible.

For now, if you are using an Arc A7xx/5xx/3xx or Lunar Lake (Ultra 2xxV) GPU, you can try model = model.float().to("xpu") instead of model = model.half().to("xpu").
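
For reference, a minimal sketch of that workaround applied to the model-loading code from the report above (everything else in the script unchanged; note the fp32 path uses more GPU memory than fp16):

from ipex_llm import optimize_model
model = Qwen2VLForConditionalGeneration.from_pretrained(
                                                model_name,
                                                trust_remote_code=True,
                                                torch_dtype='auto',
                                                low_cpu_mem_usage=True,
                                                use_cache=True)
model = optimize_model(model, low_bit='sym_int4', modules_to_not_convert=["visual"])
# Run in fp32 instead of fp16 to avoid the overflow (suggested above for Arc A7xx/5xx/3xx
# and Lunar Lake); this trades memory for numerical range.
model = model.float().to("xpu")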