Johere opened this issue 4 weeks ago
Hi @Johere, we are reproducing this issue. We will update here for any progress :)
Hi @Johere, we have updated our llava example for llava-hf/llava-1.5-7b-hf. Please follow the instructions in the latest llava example to see if it works.
If the issue continues, could you please share the scripts you're using to run the multi-turn chat, along with the output from our env-check scripts, to help us gather more details? :)
Hi @JinheTang Thanks for your reply. The problem still exists. To reproduce the problem I met, please modify several lines of the latest llava example:
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/llava/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/llava/generate.py
index b70e22541a..c3b35ee2d8 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/llava/generate.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/llava/generate.py
@@ -56,8 +56,23 @@ if __name__ == '__main__':
{"type": "image"},
{"type": "text", "text": prompt}
]
+ },
+ # mimic a multi-round chat
+ {
+ 'role': 'assistant',
+ 'content': [
+ {'type': 'text', 'text': 'The image features a young girl holding a stuffed teddy bear.'}
+ ]
+ },
+ {
+ "role": "user",
+ "content": [
+ {"type": "image"},
+ {"type": "text", "text": "Describe the differences between these two images."}
+ ]
}
]
+
text = processor.apply_chat_template(messages, add_generation_prompt=True)
if os.path.exists(image_path):
@@ -65,7 +80,10 @@ if __name__ == '__main__':
else:
image = Image.open(requests.get(image_path, stream=True).raw)
- inputs = processor(text=text, images=image, return_tensors="pt").to('xpu')
+ # inputs = processor(text=text, images=image, return_tensors="pt").to('xpu')
+ # multi-image chat debug
+ image_2 = Image.open(requests.get("http://farm5.staticflickr.com/4031/4440753665_631134eaa4_z.jpg", stream=True).raw)
+ inputs = processor(text=text, images=[image, image_2], return_tensors="pt").to('xpu')
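For reference, the message list added by the diff can be sketched as plain Python. The key point is that each `{"type": "image"}` placeholder in the conversation must be matched by one image in the `images=` argument of the processor call, so this multi-round conversation needs two images. (This is only an illustrative sketch of the structure from the diff; the assistant text is the mimicked first-round answer.)

```python
# Multi-round message list, mirroring the diff above.
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is this?"},
    ]},
    # mimic a multi-round chat: the assistant's first-round answer
    {"role": "assistant", "content": [
        {"type": "text", "text": "The image features a young girl holding a stuffed teddy bear."},
    ]},
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the differences between these two images."},
    ]},
]

# Count the image placeholders: the processor call must receive the
# same number of images, e.g. images=[image, image_2].
num_images = sum(
    1
    for m in messages
    for part in m["content"]
    if part.get("type") == "image"
)
print(num_images)  # 2
```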
Env check output log is attached: env-check.txt
Hi @Johere , thanks for the script, we will try to reproduce it.
Hi @Johere , we have reproduced the issue. If there's any update we will let you know.
Hi @Johere,
Sorry for the late reply. We have fixed this bug in ipex-llm>=2.2.0b20241113. You could try the latest ipex-llm :)
Please let us know of any further problems.
The multi-turn chat looks like:
1st round: http://farm6.staticflickr.com/5268/5602445367_3504763978_z.jpg What is this?
2nd round: http://farm5.staticflickr.com/4031/4440753665_631134eaa4_z.jpg What are the differences between these two images?
Error logs:
The error is located at /usr/lib/python3.10/site-packages/ipex_llm/transformers/low_bit_linear.py:729:
x_2d = x.view(-1, x_shape[-1])
If I modify it to:
x_2d = x.contiguous().view(-1, x_shape[-1])
everything works. I think the issue is related to the LLaVA model's vision_feature_select_strategy (vision_feature_select_strategy=default), which may make the tensor non-contiguous. Can anyone help with this issue? Thanks!
Python packages: ipex-llm 2.2.0b20241011, transformers 4.45.2