haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] ValueError #1748

t-mockbel opened this issue 4 weeks ago

t-mockbel commented 4 weeks ago

Describe the issue

Issue: I'm trying to use llava-1.5-7b-hf, and I'm new and clueless at debugging LMMs. I get an error when I try to run the simple usage example:

```
raise ValueError(
ValueError: The input provided to the model are wrong. The number of image tokens is 100 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
```

And I really don't get it.

Command:

```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to(0)

processor = AutoProcessor.from_pretrained(
    model_id, patch_size=32, vision_feature_select_strategy='default'
)

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, dtype=torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```

But I also get these warnings:

```
envs\myenv\lib\site-packages\transformers\models\clip\modeling_clip.py:540: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
```

AsteriaCao commented 3 weeks ago

I solved this problem by adding two lines when initializing llava-1.5-7b-hf:

```python
self.processor.patch_size = self.model.config.vision_config.patch_size
self.processor.vision_feature_select_strategy = self.model.config.vision_feature_select_strategy
```

In other words, I set patch_size and vision_feature_select_strategy on the processor manually, using the same values from model.config.
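
For reference, the root cause appears to be the hard-coded `patch_size=32` in the original script: llava-1.5's CLIP ViT-L/14-336 vision tower uses a patch size of 14, so the processor inserts (336 // 32)² = 100 `<image>` placeholder tokens while the model produces far more image features, which matches the "number of image tokens is 100" in the error. Below is a minimal sketch of the original script with the fix above applied, reading both values from the model config instead of hard-coding them; the values noted in the comments are assumptions about what llava-1.5-7b-hf ships with, not something verified here.

```python
# Sketch only: same as the original script, but patch_size and
# vision_feature_select_strategy are copied from the model config,
# so the processor inserts the number of <image> tokens the model
# actually expects.
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to(0)

processor = AutoProcessor.from_pretrained(model_id)
# Keep the processor consistent with the model (expected: 14 and 'default').
processor.patch_size = model.config.vision_config.patch_size
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=raw_image, text=prompt, return_tensors="pt").to(0, dtype=torch.float16)
output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0], skip_special_tokens=True))
```

Newer transformers releases and updated llava-hf processor configs may set these attributes automatically, in which case the manual assignment becomes a no-op; that is an assumption worth checking against your installed version.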