t-mockbel opened this issue 4 weeks ago
I solved this problem by adding two lines to the llava-1.5-7b-hf initialization:

```python
self.processor.patch_size = self.model.config.vision_config.patch_size
self.processor.vision_feature_select_strategy = self.model.config.vision_feature_select_strategy
```

These two lines set `patch_size` and `vision_feature_select_strategy` on the processor manually, using the same values stored in `model.config`.
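For the reproduction script below (where there is no `self.`), a minimal sketch of the same fix, assuming the model and processor are loaded as in that script, would be:

```python
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

# Copy the values the vision tower actually uses instead of hard-coding them
processor.patch_size = model.config.vision_config.patch_size
processor.vision_feature_select_strategy = model.config.vision_feature_select_strategy
```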
Describe the issue
Issue: I'm trying to use llava-1.5-7b-hf and i'm new and clueless in debugging LMMs. Ihave an error when i try to use the simple example of usage: raise ValueError( ValueError: The input provided to the model are wrong. The number of image tokens is 100 while the number of image given to the model is 1. This prevents correct indexing and breaks batch generation.
And i really don't get it.
Command:

```python
import requests
from PIL import Image
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to(0)

processor = AutoProcessor.from_pretrained(model_id, patch_size=32, vision_feature_select_strategy='default')

image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=raw_image, text=prompt, return_tensors='pt').to(0, dtype=torch.float16)

output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(processor.decode(output[0][2:], skip_special_tokens=True))
```
BUT I ALSO HAVE THIS:

```
envs\myenv\lib\site-packages\transformers\models\clip\modeling_clip.py:540: UserWarning: 1Torch was not compiled with flash attention. (Triggered internally at C:\cb\pytorch_1000000000000\work\aten\src\ATen\native\transformers\cuda\sdp_utils.cpp:455.)
  attn_output = torch.nn.functional.scaled_dot_product_attention(
Expanding inputs for image tokens in LLaVa should be done in processing. Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly with `processor.patch_size = {{patch_size}}` and `processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. Using processors without these attributes in the config is deprecated and will throw an error in v4.47.
```
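If you prefer to fix this at processor-load time, which is what the deprecation warning is asking for, here is a minimal sketch that reads the values from the checkpoint's config instead of hard-coding them (the vision config of this checkpoint reports a patch size of 14, not 32, which is why the hard-coded value produces the wrong number of image tokens):

```python
from transformers import AutoConfig, AutoProcessor

model_id = "llava-hf/llava-1.5-7b-hf"
config = AutoConfig.from_pretrained(model_id)

# Pass the values from the checkpoint config rather than guessing them,
# so the processor expands the <image> placeholder to the right length.
processor = AutoProcessor.from_pretrained(
    model_id,
    patch_size=config.vision_config.patch_size,
    vision_feature_select_strategy=config.vision_feature_select_strategy,
)
```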