huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

LLaVa Left Padding Got Weird Results #28184

Closed SeungyounShin closed 6 months ago

SeungyounShin commented 9 months ago

System Info

Reproduce:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf").to(
    "cuda"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)
for key in inputs:
    inputs[key] = inputs[key].to("cuda")
    print(key, inputs[key].shape)

# Generate
generate_ids = model.generate(**inputs, max_length=512)
outputs = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(outputs)

This will output:

["\n \nUSER: What's the the difference of two images?\nASSISTANT: In the two images, the primary difference is the presence of a flower in the dog's mouth. In the first image, the dog is holding a flower in its mouth, while in the second image, the dog is not holding a flower. This subtle change in the scene highlights the dog's interaction with the flower, and it may evoke different emotions or interpretations depending on the viewer's perspective.", '\nUSER: Describe the image.\nASSISTANT: The dog is a \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n', '\nUSER: Describe the image.\nASSISTANT: The \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nЪ schließ']

I checked that the images are placed correctly, but for batch items 2 and 3 the input consists of a lot of padding (False x 583): [False x 583, False, True x 576, False, False, False, False, False, False, False, False, False, False, False, False, False, False]
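
For reference, here is a quick way to inspect the tokenizer-level padding per sample (a minimal sketch, assuming the inputs object from the snippet above; at this stage each <image> is still a single placeholder token, before the model expands it to the 576 patch positions):

# Count real (non-padded) tokens per batch item from the processor output.
for i, mask in enumerate(inputs["attention_mask"]):
    print(f"sample {i}: {int(mask.sum())} real tokens out of {mask.numel()}")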

I guess LLaVA doesn't see this kind of prefix during the training phase, which would result in weird behavior.

Who can help?

No response

Reproduction

Stated above.

Expected behavior

skip

amyeroberts commented 9 months ago

cc @younesbelkada @ArthurZucker

younesbelkada commented 9 months ago

Hi @SeungyounShin, what transformers version are you using? In the first input, prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:", you passed two images; note that multi-image queries are not well supported for Llava-like models, as they have not been explicitly trained for that according to the authors.
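
For reference, the installed version can be printed with:

import transformers

print(transformers.__version__)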

younesbelkada commented 9 months ago

By the way, you can also do inputs = inputs.to("cuda").
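
For example, instead of looping over the keys (a minimal sketch, assuming the inputs object from the reproduction above):

# BatchFeature/BatchEncoding objects expose .to(), which moves every tensor at once.
inputs = inputs.to("cuda")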

SeungyounShin commented 9 months ago

I am currently using 4.37.0.dev0

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\n<image>\nUSER: Describe the two images.\nASSISTANT:"
# prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)

This will output:

 [1]
USER: What's the the difference of two images?
ASSISTANT: In the two images, the primary difference is the presence of a flower in the dog's mouth. In the first image, the dog is holding a flower in its mouth, while in the second image, the dog is not holding a flower. This subtle change in the scene highlights the dog's interaction with the flower, and it may evoke different emotions or interpretations depending on the viewer's perspective.

 [2]
USER: Describe the two images.
ASSISTANT: The two images show a cute brown and white dog standing on a grassy hill. In one image, the dog is holding a green leaf in its mouth, while in the other, it is holding a yellow flower. Both images capture the dog's playful and curious nature as it interacts with its surroundings.

The implementation appears to be functioning correctly. Upon review, I noticed that the final embedding effectively supports multiple images.

SeungyounShin commented 9 months ago

Regarding modeling_llava.py#L304: is this expected behavior?

Considering the relationship between image patches: if image patch 100 references image patch 84, it appears there shouldn't be any issue. I haven't come across any mention of masking between image patches in the LLaVA paper. Is this approach used in the official implementation of LLaVA?

It would be beneficial to have an example of fine-tuning on multiple images. Would you be open to accepting a pull request (PR) that adds such an example?

younesbelkada commented 9 months ago

Hi @SeungyounShin Indeed it seems you are correct: despite the model not being explicitly trained for this, it performs well on some examples, as you shared, which is very nice! cc @haotian-liu for visibility! I suspect something is off with SDPA (torch.scaled_dot_product_attention) not being able to deal with arbitrary attention masks. I need some time to properly investigate how to fix this. Meanwhile, you can do two things:

1- Use the eager attention implementation:

from PIL import Image
import requests
from transformers import AutoProcessor, LlavaForConditionalGeneration

- model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf").to(
+ model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf", attn_implementation="eager").to(
    "cuda"
)
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)
for key in inputs:
    inputs[key] = inputs[key].to("cuda")
    print(key, inputs[key].shape)

# Generate
generate_ids = model.generate(**inputs, max_length=512)
outputs = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(outputs)

2- Process the prompts one by one instead of performing batched generation (see the sketch below)
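
A minimal sketch of option 2, assuming the same model, processor, prompts, and images as in the reproduction above, and keeping the same image-to-prompt pairing as the batched call:

# Run generation per prompt so no batch padding is needed at all.
prompts_and_images = [
    (prompt1, [image1, image2]),  # prompt1 references two images
    (prompt2, [image1]),
    (prompt3, [image2]),
]

outputs = []
for prompt, images in prompts_and_images:
    single_inputs = processor(text=prompt, images=images, return_tensors="pt").to("cuda")
    generate_ids = model.generate(**single_inputs, max_length=512)
    outputs.append(
        processor.batch_decode(
            generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
        )[0]
    )

print(outputs)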

cc @fxmarty as well, as this is about SDPA

fxmarty commented 8 months ago

@younesbelkada is this, in the end, not related to SDPA?

younesbelkada commented 7 months ago

@fxmarty I think it is related to SDPA, as the Llava model creates a non-standard attention mask and the script fails with SDPA.

haotian-liu commented 7 months ago

@younesbelkada I also found a similar issue when I tried to implement batch inference. Do you know why it creates a non-standard attention mask? It should theoretically use the standard autoregressive mask, right?

younesbelkada commented 7 months ago

@haotian-liu I think this happens in the case where you have different numbers of images per prompt plus multi-turn chat. If, let's say, you have 2 images in the first prompt and one image in the second prompt, your attention mask will look like:

[image 1] [prompt 1] [image 2] [prompt 2]
0 0 0.. 0  1 1 1 1 1 .. 1 0 0 0 ... 0 1 1 1 1 1 ... 1
[image 3] [prompt 3]
0 0 0.. 0  1 1 1 1 1 .. 1

I think the reason that, for the prompts

prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)

we are getting a non-standard attention mask is the presence of the \n between the two <image> tokens in prompt1. Can you try out the following:

- prompt1 = "<image>\n<image>\nUSER: What's the the difference of two images?\nASSISTANT:"
+ prompt1 = "<image><image>\nUSER: What's the the difference of two images?\nASSISTANT:"
prompt2 = "<image>\nUSER: Describe the image.\nASSISTANT:"
prompt3 = "<image>\nUSER: Describe the image.\nASSISTANT:"
url1 = "https://images.unsplash.com/photo-1552053831-71594a27632d?q=80&w=3062&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
url2 = "https://images.unsplash.com/photo-1617258683320-61900b281ced?q=80&w=3087&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
image1 = Image.open(requests.get(url1, stream=True).raw)
image2 = Image.open(requests.get(url2, stream=True).raw)

inputs = processor(
    text=[prompt1, prompt2, prompt3],
    images=[image1, image2, image1, image2],
    return_tensors="pt",
    padding=True,
)

That way the attention mask should become standard, I believe. cc @haotian-liu, what do you think?

haotian-liu commented 7 months ago

@younesbelkada Thank you! I thought it might be due to a different reason, as the strange behavior occurred when I previously tried to do batch inference with one image for each sample. I'll try to find another example later to see if it still exists.

fxmarty commented 6 months ago

Hi, this should be fixed by https://github.com/huggingface/transformers/pull/29389. Could you give it a second try? Thank you for the report!