huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

llava-next, any resolution bug? #31327

Closed Xiaohui9607 closed 3 months ago

Xiaohui9607 commented 5 months ago

System Info

I am checking the source code of llava-next, particularly the file modeling_llava_next.py

Who can help?

No response

Reproduction

I added three print statements to the source code in this file, starting at line 656:

# height and width are both 24 here (image_size 336 // patch_size 14)
num_patch_width, num_patch_height = get_anyres_image_grid_shape(
    image_sizes[image_idx],
    self.config.image_grid_pinpoints,
    self.config.vision_config.image_size,
)
print(num_patch_width, num_patch_height)  # debug: patch grid shape
print(image_feature.shape)  # debug: (num_crops, tokens_per_crop, hidden_dim)
image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
image_feature = image_feature.flatten(1, 2).flatten(2, 3)
image_feature = unpad_image(image_feature, image_sizes[image_idx])
print(image_feature.shape)  # debug: (hidden_dim, height_tokens, width_tokens)

I think the assignment num_patch_width, num_patch_height should be swapped, since for resolutions that are not square I get strange results. To reproduce, I just tried different image resolutions with the setup sketched below.
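A minimal setup sketch (my assumption, not spelled out in the original report; any LLaVA-NeXT checkpoint whose chat format matches the "[INST] ... [/INST]" prompt below should reproduce this):

import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-mistral-7b-hf"  # assumed checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16
).to("cuda:0")

case 1 (res=(400,900))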

image = Image.open('llava_v1_5_radar.jpg').resize((400,900))
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
generation_output = model.generate(**inputs, max_new_tokens=100, output_attentions=True, return_dict_in_generate=True)

output:

3 1
torch.Size([3, 576, 4096])
torch.Size([4096, 24, 10])

case 2 (res=(400,800))

image = Image.open('llava_v1_5_radar.jpg').resize((400,800))
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
generation_output = model.generate(**inputs, max_new_tokens=100, output_attentions=True, return_dict_in_generate=True)

output:

2 1
torch.Size([2, 576, 4096])
torch.Size([4096, 24, 12])

case 3 (res=(1024,899))

image = Image.open('llava_v1_5_radar.jpg')
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
inputs = processor(prompt, image, return_tensors="pt").to("cuda:0")
generation_output = model.generate(**inputs, max_new_tokens=100, output_attentions=True, return_dict_in_generate=True)

output:

2 2
torch.Size([4, 576, 4096])
torch.Size([4096, 42, 48])

Expected behavior

I would expect the shorter edge to always be at least 24 tokens, yet I get 12 in some cases.

amyeroberts commented 5 months ago

cc @zucchini-nlp

zucchini-nlp commented 5 months ago

@Xiaohui9607 hey!

The helper functions' code for Llava-1.6 was copied verbatim from the original implementation and was tested for generation equivalence between our version and the original one.

I think you're right: at first sight it does look like num_patch_width and num_patch_height are swapped, but I think it's better to put this question to the authors.

Regarding the shortest edge being 12: it's expected, since the aspect ratio can be 2x1 or 3x1, and we know that the "1" side here is always 24. So when we adjust to the original aspect ratio and unpad the image, we can get 12 or smaller values for height/width.
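For reference, the unpadding works roughly like this (a simplified sketch of unpad_image, assuming features shaped (hidden_dim, H_tokens, W_tokens) and original_size given as (height, width)):

def unpad_image(tensor, original_size):
    # Strip the padding added to fit the image into the patch grid, so the
    # token map matches the original aspect ratio.
    original_height, original_width = original_size
    current_height, current_width = tensor.shape[1], tensor.shape[2]
    if original_width / original_height > current_width / current_height:
        # Image is wider than the grid: padding sits on top/bottom.
        scale = current_width / original_width
        pad = (current_height - int(original_height * scale)) // 2
        return tensor[:, pad : current_height - pad, :]
    else:
        # Image is taller than the grid: padding sits on left/right.
        scale = current_height / original_height
        pad = (current_width - int(original_width * scale)) // 2
        return tensor[:, :, pad : current_width - pad]

With the swapped grid from case 2, this turns a (4096, 24, 48) tensor with original size (800, 400) into (4096, 24, 12), which is exactly the shape printed above.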

Xiaohui9607 commented 5 months ago

For a 2x1 or 3x1 grid, yes, the "1" side is always 24 and it should always be the shorter edge; the "2" or "3" side then corresponds to the longer edge, so it should have more than 24 patches to preserve the original aspect ratio. For example, for a (400, 800) image the best resolution is (336, 672): the short edge is rescaled to 336 px (336/14 = 24 tokens) and the long edge to 336/400*800 = 672 px (672/14 = 48 tokens).
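The same arithmetic in one place (my own sketch; patch size 14 assumed from the ViT-L/14-336 vision tower):

def expected_unpadded_grid(orig_w, orig_h, best_w, best_h, patch=14):
    # Fit the image inside the best resolution while preserving aspect ratio;
    # after unpadding, tokens should only cover the image region.
    scale = min(best_w / orig_w, best_h / orig_h)
    return int(orig_h * scale) // patch, int(orig_w * scale) // patch

print(expected_unpadded_grid(400, 800, 336, 672))  # (48, 24): 48 tokens tall, 24 wide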

zucchini-nlp commented 5 months ago

Oh, yes, you're right! This stems from the previous observation that num_patch_width and num_patch_height are swapped. I ran a few tests with them swapped back to get the correct shapes, and didn't actually notice a quality difference for high-res images.
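To make the swap concrete, here is a minimal shape trace (my own sketch, reusing the reshape lines from the snippet above; 24x24 tokens per crop and hidden size 4096 assumed):

import torch

height = width = 24  # tokens per side for the 336 px / patch-14 vision tower
hidden = 4096
feats = torch.zeros(2, height * width, hidden)  # 2 grid crops, e.g. a (400, 800) image

def assemble(num_patch_height, num_patch_width):
    x = feats.view(num_patch_height, num_patch_width, height, width, hidden)
    x = x.permute(4, 0, 2, 1, 3).contiguous()
    return x.flatten(1, 2).flatten(2, 3)

print(assemble(1, 2).shape)  # swapped values: torch.Size([4096, 24, 48])
print(assemble(2, 1).shape)  # expected for a tall image: torch.Size([4096, 48, 24])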

I will open an issue in the project repo and ask the authors whether it's intended, and fix it in transformers in case it's a bug. I'll tag you there :)

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.