Potential Bug in llava_next when calling pack_image_features function.

yinsong1986 commented 3 months ago

System Info

To the best of my understanding, https://github.com/huggingface/transformers/blob/12b1620e615592fbf099d4ec44af7b9f2d1b48aa/src/transformers/models/llava_next/modeling_llava_next.py#L656 the output from this function should be (height, width), so it should be changed to num_patch_height, num_patch_width = get_anyres_image_grid_shape(. Any thought? Thank you!

Who can help?

@amyeroberts

Information

[ ] The official example scripts
[ ] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

https://github.com/huggingface/transformers/blob/12b1620e615592fbf099d4ec44af7b9f2d1b48aa/src/transformers/models/llava_next/modeling_llava_next.py#L656 the output from this function should be (height, width), so it should be changed to num_patch_height, num_patch_width = get_anyres_image_grid_shape(

Expected behavior

https://github.com/huggingface/transformers/blob/12b1620e615592fbf099d4ec44af7b9f2d1b48aa/src/transformers/models/llava_next/modeling_llava_next.py#L656 the output from this function should be (height, width), so it should be changed to num_patch_height, num_patch_width = get_anyres_image_grid_shape(

zucchini-nlp commented 3 months ago

Same as #31327. I asked the authors of LLaVa-NeXT and didn't get any reply yet.

For me it also look like swapped and should be the other way, but since that is how LLaVa-NeXT authors implemented it in their repo and I didn't see much difference by running a few examples between the two swaps, I decided to not flag it as a bug yet and wait for the authors' reply.

Let me know if you ran an evaluation and found that swapping back to num_patch_height, num_patch_width is better in some (OCR, high-res images?) or all tasks!

cc @NielsRogge also, who added the model

yinsong1986 commented 3 months ago

Hi @zucchini-nlp Thanks for your reply!

I think in the original implementation, they kind of keeping the order as (width, height), but for this hf implementation, you kind of keep the order as (height, width) almost everywhere. An example of comparing the two can be found below:

so your current implementation probably is not quite implemented same as the original implement, as far as I understand it :)

Pls correct me if I am wrong.Thanks!

zucchini-nlp commented 3 months ago

@yinsong1986 Right, I didn't notice that first time I looked. So, now I did more digging and compared both implementations. From what I see, there is no bug and it's simply the naming which is a bit different from the original LLaVarepo.

If we compare select_best_resolution as you pointed out, the height and width are swapped (only names since the resulting best resolution is same regardless of how you call it). Later in this piece of code we still follow the "height, width" naming,

https://github.com/huggingface/transformers/blob/730a440734e1fb47c903c17e3231dac18e3e5fd6/src/transformers/models/llava_next/modeling_llava_next.py#L73-L74

but we swap back the names as it should be here

https://github.com/huggingface/transformers/blob/730a440734e1fb47c903c17e3231dac18e3e5fd6/src/transformers/models/llava_next/modeling_llava_next.py#L656

So if my understanding is correct, at the end we end up with the width and height in the places where they should be. Also we ran an equivalence test between two implementations and got nearly same logits, which I believe supports my claim that it's not a bug.

But I agree that it's quite counter-intuitive to see a sudden swap between the two in above lines. I will fix the naming next week :)

yinsong1986 commented 3 months ago

Thank you and look forward to the updated code!

yinsong1986 commented 3 months ago

@zucchini-nlp

FYI: in the original implementation from https://github.com/LLaVA-VL/LLaVA-NeXT, they didn't do any swap of the (width, height), when calling get_anyres_image_grid_shape. The source code is as below:

Hope it helps with your refactoring code. Thank you!

huggingface / transformers