yinsong1986 closed this issue 1 month ago
Same as #31327. I asked the LLaVA-NeXT authors and haven't received a reply yet.
To me it also looks swapped and should be the other way around, but since that is how the LLaVA-NeXT authors implemented it in their repo, and I didn't see much difference when running a few examples with either ordering, I decided not to flag it as a bug yet and to wait for the authors' reply.
Let me know if you run an evaluation and find that swapping back to num_patch_height, num_patch_width is better on some tasks (OCR, high-res images?) or on all of them!
Also cc @NielsRogge, who added the model.
Hi @zucchini-nlp, thanks for your reply!
I think the original implementation keeps the order as (width, height), while this HF implementation keeps the order as (height, width) almost everywhere. An example comparing the two can be found below:
So your current implementation is probably not quite the same as the original one, as far as I understand it :)
Please correct me if I am wrong. Thanks!
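To illustrate the ordering concern with hypothetical values (these numbers are not taken from either repo), the two conventions differ only in variable names:

```python
# Hypothetical example: suppose the best resolution picked for an image
# is 672 pixels wide and 336 pixels tall.

# Original LLaVA-NeXT convention: unpack as (width, height).
width, height = 672, 336

# HF transformers convention: the same tuple unpacked with swapped names,
# so the variable called `hf_height` actually holds the image width.
hf_height, hf_width = 672, 336

# The underlying values are identical; only the names differ, which is
# what makes a consistency bug easy to introduce downstream.
assert (width, height) == (hf_height, hf_width)
```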
@yinsong1986 Right, I didn't notice that the first time I looked. I've now done more digging and compared both implementations. From what I see, there is no bug; it's simply that the naming differs a bit from the original LLaVA repo.
If we compare select_best_resolution, as you pointed out, the height and width are swapped (only the names; the resulting best resolution is the same regardless of what you call it). Later, in this piece of code, we still follow the "height, width" naming,
but we swap the names back, as they should be, here.
So if my understanding is correct, at the end the width and height end up in the places where they should be. We also ran an equivalence test between the two implementations and got nearly identical logits, which I believe supports my claim that it's not a bug.
But I agree that it's quite counter-intuitive to see a sudden swap between the two in the lines above. I will fix the naming next week :)
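The double swap described above can be sketched with a toy example (hypothetical function and numbers, not the actual transformers code): as long as the names are swapped consistently at unpack time and at use time, the computed grid shape is unchanged.

```python
PATCH = 336  # hypothetical patch size

def select_best_resolution_toy():
    # Pretend the best resolution for some image is 672 wide x 336 tall,
    # returned in (width, height) order as in the original repo.
    return 672, 336

# Original-style caller: names match the return order.
w, h = select_best_resolution_toy()
grid_original = (h // PATCH, w // PATCH)  # (num_patch_height, num_patch_width)

# HF-style caller: names swapped at unpack time...
height, width = select_best_resolution_toy()  # `height` actually holds 672
# ...and swapped again when the grid is built, so the two swaps cancel.
grid_hf = (width // PATCH, height // PATCH)  # still (num_patch_height, num_patch_width)

assert grid_original == grid_hf == (1, 2)
```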
Thank you, and I look forward to the updated code!
@zucchini-nlp
FYI: in the original implementation from https://github.com/LLaVA-VL/LLaVA-NeXT, they do not swap (width, height) when calling get_anyres_image_grid_shape. The source code is as below:
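A sketch of that function, paraphrased from memory of the public LLaVA-NeXT repo rather than quoted verbatim; the select_best_resolution stub below is simplified for illustration and is not the repo's real scoring logic:

```python
import ast

def select_best_resolution(original_size, possible_resolutions):
    # Simplified stub: the real function scores each candidate resolution
    # against the image size; here we just return the first candidate.
    return possible_resolutions[0]

def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
    # Paraphrase of the original repo's logic: the return value of
    # select_best_resolution is unpacked as (width, height) and passed
    # through in that same order -- no swap anywhere.
    if isinstance(grid_pinpoints, list):
        possible_resolutions = grid_pinpoints
    else:
        possible_resolutions = ast.literal_eval(grid_pinpoints)
    width, height = select_best_resolution(image_size, possible_resolutions)
    return width // patch_size, height // patch_size
```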
Hope it helps with your refactoring. Thank you!
System Info
To the best of my understanding, in https://github.com/huggingface/transformers/blob/12b1620e615592fbf099d4ec44af7b9f2d1b48aa/src/transformers/models/llava_next/modeling_llava_next.py#L656 the output from this function should be (height, width), so the call should be changed to
num_patch_height, num_patch_width = get_anyres_image_grid_shape(
Any thoughts? Thank you!

Who can help?
@amyeroberts
Information

Tasks

Reproduction
In https://github.com/huggingface/transformers/blob/12b1620e615592fbf099d4ec44af7b9f2d1b48aa/src/transformers/models/llava_next/modeling_llava_next.py#L656, the output from this function should be (height, width), so the call should be changed to:
num_patch_height, num_patch_width = get_anyres_image_grid_shape(
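Shown as a diff against that line (the current unpack order is inferred from the report above, not copied from the file):

```diff
- num_patch_width, num_patch_height = get_anyres_image_grid_shape(
+ num_patch_height, num_patch_width = get_anyres_image_grid_shape(
```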
Expected behavior
The unpack at https://github.com/huggingface/transformers/blob/12b1620e615592fbf099d4ec44af7b9f2d1b48aa/src/transformers/models/llava_next/modeling_llava_next.py#L656 should read
num_patch_height, num_patch_width = get_anyres_image_grid_shape(
so that the variable names match the function's (height, width) output.