Closed Xiaohui9607 closed 3 months ago
cc @zucchini-nlp
@Xiaohui9607 hey!
The helper functions' code for Llava-1.6 was copied directly from the original implementation and was tested for equivalence in generation between our version and the original one.
I think you're right: at first sight it seems that num_patch_width and num_patch_height are switched, but I think it's better to address this question to the authors.
Regarding the shortest edge being 12: that's expected, since the aspect ratio can be 2x1 or 3x1 and we know that the "1" here is always 24. So when we adjust to the original aspect ratio and unpad the image, we can get 12 or smaller values for height/width.
For a 2x1 or 3x1 ratio, yes, the "1" is always 24 and it should always be the shortest edge. The 2 or 3 then corresponds to the long edge, which should have more than 24 patches to maintain the original ratio. For example, for an image (400,800), the best res should be (336,672). The short edge is then rescaled to 336 (24 tokens, i.e. 14 pixels per token) and the long edge to 336/400*800 = 672 pixels, which is 672/14 = 48 tokens.
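The arithmetic in this example can be checked with a short sketch. This is not the transformers code; `expected_token_grid` is a hypothetical helper, sizes are assumed to be (height, width), and the patch size of 14 is inferred from 336 pixels mapping to 24 tokens.

```python
PATCH_SIZE = 14  # assumption: 336 pixels / 24 tokens per edge


def expected_token_grid(orig_size, target_size, patch_size=PATCH_SIZE):
    """Hypothetical helper: token grid after an aspect-preserving resize.

    Both sizes are (height, width) in pixels. The image is scaled so that it
    fits inside target_size without changing its aspect ratio, then each edge
    is divided by the patch size.
    """
    orig_h, orig_w = orig_size
    tgt_h, tgt_w = target_size
    # Largest scale at which the resized image still fits inside the target.
    scale = min(tgt_h / orig_h, tgt_w / orig_w)
    new_h, new_w = round(orig_h * scale), round(orig_w * scale)
    return new_h // patch_size, new_w // patch_size


print(expected_token_grid((400, 800), (336, 672)))
```

Under these assumptions the short edge keeps 24 tokens and the long edge gets 48, so neither edge should drop below 24 for this image.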
Oh, yes, you're right! This stems from the previous observation that num_patch_width and num_patch_height are switched. I ran a few tests with them switched back to get the correct shapes, and didn't actually notice a quality difference for high-res images.
I will open an issue in the project repo and ask the authors if it's intended, and fix it in transformers in case it's a bug. I'll tag you there :)
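For anyone following along, here is a shape-only sketch of how a swapped grid could produce a short edge of 12. It is an illustration under assumptions, not the library implementation: `unpadded_grid` is a hypothetical helper that only mimics the idea of cropping padding from a token grid so that it matches the original aspect ratio, with all sizes given as (height, width).

```python
def unpadded_grid(grid_hw, orig_hw):
    """Hypothetical helper: token-grid shape after cropping padding.

    grid_hw is the (height, width) of the token grid before unpadding and
    orig_hw is the (height, width) of the original image in pixels.
    """
    cur_h, cur_w = grid_hw
    orig_h, orig_w = orig_hw
    if orig_w / orig_h > cur_w / cur_h:
        # The image is relatively wider than the grid: crop rows top and bottom.
        new_h = int(orig_h * cur_w / orig_w)
        pad = (cur_h - new_h) // 2
        return cur_h - 2 * pad, cur_w
    # Otherwise crop columns on the left and right.
    new_w = int(orig_w * cur_h / orig_h)
    pad = (cur_w - new_w) // 2
    return cur_h, cur_w - 2 * pad


orig = (400, 800)  # assumed to be (height, width)
print(unpadded_grid((24, 48), orig))  # grid oriented like the image
print(unpadded_grid((48, 24), orig))  # grid with height and width swapped
```

With the grid oriented like the image nothing is cropped and the shape stays 24x48. With the two dimensions swapped, the same logic crops the 48-row axis down to 12, which would be consistent with the short edge of 12 reported in this issue.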
System Info
I am checking the source code of llava-next, particularly the file modeling_llava_next.py
Who can help?
No response
Reproduction
I modified the following source code in this file by adding three print statements; the code starts at line 656.
I think the assignment of num_patch_width and num_patch_height should be switched, since I get some strange results for resolutions that are not square. To reproduce, I just tried different image resolutions:
case 1 (res=(400,900))
output:
case 2 (res=(400,800))
output:
case 3 (res=(1024,899))
output:
Expected behavior
I feel like the shortest edge should be at least 24, yet I got 12 for some cases.