Open insujang opened 1 month ago
Hey @insujang! Right, this check passes only when the CLS token is removed from the image representation. We can check that `height * width == 576` if the selection strategy is `default`. Otherwise, check that it is `height * width + 1`. Would you like to open a PR for that?
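A minimal sketch of the strategy-aware check suggested above, written in plain Python so that it runs without the library. The function names `expected_token_count` and `validate_token_count` are hypothetical and are not part of transformers. The sketch covers only the count check, not how a retained CLS token would be handled when the patches are later reshaped into a grid.

```python
def expected_token_count(height: int, width: int, strategy: str) -> int:
    """Return how many image tokens the packing step should expect.

    Hypothetical helper: `default` drops the CLS token, `full` keeps it.
    """
    num_patches = height * width
    if strategy == "default":
        return num_patches          # CLS token already removed
    if strategy == "full":
        return num_patches + 1      # CLS token still present
    raise ValueError(f"Unexpected select strategy: {strategy!r}")


def validate_token_count(num_tokens: int, height: int, width: int, strategy: str) -> None:
    """Raise ValueError when the token count does not match the patch grid."""
    expected = expected_token_count(height, width, strategy)
    if num_tokens != expected:
        raise ValueError(
            f"Got {num_tokens} image tokens, expected {expected} "
            f"for a {height}x{width} grid with strategy {strategy!r}."
        )


# Shapes taken from this issue: a 24 x 24 patch grid.
validate_token_count(576, 24, 24, "default")  # passes
validate_token_count(577, 24, 24, "full")     # passes with the extra CLS token
```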
System Info
transformers==4.42.3
Who can help?
@zucchini-nlp
Reproduction
Working example: using the `default` `vision_feature_select_strategy`
Buggy example with the `full` `vision_feature_select_strategy`
Back to the main bug example: the error is caused by the following lines: https://github.com/huggingface/transformers/blob/c1aa0edb48217f416f4bbe6e3a9db1500284513b/src/transformers/models/llava_next/modeling_llava_next.py#L789-L792
Here, the shape of `selected_image_feature` before applying the select strategy is `[5, 577, 1024]`. If we use the `default` select strategy, one feature is removed from the second dimension, resulting in a `[5, 576, 1024]` feature dimension, which passes the condition in `self.pack_image_features()`: https://github.com/huggingface/transformers/blob/c1aa0edb48217f416f4bbe6e3a9db1500284513b/src/transformers/models/llava_next/modeling_llava_next.py#L662-L663 with `height = width = 24`, thus `height * width = 576`.
However, if the `full` select strategy is used, the feature keeps its unchanged dimension `[5, 577, 1024]`, which raises a `ValueError` in `self.pack_image_features()` because `height * width == 576 != 577`.
Expected behavior
Runs without an error, potentially producing a different output from that of the `default` strategy.
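For reference, the shape arithmetic from the reproduction section can be traced with a small library-free sketch. It works on shape tuples rather than real tensors, and the helpers `select_feature_shape` and `current_patch_check` are hypothetical stand-ins that mimic the behavior described in this issue. They are not the actual transformers code.

```python
def select_feature_shape(shape: tuple, strategy: str) -> tuple:
    """Return the (batch, tokens, hidden) shape after the select strategy."""
    batch, tokens, hidden = shape
    if strategy == "default":
        return (batch, tokens - 1, hidden)  # first (CLS) token is sliced off
    if strategy == "full":
        return shape                        # nothing is removed
    raise ValueError(f"Unexpected select strategy: {strategy!r}")


def current_patch_check(num_tokens: int, height: int, width: int) -> None:
    """Mimic the strategy-unaware check described in this issue."""
    if height * width != num_tokens:
        raise ValueError("The number of patches is not consistent with the image size.")


hidden_state_shape = (5, 577, 1024)  # shape reported in this issue
height = width = 24

default_shape = select_feature_shape(hidden_state_shape, "default")
current_patch_check(default_shape[1], height, width)  # 576 tokens: passes

full_shape = select_feature_shape(hidden_state_shape, "full")
try:
    current_patch_check(full_shape[1], height, width)  # 577 tokens: fails
    full_check_failed = False
except ValueError:
    full_check_failed = True
```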