cc @NielsRogge
I think the result is related to the aspect ratio. I read the config of LLaVA-NeXT 34B and found three aspect ratios in "image_grid_pinpoints": 1:1, 1:2 (and 2:1), and 1:3 (and 3:1). It looks like LLaVA aligns the image to an optimal ratio, then crops and samples. If aligned to 1:1, the output is [2x2+1, 336, 336]; if 1:2, [2x1+1, 336, 336]; if 1:3, [3x1+1, 336, 336].
So I tried different images and confirmed this (input -> output):

- [672, 1344] -> [3, 336, 336]
- [1344, 2688] -> [3, 336, 336]
- [672, 672] -> [5, 336, 336]
- [336, 1008] -> [4, 336, 336]
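The patch counts above follow directly from the chosen grid: one 336x336 tile per grid cell, plus one downsized overview of the whole image (the "+1"). A minimal sketch of that arithmetic, where the tile size 336 comes from `vision_config.image_size` in the config quoted below and `patches_from_grid` is just an illustrative helper, not a transformers API:

```python
# One 336x336 tile per grid cell, plus one resized overview of the
# whole image -- that is the "+1" in the counts above.
TILE = 336  # vision_config.image_size

def patches_from_grid(height: int, width: int, tile: int = TILE) -> int:
    return (height // tile) * (width // tile) + 1

print(patches_from_grid(672, 672))   # 2x2 grid + 1 -> 5
print(patches_from_grid(336, 672))   # 1x2 grid + 1 -> 3
print(patches_from_grid(336, 1008))  # 1x3 grid + 1 -> 4
```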
However, when I input [336, 336], I expected [5, 336, 336], since I think it should be aligned to 1:1 -- but it outputs [3, 336, 336]. Why? I also tried [335, 335] -> [3, 336, 336] and [337, 337] -> [5, 336, 336]. It looks like 336 is the turning point.
The config I use:
```json
{
  "architectures": ["LlavaNextForConditionalGeneration"],
  "ignore_index": -100,
  "image_grid_pinpoints": [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]],
  "image_token_index": 64000,
  "model_type": "llava_next",
  "projector_hidden_act": "gelu",
  "text_config": {
    "_name_or_path": "NousResearch/Nous-Hermes-2-Yi-34B",
    "architectures": ["LlamaForCausalLM"],
    "eos_token_id": 7,
    "hidden_size": 7168,
    "intermediate_size": 20480,
    "max_position_embeddings": 4096,
    "model_type": "llama",
    "num_attention_heads": 56,
    "num_hidden_layers": 60,
    "num_key_value_heads": 8,
    "pad_token_id": 0,
    "rms_norm_eps": 1e-05,
    "rope_theta": 5000000.0,
    "torch_dtype": "bfloat16",
    "use_cache": false,
    "vocab_size": 64064
  },
  "torch_dtype": "float16",
  "transformers_version": "4.39.0.dev0",
  "use_image_newline_parameter": true,
  "vision_config": {
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768,
    "vocab_size": 32000
  },
  "vision_feature_layer": -2,
  "vision_feature_select_strategy": "default",
  "vocab_size": 64064
}
```
I think the best ratio is selected by this function; with original_size (336, 336) as input, it returns best_fit (336, 672):
```python
def select_best_resolution(original_size: tuple, possible_resolutions: list) -> tuple:
    """
    Selects the best resolution from a list of possible resolutions based on the original size.

    This is done by calculating the effective and wasted resolution for each possible resolution.
    The best fit resolution is the one that maximizes the effective resolution and minimizes the
    wasted resolution.

    Args:
        original_size (tuple):
            The original size of the image in the format (height, width).
        possible_resolutions (list):
            A list of possible resolutions in the format [(height1, width1), (height2, width2), ...].

    Returns:
        tuple: The best fit resolution in the format (height, width).
    """
    original_height, original_width = original_size
    best_fit = None
    max_effective_resolution = 0
    min_wasted_resolution = float("inf")

    for height, width in possible_resolutions:
        scale = min(width / original_width, height / original_height)
        downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
        effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
        wasted_resolution = (width * height) - effective_resolution

        if effective_resolution > max_effective_resolution or (
            effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution
        ):
            max_effective_resolution = effective_resolution
            min_wasted_resolution = wasted_resolution
            best_fit = (height, width)

    return best_fit
```
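To check this end to end, here is a small, self-contained replay of the reported cases: the same selection logic as the function quoted above, plus the tiles-plus-overview patch count (`num_patches` is an illustrative helper for this thread, not a transformers API):

```python
# Replay the reported cases: select_best_resolution uses the same logic
# as the function quoted above; num_patches adds the "grid tiles + 1
# overview image" rule with 336x336 tiles.

def select_best_resolution(original_size, possible_resolutions):
    original_height, original_width = original_size
    best_fit = None
    max_effective_resolution = 0
    min_wasted_resolution = float("inf")
    for height, width in possible_resolutions:
        scale = min(width / original_width, height / original_height)
        downscaled_width = int(original_width * scale)
        downscaled_height = int(original_height * scale)
        effective_resolution = min(
            downscaled_width * downscaled_height, original_width * original_height
        )
        wasted_resolution = (width * height) - effective_resolution
        if effective_resolution > max_effective_resolution or (
            effective_resolution == max_effective_resolution
            and wasted_resolution < min_wasted_resolution
        ):
            max_effective_resolution = effective_resolution
            min_wasted_resolution = wasted_resolution
            best_fit = (height, width)
    return best_fit

def num_patches(original_size, possible_resolutions, tile=336):
    height, width = select_best_resolution(original_size, possible_resolutions)
    return (height // tile) * (width // tile) + 1  # grid tiles + base image

# image_grid_pinpoints from the config above
grid = [(336, 672), (672, 336), (672, 672), (1008, 336), (336, 1008)]

print(select_best_resolution((336, 336), grid))  # -> (336, 672), not (672, 672)
print(num_patches((336, 336), grid))             # 1x2 grid + 1 -> 3
print(num_patches((337, 337), grid))             # 2x2 grid + 1 -> 5
print(num_patches((672, 1344), grid))            # 1x2 grid + 1 -> 3
print(num_patches((336, 1008), grid))            # 1x3 grid + 1 -> 4
```

This reproduces the turning point at 336: a (336, 336) image fits losslessly into (336, 672), so the 1:2 grid wins on wasted resolution, while (337, 337) can only keep all of its pixels in the (672, 672) grid.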
@NaNillll you are right, the different resolutions are used to better fit different image aspect ratios and preserve more detail.
Regarding the 336 case, the function selected (336, 672) as the best resolution because it tries first to preserve all of the image rather than crop it. That's why (337, 337) gets (672, 672) as the best resolution and not (336, 672): the latter would require cropping the image slightly. And as you pointed out, the different best resolutions are further divided into 3, 4, or 5 patches depending on the aspect ratio.
Correct me if I'm wrong @NielsRogge
OK, thank you!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
System Info
Transformers==4.40.0 torch==2.0.1
Who can help?
No response
Information
Tasks
Reproduction
Very simple.
Expected behavior
According to the LLaVA-NeXT paper, after the preprocessor the result is [5, 336, 336, 3], which means the image is encoded into 5 parts. (Sorry, I can not upload an image; you can refer to https://llava-vl.github.io/blog/assets/images/llava-1-6/high_res_arch_v2.png)
However, I find that when using a small image such as [336, 336], the result is [3, 336, 336, 3], but when using a large image like [1920, 1080], the result is [5, 336, 336, 3]. Why? This difference in size causes errors in my LLM framework, and since I don't know how the cropping works, I don't know how to fix it.
What is the principle of this module? The documentation doesn't explain it in much detail. I am a newbie to LLaVA. Thanks for your instructions!