huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Confusion about LlavaNextImageProcessor results #31275

Closed NaNillll closed 1 month ago

NaNillll commented 4 months ago

System Info

Transformers==4.40.0 torch==2.0.1

Who can help?

No response

Information

Tasks

Reproduction

Very simple:

```python
from PIL import Image
from transformers import LlavaNextProcessor

processor = LlavaNextProcessor.from_pretrained("/home/vllm/llava-hfllava-v1.6-34b-hf")
image = Image.open("llava_v1_5_radar.jpg")
image = image.resize((336, 336))
prompt = "<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n" + "<image>" * 2144 + "\n" + "What is shown in this image?<|im_end|><|im_start|>assistant\n"
img_ = processor(prompt, image, return_tensors="pt")["pixel_values"]
```
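For reference when reproducing this, the patch count the issue is about can be read directly off the returned tensor (a small addition, assuming the same checkpoint and a recent transformers version):

```python
# LLaVA-NeXT's image processor returns pixel_values of shape
# (batch_size, num_patches, num_channels, height, width), so the patch count is dimension 1.
print(img_.shape)  # e.g. torch.Size([1, 3, 3, 336, 336]) for the 336x336 input above
# The candidate resolutions the crops are chosen from:
print(processor.image_processor.image_grid_pinpoints)
```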

Expected behavior

According to the LLaVA-NeXT paper, after pre-processing the result should be [5, 336, 336, 3], which means the image is encoded into 5 parts. (Sorry, I cannot upload an image; you can refer to this: https://llava-vl.github.io/blog/assets/images/llava-1-6/high_res_arch_v2.png)

However, I find that when using a small image such as [336, 336] the result is [3, 336, 336, 3], but when using a large image like [1920, 1080] the result is [5, 336, 336, 3]. Why? This difference in size causes errors in my LLM framework, and since I don't know how the cropping works, I don't know how to fix it.

What is the principle behind this module? The documentation doesn't explain it in much detail. I am new to LLaVA. Thanks for your guidance!

amyeroberts commented 4 months ago

cc @NielsRogge

NaNillll commented 4 months ago

I think the result is related to the aspect ratio. I read the config of LLaVA-NeXT 34B and found three ratios in "image_grid_pinpoints": 1:1, 1:2, and 1:3. It looks like LLaVA aligns the image to the best-fitting ratio, then crops and samples. If aligned to 1:1 the result is [2x2+1, 336, 336]; if 1:2, [2x1+1, 336, 336]; if 1:3, [3x1+1, 336, 336].

So I tried different images and the pattern holds (input -> output):

- [672, 1344] -> [3, 336, 336]
- [1344, 2688] -> [3, 336, 336]
- [672, 672] -> [5, 336, 336]
- [336, 1008] -> [4, 336, 336]

However, when I input [336, 336] I expected [5, 336, 336], but it outputs [3, 336, 336]. Why? I think it should be aligned to 1:1. I also tried [335, 335] -> [3, 336, 336] and [337, 337] -> [5, 336, 336], so 336 looks like the turning point.

The config I use:

```json
{
  "architectures": [
    "LlavaNextForConditionalGeneration"
  ],
  "ignore_index": -100,
  "image_grid_pinpoints": [
    [
      336,
      672
    ],
    [
      672,
      336
    ],
    [
      672,
      672
    ],
    [
      1008,
      336
    ],
    [
      336,
      1008
    ]
  ],
  "image_token_index": 64000,
  "model_type": "llava_next",
  "projector_hidden_act": "gelu",
  "text_config": {
    "_name_or_path": "NousResearch/Nous-Hermes-2-Yi-34B",
    "architectures": [
      "LlamaForCausalLM"
    ],
    "eos_token_id": 7,
    "hidden_size": 7168,
    "intermediate_size": 20480,
    "max_position_embeddings": 4096,
    "model_type": "llama",
    "num_attention_heads": 56,
    "num_hidden_layers": 60,
    "num_key_value_heads": 8,
    "pad_token_id": 0,
    "rms_norm_eps": 1e-05,
    "rope_theta": 5000000.0,
    "torch_dtype": "bfloat16",
    "use_cache": false,
    "vocab_size": 64064
  },
  "torch_dtype": "float16",
  "transformers_version": "4.39.0.dev0",
  "use_image_newline_parameter": true,
  "vision_config": {
    "hidden_size": 1024,
    "image_size": 336,
    "intermediate_size": 4096,
    "model_type": "clip_vision_model",
    "num_attention_heads": 16,
    "num_hidden_layers": 24,
    "patch_size": 14,
    "projection_dim": 768,
    "vocab_size": 32000
  },
  "vision_feature_layer": -2,
  "vision_feature_select_strategy": "default",
  "vocab_size": 64064
}
```
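For reference, the patch counts described above follow directly from these pinpoints: each best-fit resolution is tiled into 336x336 crops, and one extra downsampled 336x336 overview of the whole image is appended. A small illustrative sketch (the helper name is made up, not a transformers API):

```python
def num_patches_for_pinpoint(height: int, width: int, crop: int = 336) -> int:
    # One 336x336 crop per grid cell, plus one 336x336 overview of the full image.
    return (height // crop) * (width // crop) + 1

for h, w in [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]:
    print((h, w), "->", num_patches_for_pinpoint(h, w), "patches")
# (336, 672) -> 3, (672, 336) -> 3, (672, 672) -> 5, (1008, 336) -> 4, (336, 1008) -> 4
```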
NaNillll commented 4 months ago

I think the best resolution is selected by this function; with original_size (336, 336) as input, it returns best_fit (336, 672):

```python
def select_best_resolution(original_size: tuple, possible_resolutions: list) -> tuple:
    """
    Selects the best resolution from a list of possible resolutions based on the original size.

    This is done by calculating the effective and wasted resolution for each possible resolution.

    The best fit resolution is the one that maximizes the effective resolution and minimizes the wasted resolution.

    Args:
        original_size (tuple):
            The original size of the image in the format (height, width).
        possible_resolutions (list):
            A list of possible resolutions in the format [(height1, width1), (height2, width2), ...].

    Returns:
        tuple: The best fit resolution in the format (height, width).
    """
    original_height, original_width = original_size
    best_fit = None
    max_effective_resolution = 0
    min_wasted_resolution = float("inf")

    for height, width in possible_resolutions:
        scale = min(width / original_width, height / original_height)
        downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
        effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
        wasted_resolution = (width * height) - effective_resolution

        if effective_resolution > max_effective_resolution or (
            effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution
        ):
            max_effective_resolution = effective_resolution
            min_wasted_resolution = wasted_resolution
            best_fit = (height, width)

    return best_fit
```
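Running this function with the grid pinpoints from the config above reproduces the turning point at 336 (a quick check added for illustration, not part of the original comment):

```python
pinpoints = [[336, 672], [672, 336], [672, 672], [1008, 336], [336, 1008]]

print(select_best_resolution((336, 336), pinpoints))    # (336, 672) -> 1x2 grid + overview = 3 patches
print(select_best_resolution((337, 337), pinpoints))    # (672, 672) -> 2x2 grid + overview = 5 patches
print(select_best_resolution((1080, 1920), pinpoints))  # (672, 672) -> 5 patches, as seen with the large image
```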
zucchini-nlp commented 4 months ago

@NaNillll you are right, the different resolutions are used to better fit different image aspect ratios and preserve more detail.

Regarding the 336 case, it selects (336, 672) as the best resolution because the function tries to preserve all of the image rather than crop it in the first place. That's why (337, 337) gets (672, 672) as the best resolution, and not (336, 672), because the latter would require cropping the image a bit. And as you pointed out, the different best resolutions are further divided into 3, 4, or 5 patches depending on the aspect ratio.
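In numbers (working through the `select_best_resolution` function shown above, added here as an illustration): a 336x336 input fits entirely into every pinpoint, so the effective resolution ties at 336x336 = 112896 pixels for all of them and the tie-break is the least wasted area, which (336, 672) wins (225792 - 112896 = 112896 wasted, versus 338688 for (672, 672)). A 337x337 input would have to be downscaled slightly to fit into (336, 672) (effective roughly 335x336), while (672, 672) keeps all 337x337 = 113569 pixels, so it wins on effective resolution.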

Correct me if I'm wrong @NielsRogge

NaNillll commented 4 months ago

> @NaNillll you are right, the different resolutions are used to better fit different image aspect ratios and preserve more detail.
>
> Regarding the 336 case, it selects (336, 672) as the best resolution because the function tries to preserve all of the image rather than crop it in the first place. That's why (337, 337) gets (672, 672) as the best resolution, and not (336, 672), because the latter would require cropping the image a bit. And as you pointed out, the different best resolutions are further divided into 3, 4, or 5 patches depending on the aspect ratio.
>
> Correct me if I'm wrong @NielsRogge

OK, thank you!

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.