huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Image size understanding in DinoV2 and Transformers generally #27977

Closed · lombardata closed this issue 5 months ago

lombardata commented 6 months ago

Feature request

Hi everyone, I was playing with the Dinov2 model from the HF transformers library and I have a question: is there a way to change the model's input image size, as in the timm library? On 11 August they added an option to https://github.com/huggingface/pytorch-image-models to change the input image size, e.g.:

"Example validation cmd to test w/ non-square resize
python validate.py /imagenet --model swin_base_patch4_window7_224.ms_in22k_ft_in1k --amp --amp-dtype bfloat16 --input-size 3 256 320 --model-kwargs window_size=8,10 img_size=256,320"

Is there a way to do the same with the transformers library? I tried to change image_size in the config.json file, but since the image is then processed by the processor, my understanding is that the output will always match the "crop_size" parameter in preprocessor_config.json. What would be the best practice for feeding an entire image to the model (if there is a way)? Thank you all in advance!
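
For context, a rough Python equivalent of the quoted validate.py command (a sketch only; it assumes timm's create_model forwards the --model-kwargs values, window_size and img_size, straight to the model, as the command above implies):

import timm

# Sketch: mirrors the quoted validate.py --model-kwargs, not verified here
model = timm.create_model(
    "swin_base_patch4_window7_224.ms_in22k_ft_in1k",
    pretrained=True,
    img_size=(256, 320),    # non-square (height, width)
    window_size=(8, 10),
)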

Motivation

add custom image input size like in timm

Your contribution

timm is an HF library, so it should be straightforward to integrate this functionality into the transformers library.

amyeroberts commented 6 months ago

Hi @lombardata,

You can specify both the size the image is resized to during the resize call and the crop size e.g.:

image_processor = AutoImageProcessor.from_pretrained(checkpoint, crop_size={"height": 320, "width": 256})

Note: I don't know whether the dimensions in this timm example are in (h, w) or (w, h) order.
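
For instance, a sketch that overrides both steps (using the facebook/dinov2-base checkpoint purely for illustration; the defaults come from that checkpoint's preprocessor_config.json):

from transformers import AutoImageProcessor

checkpoint = "facebook/dinov2-base"  # illustrative checkpoint
image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    size={"shortest_edge": 320},               # controls the resize step
    crop_size={"height": 320, "width": 256},   # controls the center-crop step
)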

lombardata commented 6 months ago

Hi @amyeroberts, thank you very much for your quick reply. In that case, what is the "meaning" of the image_size parameter in the model config file? Let's say my images are 1080 x 1920 and I process them with:

image_processor = AutoImageProcessor.from_pretrained(checkpoint, crop_size={"height": 1080, "width": 1920})

If I then keep the default parameter "image_size": 518 in the model's config.json, what would the behaviour during training be? Thank you a lot!

amyeroberts commented 6 months ago

You're hitting on one of the tricky coupling issues between models and their data!

For the image processor, crop_size and size control the processing logic. This is independent of the model and modifying the behaviour of the image processor won't automatically update the necessary model params.

For the DinoV2 model, image_size refers to the input image size, i.e. the dimensions of the processed images. It controls how the patches are extracted when creating the embeddings. However, the model interpolates the position embeddings if the input image is of a different resolution, so for inference you should be able to pass in differently sized images and run things fine.
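
For example, a minimal inference sketch with random pixel values, just to illustrate that a non-square input at a resolution other than 518 runs through the model (dimensions chosen here as multiples of the 14-pixel patch size):

import torch
from transformers import Dinov2Model

model = Dinov2Model.from_pretrained("facebook/dinov2-base")

# Dummy batch at a non-square resolution: 504 = 36 * 14, 896 = 64 * 14
pixel_values = torch.randn(1, 3, 504, 896)

with torch.no_grad():
    outputs = model(pixel_values=pixel_values)

# 1 CLS token + 36 * 64 = 2304 patch tokens -> expected torch.Size([1, 2305, 768])
print(outputs.last_hidden_state.shape)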

If you want to train a model, then I'd suggest aligning the processor and model configurations.

To align the two, you'll need to do something like this:

from transformers import Dinov2Config, Dinov2ForImageClassification, AutoImageProcessor

image_height, image_width = 1080, 1920

checkpoint = "facebook/dinov2-base"

# Create a new model with randomly initialized weights
model_config = Dinov2Config.from_pretrained(checkpoint, image_size=(image_height, image_width))
model = Dinov2ForImageClassification(model_config)
# Align the processor's output resolution with the model's image_size
image_processor = AutoImageProcessor.from_pretrained(
    checkpoint,
    size={"height": image_height, "width": image_width},
    crop_size={"height": image_height, "width": image_width},
)
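
As a quick follow-up check (a sketch with a dummy PIL image, reusing image_processor, model, image_height and image_width from the snippet above), you can confirm that the processor now emits tensors at the resolution the model config was built with:

from PIL import Image

# Dummy 1920x1080 image (PIL takes (width, height)) standing in for a real sample
image = Image.new("RGB", (image_width, image_height))
inputs = image_processor(image, return_tensors="pt")

print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 1080, 1920])
print(model.config.image_size)       # expected: (1080, 1920)
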
lombardata commented 6 months ago

Thank you very much @amyeroberts for your complete reply. Looking at the source code of Dinov2Config, I found that the image_size parameter must be an int (and not a dict of height and width):

image_size (int, *optional*, defaults to 224)

so I don't know whether we are allowed to pass rectangular images to this specific model. Moreover, the AutoImageProcessor (which in our case is a BitImageProcessor) should, as you said, accept an input size: Dict[str, int] = None and handle it like this:

if "shortest_edge" in size:
    size = size["shortest_edge"]
    default_to_square = False
elif "height" in size and "width" in size:
    size = (size["height"], size["width"])
else:
    raise ValueError("Size must contain either 'shortest_edge' or 'height' and 'width'.")

But when I try to instantiate it:

image_processor2 = AutoImageProcessor.from_pretrained(
    checkpoint_name,
    size={"height": 720, "width": 1080},
    # size={"shortest_edge": 518},
    do_center_crop=True,
    do_resize=True,
    do_rescale=True,
    do_normalize=True,
)

I get the following error :

File .../lib/python3.8/site-packages/transformers/models/bit/image_processing_bit.py:152, in BitImageProcessor.resize(self, image, size, resample, data_format, input_data_format, **kwargs)
        size = get_size_dict(size, default_to_square=False)
        if "shortest_edge" not in size:
-->         raise ValueError(f"The `size` parameter must contain the key `shortest_edge`. Got {size.keys()}")
        output_size = get_resize_output_image_size(
            image, size=size["shortest_edge"], default_to_square=False, input_data_format=input_data_format
        )

This is strange, since in the corresponding source (https://github.com/huggingface/transformers/blob/main/src/transformers/models/bit/image_processing_bit.py) the error is different: raise ValueError("Size must contain either 'shortest_edge' or 'height' and 'width'."). Do you know where I'm going wrong? Sorry for bothering you, but I'm a little bit lost :)

amyeroberts commented 6 months ago

Hi @lombardata,

image_size for the model is an int, whereas size for the image processors should be a dictionary. This is because, by the time images are passed to the model, their (h, w) dimensions are fixed.

However, when processing an image, the output size isn't always fixed: the output height and width can be calculated from the input dimensions. For example, size={"shortest_edge": s} will resize the image so that its shortest edge matches s, and rescale the other edge to preserve the input aspect ratio.
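
As a rough numeric sketch of that behaviour (plain arithmetic, not the library's exact resizing/rounding code), using the dimensions from this thread:

# 1080 x 1920 input with size={"shortest_edge": 518}
height, width = 1080, 1920
shortest_edge = 518

scale = shortest_edge / min(height, width)   # 518 / 1080 ~= 0.48
new_height = round(height * scale)           # 518
new_width = round(width * scale)             # ~921, aspect ratio preserved
print(new_height, new_width)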

With regards to the error you're encountering, which version of transformers are you running? I was able to run the example snippet without error on the most recent version.
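
If it helps, a trivial way to check which version is installed (standard Python, nothing transformers-specific):

import transformers

# If this prints an older release, upgrading (e.g. pip install -U transformers)
# may resolve the size error above
print(transformers.__version__)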

github-actions[bot] commented 5 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

lombardata commented 5 months ago

Hi @amyeroberts, thank you very much for your reply.

With regards to the error you're encountering, which version of transformers are you running? I was able to run the example snippet without error on the most recent version.

I was running version 4.34.1; after upgrading to the latest version it's working fine. Thanks!