HW based PIL images not being handled in pretrained image processor

alem-147 commented 6 days ago

System Info

transformers version 4.46.3

Who can help?

No response

Information

[ ] The official example scripts
[X] My own modified scripts

Tasks

[ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[X] My own task or dataset (give details below)

Reproduction

from datasets import load_dataset
from transformers import AutoImageProcessor
import numpy as np

dataset = load_dataset("ylecun/mnist")
teacher_processor = AutoImageProcessor.from_pretrained("farleyknight-org-username/vit-base-mnist")

def process(examples):
    # MNIST images are single-channel PIL images, i.e. (height, width) with no channel axis
    processed_inputs = teacher_processor(examples["image"], input_data_format="none")
    return processed_inputs

Expected behavior

When using the MNIST dataset from "ylecun/mnist" together with the ViT fine-tuned on that same dataset from "farleyknight-org-username/vit-base-mnist", the loaded image processor fails because it does not accept the shape of the images. The code is adapted from Knowledge Distillation for Computer Vision.

The docstring for preprocess implies that passing input_data_format="none" should be sufficient to allow the (height, width) format, but during resizing there is only handling for ChannelDimension.FIRST and ChannelDimension.LAST, not for "none" or ChannelDimension.NONE.

The code fails with the following call stack:

Function	File	Line
to_channel_dimension_format	image_transforms.py	93
to_pil_image	image_transforms.py	204
resize	image_transforms.py	338
resize	image_processing_vit.py	138
preprocess	image_processing_vit.py	250

Rocketknight1 commented 6 days ago

cc @qubvel @zucchini-nlp

zucchini-nlp commented 5 days ago

To be able to pass 1-channel images, the image processor should have do_convert_rgb=True so that PIL images can be converted to 3-channel images. ViT doesn't have it, but I think we can add it in a similar way to how it is done in CLIP. Also, input_data_format should be just None, not the string "none".
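As a rough illustration of what an RGB conversion does to a 1-channel image, the NumPy sketch below copies the single grayscale plane into three identical channels. `gray_to_rgb` is a hypothetical helper written for this example, not part of transformers; for a PIL image in mode "L", `image.convert("RGB")` produces the equivalent result.

```python
import numpy as np


def gray_to_rgb(image: np.ndarray) -> np.ndarray:
    """Hypothetical helper: turn a (height, width) array into (height, width, 3)."""
    if image.ndim != 2:
        raise ValueError(f"expected a 2D (height, width) array, got shape {image.shape}")
    # Add a trailing channel axis, then repeat the grayscale plane three times.
    return np.repeat(image[:, :, np.newaxis], 3, axis=2)


# A dummy 28x28 grayscale image, the same size as an MNIST digit.
gray = np.zeros((28, 28), dtype=np.uint8)
gray[10:18, 10:18] = 255

rgb = gray_to_rgb(gray)
print(rgb.shape)  # (28, 28, 3)
```

After this step the image has an explicit channel axis, so the channels-last handling applies to it.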

@alem-147 would you like to open a PR to support RGB conversion in VIT image processor?

qubvel commented 5 days ago

Hi @alem-147 @zucchini-nlp, a PR that adds do_convert_rgb to the ViT image processor was recently merged into main:

https://github.com/huggingface/transformers/pull/34523

To use the latest code, please install transformers from source:

pip install -U git+https://github.com/huggingface/transformers

You can enable it as follows:

teacher_processor = AutoImageProcessor.from_pretrained(
    "farleyknight-org-username/vit-base-mnist", do_convert_rgb=True
)
processed_inputs = teacher_processor(examples["image"])

alem-147 commented 5 days ago

Thank you both for your comments. @zucchini-nlp None is the default value, in which case the channel dimension format is inferred by the image processor. According to the docstring, 'none' is meant for an image with no channel dimension, so it seems this case should be handled whenever 'none' is passed. I believe that, even in light of @qubvel's message, this still needs to be fixed. I will try to get around to it at some point.

  def preprocess(
    .
    .
    .
    input_data_format (`ChannelDimension` or `str`, *optional*):
        The channel dimension format for the input image. If unset, the channel dimension format is inferred
        from the input image. Can be one of:
        - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
        - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
        - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
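For illustration only, here is a minimal sketch of how the three formats listed in this docstring could be mapped to one common layout, including the (height, width) case. `to_channels_first` is a hypothetical helper that takes the string form of the format; it is not how transformers implements `to_channel_dimension_format`, and treating a "none" image as a single channel is an assumption of this sketch.

```python
import numpy as np


def to_channels_first(image: np.ndarray, input_data_format: str) -> np.ndarray:
    """Hypothetical helper: return the image as (num_channels, height, width)."""
    if input_data_format == "channels_first":
        # Already (num_channels, height, width).
        return image
    if input_data_format == "channels_last":
        # (height, width, num_channels) -> (num_channels, height, width)
        return np.transpose(image, (2, 0, 1))
    if input_data_format == "none":
        if image.ndim != 2:
            raise ValueError(f"'none' expects a 2D image, got shape {image.shape}")
        # Assumption: a (height, width) image is treated as one channel.
        return image[np.newaxis, :, :]
    raise ValueError(f"unsupported input_data_format: {input_data_format!r}")


hw_image = np.zeros((28, 28), dtype=np.uint8)        # (height, width)
hwc_image = np.zeros((28, 28, 3), dtype=np.uint8)    # (height, width, num_channels)

print(to_channels_first(hw_image, "none").shape)            # (1, 28, 28)
print(to_channels_first(hwc_image, "channels_last").shape)  # (3, 28, 28)
```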
