huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

HW based PIL images not being handled in pretrained image processor #34820

Open alem-147 opened 6 days ago

alem-147 commented 6 days ago

System Info

transformers version 4.46.3

Who can help?

No response

Reproduction

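from datasets import load_dataset
from transformers import AutoImageProcessor
import numpy as np

dataset = load_dataset("ylecun/mnist")
teacher_processor = AutoImageProcessor.from_pretrained("farleyknight-org-username/vit-base-mnist")

def process(examples):
    # MNIST images are single-channel (height, width), so "none" is passed
    # as the channel dimension format; this is the call that fails.
    processed_inputs = teacher_processor(examples["image"], input_data_format="none")
    return processed_inputs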

Expected behavior

When using the MNIST dataset from "ylecun/mnist" together with "farleyknight-org-username/vit-base-mnist", a ViT fine-tuned on the same dataset, the loaded image processor fails because it does not accept the (height, width) shape of the images. The code is adapted from the Knowledge Distillation for Computer Vision guide.

The docstring for preprocess implies that passing input_data_format="none" should be sufficient to allow the (height, width) format. During resizing, however, only ChannelDimension.FIRST and ChannelDimension.LAST are handled, not 'none' or ChannelDimension.NONE.
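One possible interim workaround, sketched here under assumptions rather than as the library's fix, is to give a (height, width) array an explicit channel axis before it reaches the processor. The helper name `add_channel_axis` is hypothetical and only NumPy is assumed. Whether the processor then accepts a single-channel image still depends on its own configuration.

```python
import numpy as np


def add_channel_axis(image: np.ndarray) -> np.ndarray:
    """Return a channels-last array for a 2-D (height, width) image.

    Hypothetical helper for illustration; not part of transformers.
    """
    if image.ndim == 2:
        # (H, W) -> (H, W, 1): append a trailing channel axis.
        return image[..., np.newaxis]
    # Arrays that already have a channel axis are returned unchanged.
    return image


gray = np.zeros((28, 28), dtype=np.uint8)  # MNIST-sized grayscale image
with_channel = add_channel_axis(gray)
print(with_channel.shape)
```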

The code fails with the following call stack:

Function                      File                      Line
to_channel_dimension_format   image_transforms.py       93
to_pil_image                  image_transforms.py       204
resize                        image_transforms.py       338
resize                        image_processing_vit.py   138
preprocess                    image_processing_vit.py   250

Rocketknight1 commented 6 days ago

cc @qubvel @zucchini-nlp

zucchini-nlp commented 5 days ago

To be able to pass 1-channel images, the image processor should have do_convert_rgb=True so that PIL images can be converted to 3-channel images. ViT doesn't have it, but I think we can add it in a similar way to how it is done in CLIP. Also, input_data_format should be just None, not the string "none".

@alem-147 would you like to open a PR to support RGB conversion in VIT image processor?
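Until RGB conversion is available in the processor, the same effect can be approximated on the user side. The following is a minimal sketch, assuming Pillow is installed; the helper name `to_rgb` is hypothetical and is not part of transformers.

```python
from PIL import Image


def to_rgb(images):
    """Convert each PIL image to 3-channel RGB (hypothetical helper)."""
    return [img.convert("RGB") for img in images]


gray = Image.new("L", (28, 28))  # stand-in for one single-channel MNIST digit
rgb = to_rgb([gray])[0]
print(gray.mode, rgb.mode, rgb.size)
```

In the reproduction above, such a helper would be applied to examples["image"] before the processor is called, with input_data_format left unset.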

qubvel commented 5 days ago

Hi @alem-147 @zucchini-nlp, a PR that adds do_convert_rgb to the ViT image processor was recently merged into main.

alem-147 commented 5 days ago

Thank you both for your comments. @zucchini-nlp, None is the default value, in which case the channel dimension format is inferred by the image processor. According to the docstring, 'none' is meant for an image with no channel dimension, so it seems this case should be handled when 'none' is passed. Even in light of @qubvel's message, I believe this still needs to be fixed. I will try to get to it at some point.

  def preprocess(
    .
    .
    .
    input_data_format (`ChannelDimension` or `str`, *optional*):
        The channel dimension format for the input image. If unset, the channel dimension format is inferred
        from the input image. Can be one of:
        - `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
        - `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
        - `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
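To make the three layouts in this docstring concrete, here is a simplified sketch that maps an array's shape to the corresponding string. The function `describe_layout` is hypothetical and deliberately naive (it assumes 1 or 3 channels); it is not the library's inference logic.

```python
import numpy as np


def describe_layout(image: np.ndarray) -> str:
    """Guess the channel layout from the array shape (illustrative only)."""
    if image.ndim == 2:
        # (height, width): no channel axis at all.
        return "none"
    if image.ndim == 3 and image.shape[0] in (1, 3):
        # (num_channels, height, width)
        return "channels_first"
    if image.ndim == 3 and image.shape[-1] in (1, 3):
        # (height, width, num_channels)
        return "channels_last"
    raise ValueError(f"Cannot determine layout for shape {image.shape}")


print(describe_layout(np.zeros((28, 28))))      # MNIST-style grayscale
print(describe_layout(np.zeros((3, 224, 224))))
print(describe_layout(np.zeros((224, 224, 3))))
```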