Open alem-147 opened 6 days ago
cc @qubvel @zucchini-nlp
To be able to pass 1-channel images, the image processor should have do_convert_rgb=True so that PIL images can be converted to 3-channel images. ViT doesn't have it, but I think we can add it in a similar way to how it is done in CLIP. Also, the input_data_format should be just None, not the string "none".
@alem-147 would you like to open a PR to support RGB conversion in the ViT image processor?
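For illustration, here is a minimal NumPy sketch of what converting a 1-channel image to 3 channels amounts to: the single grayscale plane is replicated across three channels. The helper name grayscale_to_rgb is hypothetical and is not part of transformers; the actual image processor may do the conversion differently (for example through PIL).

```python
import numpy as np


def grayscale_to_rgb(image: np.ndarray) -> np.ndarray:
    """Replicate a (height, width) grayscale array into (height, width, 3).

    Hypothetical helper for illustration only, not a transformers API.
    """
    if image.ndim != 2:
        raise ValueError(f"expected a 2-D grayscale array, got shape {image.shape}")
    # Copy the single plane into three identical channels along a new last axis.
    return np.stack([image, image, image], axis=-1)


# A blank 28x28 array, the size of an MNIST digit.
gray = np.zeros((28, 28), dtype=np.uint8)
rgb = grayscale_to_rgb(gray)
print(rgb.shape)
```

After this step the array has an explicit channel axis, so code that only understands channels-first or channels-last layouts can handle it.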
Hi @alem-147 @zucchini-nlp, a PR that adds do_convert_rgb to the ViT image processor was recently merged into main:
https://github.com/huggingface/transformers/pull/34523
To use the latest code, please install transformers from source:
pip install -U git+https://github.com/huggingface/transformers
You can enable it as follows:
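from transformers import AutoImageProcessor

teacher_processor = AutoImageProcessor.from_pretrained(
    "farleyknight-org-username/vit-base-mnist", do_convert_rgb=True
)
# `examples` is a batch from the dataset, e.g. inside a `map` function
processed_inputs = teacher_processor(examples["image"])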
Thank you both for your comments. @zucchini-nlp, None is the default value, in which case the channel format is inferred by the image processor. According to the docstring, 'none' is meant for an image with no channel dimension, so it seems this should be handled by the processor itself when 'none' is passed. Even in light of @qubvel's message, I believe this still needs to be fixed. I will try to get around to it at some point.
def preprocess(
.
.
.
input_data_format (`ChannelDimension` or `str`, *optional*):
The channel dimension format for the input image. If unset, the channel dimension format is inferred
from the input image. Can be one of:
- `"channels_first"` or `ChannelDimension.FIRST`: image in (num_channels, height, width) format.
- `"channels_last"` or `ChannelDimension.LAST`: image in (height, width, num_channels) format.
- `"none"` or `ChannelDimension.NONE`: image in (height, width) format.
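As a rough sketch of how the three documented formats could be normalised before resizing, the snippet below maps each one to a (height, width, num_channels) array. Note the assumptions: ChannelFormat is a local stand-in enum and to_channels_last is a hypothetical helper, neither is the transformers implementation, and the string values are copied from the docstring above.

```python
from enum import Enum

import numpy as np


class ChannelFormat(Enum):
    """Local stand-in for the formats named in the docstring (not the transformers enum)."""

    FIRST = "channels_first"
    LAST = "channels_last"
    NONE = "none"


def to_channels_last(image: np.ndarray, input_data_format) -> np.ndarray:
    """Return `image` as (height, width, num_channels) for any of the three formats.

    Hypothetical helper for illustration only.
    """
    # Accepts either the string value or an enum member.
    fmt = ChannelFormat(input_data_format)
    if fmt is ChannelFormat.NONE:
        if image.ndim != 2:
            raise ValueError(f"'none' expects a 2-D image, got shape {image.shape}")
        # Add a trailing channel axis of size 1.
        return image[..., np.newaxis]
    if image.ndim != 3:
        raise ValueError(f"{fmt.value} expects a 3-D image, got shape {image.shape}")
    if fmt is ChannelFormat.FIRST:
        # (num_channels, height, width) -> (height, width, num_channels)
        return np.transpose(image, (1, 2, 0))
    return image


mnist_like = np.zeros((28, 28), dtype=np.uint8)
print(to_channels_last(mnist_like, "none").shape)
```

The point of the sketch is only that a (height, width) input needs its own branch; a resize step that checks for channels-first and channels-last alone has nothing to match it against.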
System Info
transformers version 4.46.3
Who can help?
No response
Reproduction
Expected behavior
When using the MNIST dataset from "ylecun/mnist" and the fine-tuned ViT from "farleyknight-org-username/vit-base-mnist", which was trained on the same dataset, preprocessing with the loaded image processor fails because it does not accept the shape of the image. Code is adapted from the Knowledge Distillation for Computer Vision guide.
The docstring for preprocess implies that this should be sufficient to allow the (height, width) format, but during resizing there is only handling for ChannelDimension.FIRST and ChannelDimension.LAST, not for 'none' or ChannelDimension.NONE.
The code fails under the following call stack: