huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

ValueError: Unsupported number of image dimensions: 2 - An error during embedding Image data #25694

Open UmarIgan opened 1 year ago

UmarIgan commented 1 year ago

System Info

I am facing an issue while encoding an image dataset with facebook/dino-vits16. I had hit this issue with grayscale images before as well, although it worked fine with the Bingsu/Human_Action_Recognition dataset.

Versions:

transformers==4.32.0
torch==2.0.1+cu118
datasets==2.14.4

The error:

Some weights of ViTModel were not initialized from the model checkpoint at facebook/dino-vits16 and are newly initialized: ['pooler.dense.weight', 'pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Map: 0%
2/10000 [00:00<40:18, 4.13 examples/s]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
[<ipython-input-30-0547920c10ef>](https://localhost:8080/#) in <cell line: 22>()
     20     return batch
     21 
---> 22 dataset_train = dataset_train.map(get_embeddings)

8 frames
[/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    590             self: "Dataset" = kwargs.pop("self")
    591         # apply actual function
--> 592         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    593         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    594         for dataset in datasets:

[/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in wrapper(*args, **kwargs)
    555         }
    556         # apply actual function
--> 557         out: Union["Dataset", "DatasetDict"] = func(self, *args, **kwargs)
    558         datasets: List["Dataset"] = list(out.values()) if isinstance(out, dict) else [out]
    559         # re-apply format to the output

[/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in map(self, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, load_from_cache_file, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, num_proc, suffix_template, new_fingerprint, desc)
   3095                     desc=desc or "Map",
   3096                 ) as pbar:
-> 3097                     for rank, done, content in Dataset._map_single(**dataset_kwargs):
   3098                         if done:
   3099                             shards_done += 1

[/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in _map_single(shard, function, with_indices, with_rank, input_columns, batched, batch_size, drop_last_batch, remove_columns, keep_in_memory, cache_file_name, writer_batch_size, features, disable_nullable, fn_kwargs, new_fingerprint, rank, offset)
   3448                     _time = time.time()
   3449                     for i, example in shard_iterable:
-> 3450                         example = apply_function_on_filtered_inputs(example, i, offset=offset)
   3451                         if update_data:
   3452                             if i == 0:

[/usr/local/lib/python3.10/dist-packages/datasets/arrow_dataset.py](https://localhost:8080/#) in apply_function_on_filtered_inputs(pa_inputs, indices, check_same_num_examples, offset)
   3351             if with_rank:
   3352                 additional_args += (rank,)
-> 3353             processed_inputs = function(*fn_args, *additional_args, **fn_kwargs)
   3354             if isinstance(processed_inputs, LazyDict):
   3355                 processed_inputs = {

[<ipython-input-30-0547920c10ef>](https://localhost:8080/#) in get_embeddings(batch)
     14 
     15 def get_embeddings(batch):
---> 16     inputs = processor(images=batch['image'], return_tensors="pt").to(device)
     17     with torch.no_grad():
     18         outputs = model(**inputs).last_hidden_state.mean(dim=1).cpu().numpy()

[/usr/local/lib/python3.10/dist-packages/transformers/image_processing_utils.py](https://localhost:8080/#) in __call__(self, images, **kwargs)
    544     def __call__(self, images, **kwargs) -> BatchFeature:
    545         """Preprocess an image or a batch of images."""
--> 546         return self.preprocess(images, **kwargs)
    547 
    548     def preprocess(self, images, **kwargs) -> BatchFeature:

[/usr/local/lib/python3.10/dist-packages/transformers/models/vit/image_processing_vit.py](https://localhost:8080/#) in preprocess(self, images, do_resize, size, resample, do_rescale, rescale_factor, do_normalize, image_mean, image_std, return_tensors, data_format, input_data_format, **kwargs)
    232         if input_data_format is None:
    233             # We assume that all images have the same channel dimension format.
--> 234             input_data_format = infer_channel_dimension_format(images[0])
    235 
    236         if do_resize:

[/usr/local/lib/python3.10/dist-packages/transformers/image_utils.py](https://localhost:8080/#) in infer_channel_dimension_format(image, num_channels)
    168         first_dim, last_dim = 1, 3
    169     else:
--> 170         raise ValueError(f"Unsupported number of image dimensions: {image.ndim}")
    171 
    172     if image.shape[first_dim] in num_channels:

ValueError: Unsupported number of image dimensions: 2

Who can help?

@amyeroberts

Information

Tasks

Reproduction

from transformers import ViTImageProcessor, ViTModel
from datasets import load_dataset, Dataset
import torch

dataset_train = load_dataset(
    'ashraq/fashion-product-images-small', split='train[:10000]'
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = ViTImageProcessor.from_pretrained('facebook/dino-vits16')
model = ViTModel.from_pretrained('facebook/dino-vits16').to(device)  # move the model to the same device as the inputs

def get_embeddings(batch):
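    # The processor call below is where the ValueError above is raised for 2-D grayscale images.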
    inputs = processor(images=batch['image'], return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs).last_hidden_state.mean(dim=1).cpu().numpy()
    batch['embeddings'] = outputs
    return batch

dataset_train = dataset_train.map(get_embeddings)

Expected behavior

The expected behavior was to obtain embeddings.

ArthurZucker commented 1 year ago

cc @amyeroberts and @rafaelpadilla

rafaelpadilla commented 1 year ago

Hi @UmarIgan

Thank you for bringing this to our attention!

I've tested your code and indeed encountered the same error. I'm on it and will work towards a solution.

UmarIgan commented 1 year ago

Thanks @rafaelpadilla. As I understand it, vision transformers can't encode grayscale images either. I tried to work around it in the dataset by transforming each image to add a new channel, but with no luck. Is there a way to overcome this?

ajay-f22 commented 4 months ago

I was facing the same error and fixed it by converting images to RGB mode: `image = image.convert('RGB')`
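For reference, a minimal sketch of how that workaround could be applied inside the mapped function from the reproduction above (assuming the dataset's image column holds PIL images; this is just an illustration, not an official fix):

from transformers import ViTImageProcessor, ViTModel
from datasets import load_dataset
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
processor = ViTImageProcessor.from_pretrained('facebook/dino-vits16')
model = ViTModel.from_pretrained('facebook/dino-vits16').to(device)

dataset_train = load_dataset(
    'ashraq/fashion-product-images-small', split='train[:10000]'
)

def get_embeddings(example):
    # Convert to 3-channel RGB so the processor can infer the channel
    # dimension; 2-D grayscale arrays otherwise raise the ValueError.
    image = example['image'].convert('RGB')
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model(**inputs).last_hidden_state.mean(dim=1).cpu().numpy()
    example['embeddings'] = outputs
    return example

dataset_train = dataset_train.map(get_embeddings)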

Kkordik commented 4 months ago

> I was facing the same error and fixed it by converting images to RGB mode: `image = image.convert('RGB')`

Thank you, works for me!

codybum commented 1 month ago

But if you convert from 16-bit single-channel to 8-bit RGB, you lose intensity resolution. It does not seem like a real solution.
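If preserving the 16-bit range matters, one option is to skip the processor's PIL-based resize and do the preprocessing in torch. A minimal sketch, reusing the processor, model, and device from the reproduction above, and assuming the raw image is a single-channel uint16 array and the model expects 224x224 inputs (embed_16bit is just an illustrative name):

import numpy as np
import torch
import torch.nn.functional as F

def embed_16bit(example):
    # Scale 16-bit intensities to [0, 1] as float32, never quantizing to uint8.
    arr = np.asarray(example['image'], dtype=np.float32) / 65535.0
    pixel_values = torch.from_numpy(arr)[None, None]      # (1, 1, H, W)
    pixel_values = pixel_values.repeat(1, 3, 1, 1)        # replicate the channel
    pixel_values = F.interpolate(
        pixel_values, size=(224, 224), mode="bilinear", align_corners=False
    )
    # Normalize with the processor's mean/std so the model sees familiar statistics.
    mean = torch.tensor(processor.image_mean).view(1, 3, 1, 1)
    std = torch.tensor(processor.image_std).view(1, 3, 1, 1)
    pixel_values = ((pixel_values - mean) / std).to(device)
    with torch.no_grad():
        out = model(pixel_values=pixel_values).last_hidden_state.mean(dim=1)
    example['embeddings'] = out.cpu().numpy()
    return example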