ValueError from NougatImageProcessor using example from docs

lucasmccabe commented 11 months ago

System Info

transformers version: 4.34.0
Platform: macOS-13.5.2-arm64-arm-64bit
Python version: 3.10.12
Huggingface_hub version: 0.16.4
Safetensors version: 0.3.2
Accelerate version: not installed
Accelerate config: not found
PyTorch version (GPU?): 2.0.1 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: no
Using distributed or parallel set-up in script?: no

Who can help?

@amyeroberts @ArthurZucker

Information

[X] The official example scripts
[ ] My own modified scripts

Tasks

[X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
[ ] My own task or dataset (give details below)

Reproduction

I am running the example code from the Nougat documentation in a Jupyter notebook:

from huggingface_hub import hf_hub_download
import re
from PIL import Image

from transformers import NougatProcessor, VisionEncoderDecoderModel
from datasets import load_dataset
import torch

processor = NougatProcessor.from_pretrained("facebook/nougat-base")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
# prepare PDF image for the model
filepath = hf_hub_download(repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_paper.png", repo_type="dataset")
image = Image.open(filepath)
pixel_values = processor(image, return_tensors="pt").pixel_values

# generate transcription (here we only generate 30 tokens)
outputs = model.generate(
    pixel_values.to(device),
    min_length=1,
    max_new_tokens=30,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)

sequence = processor.batch_decode(outputs, skip_special_tokens=True)[0]
sequence = processor.post_process_generation(sequence, fix_markdown=False)
# note: we're using repr here so that the \n characters are printed; feel free to just print the sequence
print(repr(sequence))

Expected behavior

Expected Output

The documentation indicates the above should return the following string:

'\n\n# Nougat: Neural Optical Understanding for Academic Documents\n\n Lukas Blecher\n\nCorrespondence to: lblecher@'

Observed Output

Running the above as a single code block raises the following error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 17
     15 filepath = hf_hub_download(repo_id="hf-internal-testing/fixtures_docvqa", filename="nougat_paper.png", repo_type="dataset")
     16 image = Image.open(filepath)
---> 17 pixel_values = processor(image, return_tensors="pt").pixel_values
     19 # generate transcription (here we only generate 30 tokens)
     20 outputs = model.generate(
     21     pixel_values.to(device),
     22     min_length=1,
     23     max_new_tokens=30,
     24     bad_words_ids=[[processor.tokenizer.unk_token_id]],
     25 )

File /opt/miniconda3/envs/dogger-dev/lib/python3.10/site-packages/transformers/models/nougat/processing_nougat.py:91, in NougatProcessor.__call__(self, images, text, do_crop_margin, do_resize, size, resample, do_thumbnail, do_align_long_axis, do_pad, do_rescale, rescale_factor, do_normalize, image_mean, image_std, data_format, input_data_format, text_pair, text_target, text_pair_target, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
     88     raise ValueError("You need to specify either an `images` or `text` input to process.")
     90 if images is not None:
---> 91     inputs = self.image_processor(
     92         images,
     93         do_crop_margin=do_crop_margin,
     94         do_resize=do_resize,
     95         size=size,
     96         resample=resample,
     97         do_thumbnail=do_thumbnail,
     98         do_align_long_axis=do_align_long_axis,
     99         do_pad=do_pad,
    100         do_rescale=do_rescale,
    101         rescale_factor=rescale_factor,
    102         do_normalize=do_normalize,
    103         image_mean=image_mean,
    104         image_std=image_std,
    105         return_tensors=return_tensors,
    106         data_format=data_format,
    107         input_data_format=input_data_format,
    108     )
    109 if text is not None:
    110     encodings = self.tokenizer(
    111         text,
    112         text_pair=text_pair,
   (...)
    129         verbose=verbose,
    130     )

File /opt/miniconda3/envs/dogger-dev/lib/python3.10/site-packages/transformers/image_processing_utils.py:546, in BaseImageProcessor.__call__(self, images, **kwargs)
    544 def __call__(self, images, **kwargs) -> BatchFeature:
    545     """Preprocess an image or a batch of images."""
--> 546     return self.preprocess(images, **kwargs)

File /opt/miniconda3/envs/dogger-dev/lib/python3.10/site-packages/transformers/models/nougat/image_processing_nougat.py:505, in NougatImageProcessor.preprocess(self, images, do_crop_margin, do_resize, size, resample, do_thumbnail, do_align_long_axis, do_pad, do_rescale, rescale_factor, do_normalize, image_mean, image_std, return_tensors, data_format, input_data_format, **kwargs)
    499 if do_normalize:
    500     images = [
    501         self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
    502         for image in images
    503     ]
--> 505 images = [
    506     to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
    507 ]
    509 data = {"pixel_values": images}
    510 return BatchFeature(data=data, tensor_type=return_tensors)

File /opt/miniconda3/envs/dogger-dev/lib/python3.10/site-packages/transformers/models/nougat/image_processing_nougat.py:506, in <listcomp>(.0)
    499 if do_normalize:
    500     images = [
    501         self.normalize(image=image, mean=image_mean, std=image_std, input_data_format=input_data_format)
    502         for image in images
    503     ]
    505 images = [
--> 506     to_channel_dimension_format(image, data_format, input_channel_dim=input_data_format) for image in images
    507 ]
    509 data = {"pixel_values": images}
    510 return BatchFeature(data=data, tensor_type=return_tensors)

File /opt/miniconda3/envs/dogger-dev/lib/python3.10/site-packages/transformers/image_transforms.py:78, in to_channel_dimension_format(image, channel_dim, input_channel_dim)
     75 if input_channel_dim is None:
     76     input_channel_dim = infer_channel_dimension_format(image)
---> 78 target_channel_dim = ChannelDimension(channel_dim)
     79 if input_channel_dim == target_channel_dim:
     80     return image

File /opt/miniconda3/envs/dogger-dev/lib/python3.10/enum.py:385, in EnumMeta.__call__(cls, value, names, module, qualname, type, start)
    360 """
    361 Either returns an existing member, or creates a new enum class.
    362 
   (...)
    382 `type`, if set, will be mixed in as the first base class.
    383 """
    384 if names is None:  # simple value lookup
--> 385     return cls.__new__(cls, value)
    386 # otherwise, functional API: we're creating a new Enum type
    387 return cls._create_(
    388         value,
    389         names,
   (...)
    393         start=start,
    394         )

File /opt/miniconda3/envs/dogger-dev/lib/python3.10/enum.py:718, in Enum.__new__(cls, value)
    716         if not isinstance(exc, ValueError):
    717             exc.__context__ = ve_exc
--> 718         raise exc
    719 finally:
    720     # ensure all variables that could hold an exception are destroyed
    721     exc = None

File /opt/miniconda3/envs/dogger-dev/lib/python3.10/enum.py:700, in Enum.__new__(cls, value)
    698 try:
    699     exc = None
--> 700     result = cls._missing_(value)
    701 except Exception as e:
    702     exc = e

File /opt/miniconda3/envs/dogger-dev/lib/python3.10/site-packages/transformers/utils/generic.py:433, in ExplicitEnum._missing_(cls, value)
    431 @classmethod
    432 def _missing_(cls, value):
--> 433     raise ValueError(
    434         f"{value} is not a valid {cls.__name__}, please select one of {list(cls._value2member_map_.keys())}"
    435     )

ValueError: ChannelDimension.FIRST is not a valid ChannelDimension, please select one of ['channels_first', 'channels_last']

lucasmccabe commented 11 months ago

Note: explicitly passing data_format="channels_first" when calling NougatProcessor resolves this, although it's not clear to me why, since that is ostensibly the default anyway.
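
One possible explanation, offered as an assumption and not confirmed anywhere in this thread: the error text starts with "ChannelDimension.FIRST", which is what you would see if the default reaching the enum lookup were the string literal "ChannelDimension.FIRST" and not the enum member itself. The sketch below reproduces that pattern with stand-in classes only. `ExplicitEnum` and `ChannelDimension` are re-declared locally to mirror the names in the traceback, and `fake_process` is a hypothetical function, not the transformers API.

```python
from enum import Enum


class ExplicitEnum(str, Enum):
    """Stand-in mirroring the `_missing_` hook shown in the traceback."""

    @classmethod
    def _missing_(cls, value):
        raise ValueError(
            f"{value} is not a valid {cls.__name__}, "
            f"please select one of {list(cls._value2member_map_.keys())}"
        )


class ChannelDimension(ExplicitEnum):
    FIRST = "channels_first"
    LAST = "channels_last"


# Hypothetical stand-in for the processor call. ASSUMPTION: the default is the
# string literal "ChannelDimension.FIRST" instead of the enum member.
def fake_process(data_format="ChannelDimension.FIRST"):
    return ChannelDimension(data_format)


def try_call(**kwargs):
    """Return the resolved member name, or the error text if the lookup fails."""
    try:
        return fake_process(**kwargs).name
    except ValueError as err:
        return str(err)


default_result = try_call()                                   # assumed bad default
explicit_result = try_call(data_format="channels_first")      # the workaround
member_result = try_call(data_format=ChannelDimension.FIRST)  # a real enum member

print(default_result)
print(explicit_result, member_result)
```

Under that assumption, the default call fails with the same message as the traceback, while passing "channels_first" (or an actual enum member) resolves to `FIRST`. That would be consistent with the workaround succeeding even though the intended default is the same value.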

ArthurZucker commented 11 months ago

Hello, thanks for reporting. This was fixed in #26608.

