LayoutLMv2Processor does not accept the XLMRobertaTokenizerFast

JStumpp commented 2 years ago

Environment info

transformers version: 4.11.3
Platform: Linux-4.19.128-microsoft-standard-x86_64-with-glibc2.2.5
Python version: 3.8.12
PyTorch version (GPU?): 1.9.1+cu102 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?: no
Using distributed or parallel set-up in script?: no

Who can help

@NielsRogge

Information

Model I am using: LayoutXLM

The problem arises when using:

[x] the official example scripts: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

The task I am working on is:

[x] an official task: SequenceClassification

To reproduce

Steps to reproduce the behavior:

When we replace the layoutlmv2 tokenizer in cell 8 of https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb

from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor
feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

with the layoutxlm tokenizer as described in https://huggingface.co/transformers/model_doc/layoutxlm.html

from transformers import LayoutLMv2FeatureExtractor, LayoutLMv2Tokenizer, LayoutLMv2Processor, AutoTokenizer
feature_extractor = LayoutLMv2FeatureExtractor()
tokenizer = AutoTokenizer.from_pretrained('microsoft/layoutxlm-base')
processor = LayoutLMv2Processor(feature_extractor, tokenizer)

the following error occurs

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/tmp/ipykernel_3433/3030379235.py in <module>
      5 tokenizer = AutoTokenizer.from_pretrained('microsoft/layoutxlm-base')
      6 #tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
----> 7 processor = LayoutLMv2Processor(feature_extractor, tokenizer)

~/.cache/pypoetry/virtualenvs/stp-experiment0-RgVp7VCN-py3.8/lib/python3.8/site-packages/transformers/models/layoutlmv2/processing_layoutlmv2.py in __init__(self, feature_extractor, tokenizer)
     54             )
     55         if not isinstance(tokenizer, (LayoutLMv2Tokenizer, LayoutLMv2TokenizerFast)):
---> 56             raise ValueError(
     57                 f"`tokenizer` has to be of type {LayoutLMv2Tokenizer.__class__} or {LayoutLMv2TokenizerFast.__class__}, but is {type(tokenizer)}"
     58             )

ValueError: `tokenizer` has to be of type <class 'type'> or <class 'type'>, but is <class 'transformers.models.xlm_roberta.tokenization_xlm_roberta_fast.XLMRobertaTokenizerFast'>

It looks like the LayoutLMv2Processor does not accept the XLMRobertaTokenizerFast.
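A side note on the message itself: it prints `<class 'type'>` twice instead of naming the expected tokenizer classes. This happens because `.__class__` is applied to a class object, and the class of a class is `type`. A minimal sketch of the effect, using empty stand-in classes (not the real transformers classes) so it runs without the library:

```python
# Stand-in classes for illustration only; the real ones live in transformers.
class LayoutLMv2Tokenizer:
    pass


class XLMRobertaTokenizerFast:
    pass


tokenizer = XLMRobertaTokenizerFast()

# `.__class__` on a class object returns its metaclass, which is `type`,
# so the expected tokenizer class is never named in the message.
as_reported = (
    f"`tokenizer` has to be of type {LayoutLMv2Tokenizer.__class__}, "
    f"but is {type(tokenizer)}"
)

# Formatting the class object itself names the expected class.
clearer = (
    f"`tokenizer` has to be of type {LayoutLMv2Tokenizer}, "
    f"but is {type(tokenizer)}"
)

print(as_reported)
print(clearer)
```

The type check itself behaves as written; only the wording of the error is affected.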

Expected behavior

LayoutLMv2Processor should accept the XLMRobertaTokenizerFast.

NielsRogge commented 2 years ago

LayoutLMv2Processor currently only supports LayoutLMv2Tokenizer/LayoutLMv2TokenizerFast. It would be a good first issue to add support for a new LayoutXLMTokenizerFast, which is based on XLMRoBERTa and takes into account the bounding box and word label inputs.

kingyiusuen commented 2 years ago

Hi @NielsRogge, I'd like to take a shot at this!

NielsRogge commented 2 years ago

Great! So one would need to add tokenization_layoutxlm.py and tokenization_layoutxlm_fast.py to the LayoutLMv2 folder. These should be near-identical copies of tokenization_xlm_roberta.py and tokenization_xlm_roberta_fast.py (found here), respectively, but with added support for the boxes and word_labels inputs (see tokenization_layoutlmv2.py and tokenization_layoutlmv2_fast.py, respectively, for how these are implemented).
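To illustrate what "support for boxes and word_labels" involves, here is a rough sketch of the alignment step: each word-level bounding box is repeated for every sub-token of that word, and the word label is kept on the first sub-token only, with the rest set to -100 (the default `ignore_index` of PyTorch's cross-entropy loss). The names `toy_subword_tokenize` and `align_boxes_and_labels` are hypothetical, the tokenizer is a toy stand-in for a real subword model, and labelling only the first sub-token is one common convention, assumed here rather than taken from the actual implementation.

```python
from typing import Callable, List, Tuple

Box = List[int]   # [x0, y0, x1, y1], one box per word
PAD_LABEL = -100  # ignored by PyTorch's cross-entropy loss by default


def toy_subword_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in for a subword tokenizer: chunks of up to 3 characters."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


def align_boxes_and_labels(
    words: List[str],
    boxes: List[Box],
    word_labels: List[int],
    tokenize: Callable[[str], List[str]] = toy_subword_tokenize,
) -> Tuple[List[str], List[Box], List[int]]:
    """Expand word-level boxes and labels to token level."""
    if not (len(words) == len(boxes) == len(word_labels)):
        raise ValueError("words, boxes and word_labels must have the same length")

    tokens: List[str] = []
    token_boxes: List[Box] = []
    token_labels: List[int] = []
    for word, box, label in zip(words, boxes, word_labels):
        pieces = tokenize(word)
        if not pieces:
            continue  # skip words that produce no tokens (e.g. empty strings)
        tokens.extend(pieces)
        # Every sub-token inherits the bounding box of its word.
        token_boxes.extend([box] * len(pieces))
        # Only the first sub-token carries the word label.
        token_labels.extend([label] + [PAD_LABEL] * (len(pieces) - 1))
    return tokens, token_boxes, token_labels


tokens, token_boxes, token_labels = align_boxes_and_labels(
    words=["Invoice", "42"],
    boxes=[[10, 10, 80, 20], [90, 10, 110, 20]],
    word_labels=[1, 2],
)
print(tokens)        # ['Inv', 'oic', 'e', '42']
print(token_labels)  # [1, -100, -100, 2]
```

A real tokenizer would also have to assign boxes and labels to special tokens and padding, and handle truncation; those steps are left out of this sketch.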

kingyiusuen commented 2 years ago

Thanks. Any advice on how I should go about writing the unit tests?

NielsRogge commented 2 years ago

For the unit tests, I would define test_tokenization_layoutxlm.py and test_tokenization_layoutxlm_fast.py based on the corresponding tests of LayoutLMv2.

geetu040 commented 2 days ago

This issue has been fixed here, right? Add LayoutXLMProcessor (and LayoutXLMTokenizer, LayoutXLMTokenizerFast) #14115

NielsRogge commented 1 day ago

Thanks indeed, there's now a dedicated LayoutXLMProcessor, so closing this one.
