Closed JStumpp closed 1 day ago
LayoutLMv2Processor currently only supports LayoutLMv2Tokenizer/LayoutLMv2TokenizerFast. It would be a good first issue to add support for a new LayoutXLMTokenizerFast, which is based on XLMRoBERTa and takes into account the bounding box and word label inputs.
Hi @NielsRogge, I'd like to take a shot at this!
Great! So one would need to add tokenization_layoutxlm.py and tokenization_layoutxlm_fast.py to the LayoutLMv2 folder. These should be near-identical copies of tokenization_xlm_roberta.py and tokenization_xlm_roberta_fast.py (found here), respectively, but with added support for boxes and word_labels inputs (you can take a look at tokenization_layoutlmv2.py and tokenization_layoutlmv2_fast.py, respectively, to see how these are implemented).
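To make the boxes and word_labels idea concrete, here is a minimal, library-free sketch of how word-level inputs might be spread over subword tokens. It is not the transformers implementation: toy_subword_split is a hypothetical stand-in for the real subword tokenizer, and the -100 padding label and the only_label_first_subword flag are assumptions modeled on how the LayoutLMv2 tokenizers appear to behave.

```python
from typing import List, Tuple

Box = List[int]  # [x0, y0, x1, y1]; assumed to be normalized already

PAD_LABEL = -100  # assumption: label value that the loss function ignores


def toy_subword_split(word: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in for a real subword tokenizer: fixed-size chunks."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def align_boxes_and_labels(
    words: List[str],
    boxes: List[Box],
    word_labels: List[int],
    only_label_first_subword: bool = True,
) -> Tuple[List[str], List[Box], List[int]]:
    """Expand word-level boxes and labels to one entry per subword token."""
    if not (len(words) == len(boxes) == len(word_labels)):
        raise ValueError("words, boxes and word_labels must have the same length")

    tokens: List[str] = []
    token_boxes: List[Box] = []
    token_labels: List[int] = []
    for word, box, label in zip(words, boxes, word_labels):
        for i, piece in enumerate(toy_subword_split(word)):
            tokens.append(piece)
            # Every subword inherits the bounding box of its source word.
            token_boxes.append(box)
            # Optionally keep the label only on the first subword of a word.
            if i == 0 or not only_label_first_subword:
                token_labels.append(label)
            else:
                token_labels.append(PAD_LABEL)
    return tokens, token_boxes, token_labels


tokens, token_boxes, token_labels = align_boxes_and_labels(
    words=["Invoice", "42"],
    boxes=[[10, 10, 120, 30], [130, 10, 160, 30]],
    word_labels=[1, 0],
)
print(tokens, token_boxes, token_labels)
```

A real tokenizer would also have to handle special tokens, padding, and truncation for the box and label sequences, which this sketch leaves out.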
Thanks. Any advice on how I should go about writing the unit tests?
For the unit tests, I would define test_tokenization_layoutxlm.py and test_tokenization_layoutxlm_fast.py based on the corresponding tests of LayoutLMv2.
This issue has been fixed here, right? Add LayoutXLMProcessor (and LayoutXLMTokenizer, LayoutXLMTokenizerFast) #14115
Thanks indeed, there's now a dedicated LayoutXLMProcessor, so closing this one.
Environment info
transformers version: 4.11.3
Who can help
@NielsRogge
Information
Model I am using: LayoutXLM
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
When we replace the layoutlmv2 tokenizer in cell 8 of https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv2/RVL-CDIP/Fine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb with the layoutxlm tokenizer as described in https://huggingface.co/transformers/model_doc/layoutxlm.html, the following error occurs:
It looks like the LayoutLMv2Processor does not accept the XLMRobertaTokenizerFast.
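The error message itself is not included in the report, so the following is only a sketch of the kind of check that would produce this behavior. It assumes the processor validates the tokenizer type when it is constructed. All classes here are hypothetical stand-ins, not the real transformers classes.

```python
class StandInLayoutLMv2Tokenizer:
    """Hypothetical stand-in for the slow LayoutLMv2 tokenizer."""


class StandInLayoutLMv2TokenizerFast:
    """Hypothetical stand-in for the fast LayoutLMv2 tokenizer."""


class StandInXLMRobertaTokenizerFast:
    """Hypothetical stand-in for the tokenizer used with LayoutXLM."""


class SketchProcessor:
    """Toy processor that only accepts the two LayoutLMv2 tokenizer types."""

    # Assumption: the real processor uses a comparable allow-list.
    accepted_tokenizers = (StandInLayoutLMv2Tokenizer, StandInLayoutLMv2TokenizerFast)

    def __init__(self, tokenizer):
        if not isinstance(tokenizer, self.accepted_tokenizers):
            names = ", ".join(cls.__name__ for cls in self.accepted_tokenizers)
            raise ValueError(
                f"tokenizer has to be one of ({names}), but is {type(tokenizer).__name__}"
            )
        self.tokenizer = tokenizer


# An accepted tokenizer type passes the check.
processor = SketchProcessor(StandInLayoutLMv2TokenizerFast())

# Any other tokenizer type is rejected at construction time.
try:
    SketchProcessor(StandInXLMRobertaTokenizerFast())
except ValueError as err:
    print(f"Rejected: {err}")
```

Under this assumption, supporting LayoutXLM means either widening the allow-list or providing a dedicated processor and tokenizer.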
Expected behavior
That the LayoutLMv2Processor accepts the XLMRobertaTokenizerFast.