huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

layoutlmv3-base-chinese tokenizer could not be loaded. #18307

Closed: pogevip closed this issue 2 years ago

pogevip commented 2 years ago

System Info

File ~/anaconda3/envs/paddle_env/lib/python3.8/site-packages/transformers/models/layoutlmv3/tokenization_layoutlmv3.py:325, in LayoutLMv3Tokenizer.__init__(self, vocab_file, merges_file, errors, bos_token, eos_token, sep_token, cls_token, unk_token, pad_token, mask_token, add_prefix_space, cls_token_box, sep_token_box, pad_token_box, pad_token_label, only_label_first_subword, **kwargs)
    305 mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token
    307 super().__init__(
    308     errors=errors,
    309     bos_token=bos_token,
   (...)
    322     **kwargs,
    323 )
--> 325 with open(vocab_file, encoding="utf-8") as vocab_handle:
    326     self.encoder = json.load(vocab_handle)
    327 self.decoder = {v: k for k, v in self.encoder.items()}

TypeError: expected str, bytes or os.PathLike object, not NoneType
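The traceback suggests that vocab_file reached open() as None, presumably because no vocab.json was found for this checkpoint. A minimal sketch that reproduces the same message with the standard library alone; the load_vocab helper is hypothetical and is not part of transformers:

```python
import json


def load_vocab(vocab_file):
    """Hypothetical stand-in for the two failing lines in the tokenizer's __init__."""
    # Same pattern as lines 325-326 of the traceback: open() rejects None
    # before any file is read.
    with open(vocab_file, encoding="utf-8") as vocab_handle:
        return json.load(vocab_handle)


try:
    load_vocab(None)  # what happens when no vocab file path was resolved
except TypeError as exc:
    message = str(exc)
    print(message)
```

If this reading is right, the error says nothing about the contents of the vocabulary. It only says that no path to a vocab file was passed in.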

Who can help?

No response

Information

Tasks

Reproduction

none

Expected behavior

from transformers import AutoProcessor, AutoModel, XLMRobertaTokenizer, LayoutLMv3
chinese_processor = AutoProcessor.from_pretrained("./layoutlmv3_base_chinese", apply_ocr=False, local_files_only=True) 

But it seems we need vocab.json and merges.txt to load LayoutLMv3Tokenizer. Could you provide a function to convert them, or confirm whether there is a difference between these two tokenizers?
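One way to narrow this down is to check which tokenizer files the local checkpoint directory actually contains. The sketch below uses only the standard library. The describe_tokenizer_files helper is hypothetical, and it is an assumption that the Chinese checkpoint ships a SentencePiece model rather than vocab.json and merges.txt. sentencepiece.bpe.model is the file name XLMRobertaTokenizer looks for.

```python
from pathlib import Path

# File sets are assumptions: vocab.json + merges.txt come from this thread,
# sentencepiece.bpe.model is the vocab file name used by XLMRobertaTokenizer.
BPE_FILES = ("vocab.json", "merges.txt")
SENTENCEPIECE_FILES = ("sentencepiece.bpe.model",)


def describe_tokenizer_files(checkpoint_dir):
    """Report which tokenizer file sets are fully present in a local directory."""
    root = Path(checkpoint_dir)
    present = {p.name for p in root.iterdir() if p.is_file()}
    return {
        "byte_level_bpe": all(name in present for name in BPE_FILES),
        "sentencepiece": all(name in present for name in SENTENCEPIECE_FILES),
    }
```

Calling describe_tokenizer_files("./layoutlmv3_base_chinese") returns a dict of two booleans. If only "sentencepiece" is True, the directory does not hold the files that LayoutLMv3Tokenizer opens, which would be consistent with the TypeError above.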

1362802590 commented 2 years ago

I have the same problem. Could you please solve it?

moyans commented 2 years ago

The problem still exists.

This runs successfully:

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

But loading "microsoft/layoutlmv3-base-chinese" fails:

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base-chinese", apply_ocr=False)

TypeError: expected str, bytes or os.PathLike object, not NoneType

Version: 4.22.0.dev0
Summary: State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow
Home-page: https://github.com/huggingface/transformers
Author: The Hugging Face team (past and future) with the help of all our contributors

pogevip commented 2 years ago

As far as I know, transformers doesn't support the Chinese LayoutLMv3 checkpoint, but unilm works: https://github.com/microsoft/unilm/tree/master/layoutlmv3

LIUYANZHI88 commented 2 years ago

As far as I know, transformers doesn't support the Chinese LayoutLMv3 checkpoint, but unilm works: https://github.com/microsoft/unilm/tree/master/layoutlmv3

But I see that it also requires vocab.json and merges.txt, so I cannot load the tokenizer either. https://github.com/microsoft/unilm/blob/master/layoutlmv3/layoutlmft/models/layoutlmv3/tokenization_layoutlmv3.py


How did you solve it, please?