huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Error while saving a variation of roberta-base fast tokenizer vocabulary #13443

Closed ryparmar closed 3 years ago

ryparmar commented 3 years ago

Information

I am unable to save the 'ufal/robeczech-base' fast tokenizer, which is a variation of RoBERTa. I tried the same minimal example (see below) with the non-fast (slow) tokenizer and it worked fine.

Error message with RUST_BACKTRACE=1:

thread '<unnamed>' panicked at 'no entry found for key', /__w/tokenizers/tokenizers/tokenizers/src/models/mod.rs:36:66
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/std/src/panicking.rs:493:5
   1: core::panicking::panic_fmt
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/panicking.rs:92:14
   2: core::option::expect_failed
             at /rustc/9bc8c42bb2f19e745a63f3445f1ac248fb015e53/library/core/src/option.rs:1321:5
   3: serde::ser::Serializer::collect_map
   4: <tokenizers::models::bpe::model::BPE as tokenizers::tokenizer::Model>::save
   5: <tokenizers::models::ModelWrapper as tokenizers::tokenizer::Model>::save
   6: tokenizers::models::PyModel::save
   7: tokenizers::models::__init2250971146856332535::__init2250971146856332535::__wrap::{{closure}}
   8: tokenizers::models::__init2250971146856332535::__init2250971146856332535::__wrap
   9: _PyMethodDef_RawFastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:694:23
  10: _PyCFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:734:14
  11: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4568:9
  12: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
  13: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  14: _PyFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
  15: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
  16: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
  17: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  18: _PyFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
  19: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
  20: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3139:19
  21: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  22: _PyFunction_FastCallKeywords
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Objects/call.c:433:12
  23: call_function
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:4616:17
  24: _PyEval_EvalFrameDefault
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3110:23
  25: _PyEval_EvalCodeWithName
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3930:14
  26: PyEval_EvalCodeEx
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:3959:12
  27: PyEval_EvalCode
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/ceval.c:524:12
  28: run_mod
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:1035:9
  29: PyRun_InteractiveOneObjectEx
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:256:9
  30: PyRun_InteractiveLoopFlags
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:120:15
  31: PyRun_AnyFileExFlags
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Python/pythonrun.c:78:19
  32: pymain_run_file
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:427:11
  33: pymain_run_filename
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:1606:22
  34: pymain_run_python
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:2867:9
  35: pymain_main
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:3028:5
  36: _Py_UnixMain
             at /tmp/eb-build/Python/3.7.4/GCCcore-8.3.0/Python-3.7.4/Modules/main.c:3063:12
  37: __libc_start_main
  38: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2034, in save_pretrained
    filename_prefix=filename_prefix,
  File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/tokenization_utils_fast.py", line 567, in _save_pretrained
    vocab_files = self.save_vocabulary(save_directory, filename_prefix=filename_prefix)
  File "/home/ryparmar/venv/NER/lib/python3.7/site-packages/transformers/models/gpt2/tokenization_gpt2_fast.py", line 177, in save_vocabulary
    files = self._tokenizer.model.save(save_directory, name=filename_prefix)
pyo3_runtime.PanicException: no entry found for key

Environment info

Who can help

@patrickvonplaten, @LysandreJik.

To reproduce

  1. Import model and tokenizer:

    from transformers import AutoTokenizer, AutoModelForMaskedLM  
    tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")  
    model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
  2. Save the tokenizer:

    tokenizer.save_pretrained('./')
LysandreJik commented 3 years ago

There seem to be some missing or non-consecutive tokens in the vocabulary of that tokenizer, causing the serialization to fail.
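
A quick way to check this locally, as a sketch that only relies on the public get_vocab() API (not part of any fix), is to look for gaps in the id range of the loaded fast tokenizer:

    from transformers import AutoTokenizer

    # Load the fast tokenizer whose vocabulary we want to inspect.
    tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")

    vocab = tokenizer.get_vocab()  # token -> id mapping
    id_set = set(vocab.values())

    # Ids that never occur leave the id range non-consecutive, which is
    # what the BPE serializer appears to stumble over.
    missing = [i for i in range(max(id_set) + 1) if i not in id_set]
    print(f"vocab size: {len(vocab)}, max id: {max(id_set)}, missing ids: {len(missing)}")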

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

sotander commented 3 years ago

The issue is still there. Is anyone able to save the tokenizer with tokenizer.save_pretrained()?

shudipta commented 2 years ago

I am facing the exact same problem while working with "sagorsarker/bangla-bert-base", using the same reproduction instructions provided by @ryparmar. I still have not been able to solve this issue, or even find the root cause of the error.

shudipta commented 2 years ago

I found that the problem comes from using the fast tokenizer, so I turned it off with the flag --use_fast_tokenizer=False, and it works. Though it is not the solution I want.

patrickvonplaten commented 2 years ago

Hey guys,

At the moment, it seems like we will have to fall back to the slow tokenizer for this one:

  1. Import model and tokenizer:

    from transformers import AutoTokenizer, AutoModelForMaskedLM  
    tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base", use_fast=False)
    model = AutoModelForMaskedLM.from_pretrained("ufal/robeczech-base")
  2. Save the tokenizer:

    tokenizer.save_pretrained('./')

works.

foxik commented 1 year ago

Hi all,

I just committed a working fast tokenizer to the HF ufal/robeczech-base repository, in case it helps someone (but loading a fast tokenizer from the previous repository content was working too).

The reason it cannot be saved is our own mistake (we are the authors of the ufal/robeczech-base model). During training, subwords not present in the training data were left out of the vocabulary, but ByteBPE requires the basic 256 subwords representing the 256 byte values, and some of them were left out. We therefore have multiple subwords mapped to id 3 (the id of the [UNK] token), which works fine during loading, but not during saving (only one subword with id 3 is saved).
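
If you want to see this on your side, here is a small sketch using only the public get_vocab() API (nothing specific to our fix) that lists the ids shared by more than one subword:

    from collections import defaultdict
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("ufal/robeczech-base")

    # Group subwords by the id they map to; a healthy BPE vocabulary has
    # exactly one subword per id.
    by_id = defaultdict(list)
    for token, idx in tokenizer.get_vocab().items():
        by_id[idx].append(token)

    duplicates = {idx: toks for idx, toks in by_id.items() if len(toks) > 1}
    print(f"{len(duplicates)} ids are shared by multiple subwords")
    # With the old vocabulary, id 3 ([UNK]) should show up here; with the
    # newly committed tokenizer it should not.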

Sorry for the trouble...