huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

pyo3_runtime.PanicException: Missing additional token #876

Closed · chinoll closed 3 months ago

chinoll commented 2 years ago

python code:

from tokenizers import Tokenizer
tokenizer = Tokenizer.from_pretrained("bert-base-chinese")
tokenizer.train(["001.txt"])

error:

thread '<unnamed>' panicked at 'Missing additional token', /__w/tokenizers/tokenizers/tokenizers/src/tokenizer/added_vocabulary.rs:293:26
stack backtrace:
   0: rust_begin_unwind
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:517:5
   1: core::panicking::panic_fmt
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/panicking.rs:100:14
   2: core::panicking::panic_display
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/panicking.rs:64:5
   3: core::option::expect_failed
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/option.rs:1638:5
   4: core::ops::function::impls::<impl core::ops::function::FnMut<A> for &mut F>::call_mut
   5: <core::iter::adapters::chain::Chain<A,B> as core::iter::traits::iterator::Iterator>::fold
   6: tokenizers::tokenizer::added_vocabulary::AddedVocabulary::add_tokens
   7: tokenizers::utils::iter::ResultShunt<I,E>::process
   8: tokenizers::tokenizer::TokenizerImpl<M,N,PT,PP,D>::train_from_files
   9: pyo3::python::Python::allow_threads
  10: tokenizers::tokenizer::PyTokenizer::train
  11: tokenizers::tokenizer::__init6664333195501329398::__wrap::{{closure}}
  12: tokenizers::tokenizer::__init6664333195501329398::__wrap
  13: method_vectorcall_VARARGS_KEYWORDS
             at /tmp/build/80754af9/python-split_1631797238431/work/Objects/descrobject.c:346
  14: _PyObject_VectorcallTstate
             at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:118
  15: PyObject_Vectorcall
             at /tmp/build/80754af9/python-split_1631797238431/work/Include/cpython/abstract.h:127
  16: call_function
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:5075
  17: _PyEval_EvalFrameDefault
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:3504
  18: _PyEval_EvalFrame
             at /tmp/build/80754af9/python-split_1631797238431/work/Include/internal/pycore_ceval.h:40
  19: _PyEval_EvalCode
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4327
  20: _PyEval_EvalCodeWithName
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4359
  21: PyEval_EvalCodeEx
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:4375
  22: PyEval_EvalCode
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/ceval.c:826
  23: run_eval_code_obj
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:1219
  24: run_mod
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:1240
  25: PyRun_InteractiveOneObjectEx
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:273
  26: PyRun_InteractiveLoopFlags
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:126
  27: PyRun_AnyFileExFlags
             at /tmp/build/80754af9/python-split_1631797238431/work/Python/pythonrun.c:85
  28: pymain_run_stdin
             at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:518
  29: pymain_run_python
             at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:607
  30: Py_RunMain
             at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:683
  31: Py_BytesMain
             at /tmp/build/80754af9/python-split_1631797238431/work/Modules/main.c:1129
  32: __libc_start_main
  33: <unknown>
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
pyo3_runtime.PanicException: Missing additional token
Narsil commented 2 years ago

@chinoll I confirm this issue.

It is linked to an inconsistency in how the trainer is automatically created when called from your code.

A quick fix for your use case is this:

from tokenizers import Tokenizer, trainers

tokenizer = Tokenizer.from_pretrained("bert-base-chinese")
tokenizer.save("tok.json")
special_tokens = [
    "[PAD]",
    "[UNK]",
    "[CLS]",
    "[SEP]",
    "[MASK]",
]
trainer = trainers.WordPieceTrainer(
    special_tokens=special_tokens,
)
tokenizer.train(["data/big.txt"], trainer=trainer)

Basically, simply add the special_tokens back to the trainer.

Please note that since you are retraining, the IDs of those tokens will probably change (they are IDs 0, 100, and 101-103 in the original bert-base-chinese).

chinoll commented 2 years ago

Thanks, this solved my problem

Narsil commented 2 years ago

I will reopen this issue if you don't mind, since the easy fix works but is not the end of it.

IMHO, the code you submitted should work out of the box; at the very least, no panic should ever occur (a regular exception might be acceptable).

This issue will help when work starts on the tokenizer/trainer information flow that we need to fix, and/or on tokenizer fine-tuning. A lot of issues seem to be popping up in this area, and fixing them properly (not with a modified script that requires expert knowledge) seems like a good idea.

Glad it solved your problem though.

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.