huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

ByteLevelBPE training error after adding normalizers.Replace #1348

Closed Byshev333 closed 8 months ago

Byshev333 commented 11 months ago

Recently I used ByteLevelBPETokenizer to train a tokenizer and set add_prefix_space to True during training. Later I found that adding a prefix space is reasonable for English, but there is no need to add one for Chinese, Japanese, or Korean. So I set tokenizer.normalizer = normalizers.Replace(pattern=tokenizers.Regex(r"^(?=\p{Latin})"), content=' ') and add_prefix_space=False to get that behavior. But during training, an error was reported:

Traceback (most recent call last):
  File "trainer_bbpe_kwai.py", line 119, in <module>
    tokenizer.train(
  File "/share/miniconda3/envs/hf_tokenizers/lib/python3.8/site-packages/tokenizers/implementations/byte_level_bpe.py", line 98, in train
    self._tokenizer.train(files, trainer=trainer)
pyo3_runtime.PanicException: index out of bounds: the len is 39 but the index is 39
thread '<unnamed>' panicked at 'index out of bounds: the len is 35 but the index is 35', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 115 but the index is 115', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 16 but the index is 16', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 4 but the index is 4', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 4 but the index is 4', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 32 but the index is 32', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 13 but the index is 13', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 7 but the index is 7', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21
thread '<unnamed>' panicked at 'index out of bounds: the len is 2 but the index is 2', /home/runner/work/tokenizers/tokenizers/tokenizers/src/tokenizer/normalizer.rs:382:21

How can we solve this problem? See the training code. Thanks for your attention to this issue.

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ArthurZucker commented 9 months ago

Sorry, I have not had a look yet, but the normalizer is of course the cause here. I'm not sure I'll have the time to debug this. @Narsil, if anything comes to your mind!
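
Since the normalizer is the suspected cause, one possible approach (not confirmed anywhere in this thread) is to apply the same rule to the raw text before it reaches the tokenizer, instead of going through normalizers.Replace. The sketch below uses only the Python standard library. The helper names starts_with_latin and add_latin_prefix_space are hypothetical, and the name-based check only approximates \p{Latin} (for example, it does not treat fullwidth Latin letters as Latin).

```python
import unicodedata


def starts_with_latin(text: str) -> bool:
    """Return True if the first character looks like a Latin-script letter.

    Approximation of the regex lookahead ^(?=\\p{Latin}): it relies on the
    Unicode character name starting with "LATIN", which is an assumption
    and not an exact script-property test.
    """
    if not text:
        return False
    first = text[0]
    # Passing "" as the default avoids ValueError for unnamed code points.
    return first.isalpha() and unicodedata.name(first, "").startswith("LATIN")


def add_latin_prefix_space(text: str) -> str:
    """Prepend a single space only when the text begins with a Latin letter."""
    return " " + text if starts_with_latin(text) else text


if __name__ == "__main__":
    for sample in ["hello world", "你好世界", "こんにちは", "안녕하세요"]:
        print(repr(add_latin_prefix_space(sample)))
```

The processed lines could then be written to the training files or fed to training with add_prefix_space=False, so that only Latin-initial text receives the leading space. Whether this sidesteps the panic above has not been verified here.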

github-actions[bot] commented 8 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.