Open Moddus opened 7 months ago
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Yes, your analysis makes sense! TL;DR: support for byte-fallback training was not added, and we should do so!
Hi,
I'm using tokenizers version 0.19.1 and would like to train a unigram tokenizer using `byte_fallback`. Inspired by the unit test https://github.com/huggingface/tokenizers/blob/main/bindings/python/tests/bindings/test_tokenizer.py#L434-L458, I created the following snippet:

which outputs the following:

where I can find `"byte_fallback": false`.
To check if byte fallback is happening:

I guess it is likely that the `UnigramTrainer` in `do_train` defaults `byte_fallback` to false in:

Does the analysis make sense?
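To illustrate what byte fallback should produce when it is enabled: an out-of-vocabulary character is decomposed into its UTF-8 bytes, each mapped to a `<0xNN>` token. A plain-Python illustration of that convention (not the library's own code):

```python
def byte_fallback_tokens(text: str) -> list[str]:
    """Map each UTF-8 byte of `text` to a <0xNN> token string."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]

# "雨" (U+96E8) is three bytes in UTF-8, so it falls back to three byte tokens.
print(byte_fallback_tokens("雨"))  # ['<0xE9>', '<0x9B>', '<0xA8>']
```

With `byte_fallback` disabled (the current default in `UnigramTrainer`), such characters map to `<unk>` instead of these byte tokens, which is what the check above would reveal.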
Best