Closed Byshev333 closed 8 months ago
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
Sorry, I did not have a look, but the normalizer is of course the cause here. Not sure I'll have the time to debug this; @Narsil, if anything comes to your mind!
Recently I used `ByteLevelBPETokenizer` for tokenizer training and set `add_prefix_space` to `True` during training. Later I found that adding a prefix space is reasonable for English, but there is actually no need for one in Chinese, Japanese, and Korean. So I set `tokenizer.normalizer = normalizers.Replace(pattern=tokenizers.Regex(r"^(?=\p{Latin})"), content=' ')` and set `add_prefix_space=False` to achieve this behavior. But during training, an error was reported. How can we solve this problem? training code Thanks for your attention to this issue.
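For what it's worth, the intended behavior of that `Replace` normalizer can be sketched in plain Python. Note this is a hypothetical helper (`add_prefix_space_if_latin` is not a library function), and the ASCII range `[A-Za-z]` only approximates the Unicode property `\p{Latin}`, which Python's stdlib `re` cannot match:

```python
import re

def add_prefix_space_if_latin(text: str) -> str:
    # Mimic Replace(pattern=Regex(r"^(?=\p{Latin})"), content=' '):
    # the zero-width lookahead matches at position 0 only when the
    # first character is a Latin letter, so the replacement effectively
    # prepends a single space in that case and leaves CJK text alone.
    if re.match(r"[A-Za-z]", text):
        return " " + text
    return text

print(repr(add_prefix_space_if_latin("hello")))  # ' hello'
print(repr(add_prefix_space_if_latin("你好")))    # unchanged
```

This is only a sketch of the normalization logic, not a drop-in replacement for the tokenizer pipeline; the training error itself would still need to be debugged inside the library.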