kitkhai opened this issue 11 months ago
Hey! Thanks for raising the issue. This is pretty much a duplicate of #26318 and will be fixed by #27883! You can already work around this by following the tutorial here: https://github.com/huggingface/tokenizers/pull/1357
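Roughly, the workaround there amounts to overriding the fast tokenizer's Metaspace pre-tokenizer so that the prefix space is only prepended to the first word. A minimal sketch, assuming a SentencePiece-based checkpoint such as t5-small (the checkpoint is a placeholder, and the exact Metaspace arguments depend on your tokenizers version):

```python
from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

# Placeholder checkpoint; any SentencePiece-based fast tokenizer (T5, Llama, ...) is similar.
tok = AutoTokenizer.from_pretrained("t5-small")

# Prepend the "▁" prefix only to the first word of the input, instead of to
# every split created around added tokens.
tok.backend_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", add_prefix_space=True, prepend_scheme="first"
)
```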
The PR was merged, let me check if this is fixed!
Okay, not fixed yet. I'll include it in #27717
Hi @ArthurZucker, excited to hear about the progress! :)
Also, I realised that there can be multiple added tokens (e.g. `abcd` and `gyma`) that overlap, in the sense that both could be segmented out of the same word `gymabcd`. It seems like the current implementation just goes from left to right and picks whichever added token appears first.
Instead of going left to right, is there a way to control which added token should take precedence and be segmented first?
Just a reminder that I am thinking of the Chinese language, where words are not separated by spaces, hence my seemingly weird example of `gymabcd`.
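For concreteness, something like the sketch below is what I have in mind (the checkpoint is arbitrary, and the exact output will depend on the tokenizer and library version):

```python
from transformers import AutoTokenizer

# Arbitrary checkpoint for illustration; the two added tokens overlap inside "gymabcd".
tok = AutoTokenizer.from_pretrained("t5-small")
tok.add_tokens(["abcd", "gyma"])

# Which added token wins here: "gyma" (leftmost match) or "abcd"?
print(tok.tokenize("gymabcd"))
```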
There is no real way to do that yet, I think we check the longest first.
That is not fixed yet, but can be fixed kind of manually if we follow what was done for SpmConverters:
```python
def pre_tokenizer(self, replacement, add_prefix_space):
    # Default: prepend the replacement character ("▁") to every split.
    prepend_scheme = "always"
    # When the original (slow) tokenizer was created with legacy=False,
    # only prepend it to the first word of the input.
    if hasattr(self.original_tokenizer, "legacy") and not self.original_tokenizer.legacy:
        prepend_scheme = "first"
    return pre_tokenizers.Metaspace(
        replacement=replacement, add_prefix_space=add_prefix_space, prepend_scheme=prepend_scheme
    )
```
and setting `legacy` to `False` for the fast tokenizer.
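Concretely, that could look something like the sketch below (the checkpoint is a placeholder; passing `from_slow=True` is meant to force the fast tokenizer to be rebuilt from the slow one so the new prepend scheme is picked up, and the exact kwargs may vary by version):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; legacy is only meaningful for SentencePiece-based tokenizers (T5, Llama, ...).
# legacy=False makes the converter above choose prepend_scheme="first".
tok = AutoTokenizer.from_pretrained("t5-small", legacy=False, from_slow=True)
```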
I'll add this as a good difficult issue, as this should be similar to #26678
System Info
transformers version: 4.35.2
Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
The output from my code:
The original post where I raised this potential bug and was asked to file an issue is here: https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/5
For context, my original goal is to add Chinese tokens to the tokenizer; however, for illustration purposes, I have demonstrated the “bug” in English. Chinese words are not separated by spaces, which is why the example shows me adding a token that is a subword.
Evidently, tokenizer.add_tokens() works well when the added token is always followed by a space, but it does not work as intended when there is no space after the added token: the tokenizer then introduces an additional space on its own.
I read the docs and figured this is probably because added tokens are isolated (split out) before the underlying tokenization algorithm is applied, hence I am not 100% sure whether this behaviour of the tokenizer is intended.
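A minimal sketch of the behaviour I am describing (the checkpoint and strings here are illustrative, not my original reproduction, and the exact tokens printed will vary by version):

```python
from transformers import AutoTokenizer

# Illustrative SentencePiece-based checkpoint; the point is adding a subword token.
tok = AutoTokenizer.from_pretrained("t5-small")
tok.add_tokens(["gyma"])

# Added token followed by a space: segmented as expected.
print(tok.tokenize("gyma stics"))

# Added token with no space after it: the remainder "stics" may be tokenized
# as if a space preceded it, i.e. an extra "▁" is introduced.
print(tok.tokenize("gymastics"))
```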