kitkhai opened this issue 11 months ago
Hey! Thanks for raising the issue. This is pretty much a duplicate of #26318 and will be fixed by #27883! You can already work around this by following the tutorial here: https://github.com/huggingface/tokenizers/pull/1357
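Roughly, the workaround there amounts to overriding the fast tokenizer's Metaspace pre-tokenizer so that the prefix space is only prepended to the first word. A minimal sketch, assuming a SentencePiece-based checkpoint such as t5-small (the checkpoint is a placeholder, and the exact Metaspace arguments depend on your tokenizers version):

```python
from transformers import AutoTokenizer
from tokenizers import pre_tokenizers

# Placeholder checkpoint; any SentencePiece-based fast tokenizer (T5, Llama, ...) is similar.
tok = AutoTokenizer.from_pretrained("t5-small")

# Prepend the "▁" prefix only to the first word of the input, instead of to
# every split created around added tokens.
tok.backend_tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
    replacement="▁", add_prefix_space=True, prepend_scheme="first"
)
```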
The PR was merged, let me check if this is fixed!
Okay, not fixed yet. I'll include it in #27717
Hi @ArthurZucker, excited to hear about the progress! :)
Also, I realised that there can be multiple added tokens (e.g. `abcd` and `gyma`) that overlap, in the sense that both could be segmented out of the same word `gymabcd`. It seems like the current implementation just goes from left to right and picks whichever added token appears first.
Instead of going left to right, is there a way to control which added token should take precedence and be segmented first?
Just a reminder that I am thinking of the Chinese language, where words are not separated by spaces, hence my seemingly weird example of `gymabcd`.
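For concreteness, something like the sketch below is what I have in mind (the checkpoint is arbitrary, and the exact output will depend on the tokenizer and library version):

```python
from transformers import AutoTokenizer

# Arbitrary checkpoint for illustration; the two added tokens overlap inside "gymabcd".
tok = AutoTokenizer.from_pretrained("t5-small")
tok.add_tokens(["abcd", "gyma"])

# Which added token wins here: "gyma" (leftmost match) or "abcd"?
print(tok.tokenize("gymabcd"))
```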
There is no real way to do that yet, I think we check the longest first.
That is not fixed yet, but can be fixed kind of manually if we follow what was done for SpmConverters:
```python
def pre_tokenizer(self, replacement, add_prefix_space):
    # Default: prepend the replacement character ("▁") to every split.
    prepend_scheme = "always"
    # When the original (slow) tokenizer was created with legacy=False,
    # only prepend it to the first word of the input.
    if hasattr(self.original_tokenizer, "legacy") and not self.original_tokenizer.legacy:
        prepend_scheme = "first"
    return pre_tokenizers.Metaspace(
        replacement=replacement, add_prefix_space=add_prefix_space, prepend_scheme=prepend_scheme
    )
```
and setting `legacy` to `False` for the fast tokenizer.
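Concretely, that could look something like the sketch below (the checkpoint is a placeholder; passing `from_slow=True` is meant to force the fast tokenizer to be rebuilt from the slow one so the new prepend scheme is picked up, and the exact kwargs may vary by version):

```python
from transformers import AutoTokenizer

# Placeholder checkpoint; legacy is only meaningful for SentencePiece-based tokenizers (T5, Llama, ...).
# legacy=False makes the converter above choose prepend_scheme="first".
tok = AutoTokenizer.from_pretrained("t5-small", legacy=False, from_slow=True)
```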
I'll add this as a good difficult issue, as this should be similar to #26678
System Info
transformers version: 4.35.2
Who can help?
@ArthurZucker
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Expected behavior
The output from my code:
The original post where I raised this potential bug and was asked to file an issue is here: https://discuss.huggingface.co/t/tokenizer-shrinking-recipes/8564/5
For context, my original goal is to add Chinese tokens to the tokenizer; however, for illustration purposes, I have demonstrated the “bug” in English. Chinese words are not separated by spaces, which is why the example shows me adding a token that is a subword.
Evidently, tokenizer.add_tokens() works well when the added token is always followed by a space, but it does not work as intended when there is no space after the added token: the tokenizer then introduces an additional space on its own.
I read the docs and figured this is probably because added tokens are isolated (split out) before the underlying tokenization algorithm is applied, hence I am not 100% sure whether this behaviour of the tokenizer is intended.
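A minimal sketch of the behaviour I am describing (the checkpoint and strings here are illustrative, not my original reproduction, and the exact tokens printed will vary by version):

```python
from transformers import AutoTokenizer

# Illustrative SentencePiece-based checkpoint; the point is adding a subword token.
tok = AutoTokenizer.from_pretrained("t5-small")
tok.add_tokens(["gyma"])

# Added token followed by a space: segmented as expected.
print(tok.tokenize("gyma stics"))

# Added token with no space after it: the remainder "stics" may be tokenized
# as if a space preceded it, i.e. an extra "▁" is introduced.
print(tok.tokenize("gymastics"))
```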