huggingface / tokenizers

πŸ’₯ Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Add tokens not impacted by training #1380

Closed · StellaAthena closed this issue 9 months ago

StellaAthena commented 11 months ago

I have a corpus I want to tokenize, and I know a priori certain tokens that should be in my vocabulary. It's important to me not only that they get tokenized as single tokens, but also that those tokens aren't later merged with other tokens. A good example of this is single-digit numerical tokens: I want to seed the training process with 1, 2, 3, etc., but I don't want the trainer to look at the large number of consecutive 1 and 2 tokens and combine them into a 12 token.

It looks like this behavior is supported, but only if my custom tokens are added with add_special_tokens. That will cause them to be ignored in some decoding contexts though, which I very much do not want. But if I just use add_tokens, then the tokenizer may merge the tokens with other symbols while training.
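
For concreteness, a minimal sketch of the decoding concern, using the tokenizers Tokenizer API (the empty BPE model and the digit tokens are purely illustrative):

from tokenizers import Tokenizer, models

tok = Tokenizer(models.BPE())                      # empty model; only the added tokens matter here
tok.add_special_tokens([str(d) for d in range(10)])

ids = tok.encode("123").ids
print(tok.decode(ids))                             # the special digits are skipped by default
print(tok.decode(ids, skip_special_tokens=False))  # they only come back when explicitly kept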

It seems like the best approximation right now would be to do:

  1. Seed with custom tokens
  2. Train tokenizer
  3. Delete all tokens that contain custom tokens as a substring
  4. Re-add custom tokens

However this isn't identical to the behavior I describe above.
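
A rough sketch of that four-step workaround, assuming a plain BPE model trained with the tokenizers library (the corpus and vocabulary size are placeholders, and the deletion step goes through the serialized JSON since the trained model cannot be edited in place):

import json
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

CUSTOM = [str(d) for d in range(10)]

# 1. + 2. Seed the trainer with the custom tokens and train
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=1000, initial_alphabet=CUSTOM)
tokenizer.train_from_iterator(["1 2 3 12 123 456"] * 200, trainer)

# 3. Delete every learned token that contains a custom token as a substring
state = json.loads(tokenizer.to_str())
vocab = state["model"]["vocab"]
bad = {t for t in vocab if t not in CUSTOM and any(c in t for c in CUSTOM)}
kept = [t for t in sorted(vocab, key=vocab.get) if t not in bad]
state["model"]["vocab"] = {t: i for i, t in enumerate(kept)}  # re-index contiguously

def merge_parts(m):
    # merges are serialized either as "left right" strings or as ["left", "right"] pairs
    return m if isinstance(m, list) else m.split(" ", 1)

state["model"]["merges"] = [
    m for m in state["model"]["merges"]
    if not (set(merge_parts(m)) & bad) and "".join(merge_parts(m)) not in bad
]

# 4. Re-add the custom tokens so they are matched before the BPE model runs
clean = Tokenizer.from_str(json.dumps(state))
clean.add_tokens(CUSTOM)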

I thought about just adding the tokens after training, but that also doesn't seem to work. In fact, that's how the GPT-NeoX tokenizer obtained the now-famous bug in which it tokenizes numbers as pairs of digits: the tokenizer has tokens for all singleton digits and all pairs of digits, and apparently the way the tiebreaking works causes it to turn 12345 into 12, 34, 5.
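
As a small illustration of that failure mode, here is a sketch that mimics it with the tokenizers API by registering singleton and paired digits as added tokens on top of an empty, purely illustrative model:

from tokenizers import Tokenizer, models

digits = [str(d) for d in range(10)]
pairs = [a + b for a in digits for b in digits]

tok = Tokenizer(models.BPE())   # stand-in for an already-trained model
tok.add_tokens(digits + pairs)  # digit tokens only added after training

# If the added-token matcher picks the longest token at each position,
# this reproduces the pairwise split described above.
print(tok.encode("12345").tokens)  # e.g. ['12', '34', '5']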

ArthurZucker commented 11 months ago

Hey! Thanks for opening an issue 🤗 From a quick look (if I am wrong I'll deep dive of course) it seems that this should be resolved easily. I don't know how you added the tokens, but the main difference between add_special_tokens and add_tokens is the default for the AddedToken class. The default if you use add_special_tokens(["Hey"]) is to add the token as AddedToken("Hey", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True), while if you add it with add_tokens(["Hey"]) it will be added as AddedToken("Hey", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False).
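
Spelled out with the tokenizers Tokenizer API, the two defaults amount to the following (a sketch; "Hey" is just an example token):

from tokenizers import AddedToken

# what add_special_tokens(["Hey"]) registers by default
tokenizer.add_special_tokens([AddedToken("Hey", rstrip=False, lstrip=False,
                                         single_word=False, normalized=False,
                                         special=True)])

# what add_tokens(["Hey"]) registers by default
tokenizer.add_tokens([AddedToken("Hey", rstrip=False, lstrip=False,
                                 single_word=False, normalized=True,
                                 special=False)])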

Two easy ways to check this: tokenizer.get_added_tokens_decoder() should show the content of the added vocab with the specifics. If you are using transformers' PreTrainedTokenizerFast, then tokenizer.added_tokens_decoder is a shortcut to access this. I would suggest doing

>>> from tokenizers import AddedToken
>>> tokenizer.add_tokens([AddedToken("my_token", normalized=False, special=False)])

instead of

>>> tokenizer.add_tokens("my_token")

Really sorry if that is already what you are doing. (If you are using Llama, this can make a huge difference, as normalization adds the prefix space.)
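
A quick way to confirm what actually got registered, using the accessor mentioned above (the token and the id shown are only illustrative):

from tokenizers import AddedToken

tokenizer.add_tokens([AddedToken("my_token", normalized=False, special=False)])
print(tokenizer.get_added_tokens_decoder())
# e.g. {32000: AddedToken("my_token", rstrip=False, lstrip=False,
#                         single_word=False, normalized=False, special=False)}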

StellaAthena commented 11 months ago

> If you are using transformers' PreTrainedTokenizerFast, then tokenizer.added_tokens_decoder is a shortcut to access this. I would suggest doing
>
> >>> from tokenizers import AddedToken
> >>> tokenizer.add_tokens([AddedToken("my_token", normalized=False, special=False)])
>
> instead of
>
> >>> tokenizer.add_tokens(["my_token"])

Can you explain why this will make it so that, when I then train my BPE tokenizer, my_token will not be merged with other tokens? It's not obvious to me that normalized=False does that.

ArthurZucker commented 11 months ago

It depends on the tokenizer that you are using 😓 Could you share this with me? For Llama, normalized=True transforms the content of the tokens. So instead of 1 or 2, the tokens that are not going to be split are first normalized, thus ▁1 and ▁2 will not be merged, but 1 and 2 will be, because when you split the input sequence the normalizer is applied on the split, but ▁ is only added at the beginning.

So: Hey 123 -> ▁Hey▁123 -> ▁He y ▁1 2 3. In this list there is only one added token (▁1).
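
To make the ▁ behaviour concrete, here is a tiny check of just the normalization step, assuming a sentencepiece-style normalizer built from Prepend and Replace (an approximation of what the Llama tokenizer ships):

from tokenizers import normalizers

norm = normalizers.Sequence(
    [normalizers.Prepend("▁"), normalizers.Replace(" ", "▁")]
)
print(norm.normalize_str("Hey 123"))  # '▁Hey▁123'
print(norm.normalize_str("1"))        # '▁1', i.e. what a normalized added token's content becomes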

StellaAthena commented 11 months ago

> It depends on the tokenizer that you are using 😓 Could you share this with me? For Llama, normalized=True transforms the content of the tokens. So instead of 1 or 2, the tokens that are not going to be split are first normalized, thus ▁1 and ▁2 will not be merged, but 1 and 2 will be, because when you split the input sequence the normalizer is applied on the split, but ▁ is only added at the beginning.
>
> So: Hey 123 -> ▁Hey▁123 -> ▁He y ▁1 2 3. In this list there is only one added token (▁1).

It's going to be largely similar to the GPT-2 tokenizer.

ArthurZucker commented 11 months ago

Then feel free to ping me again if this doesn't work; I'll try to help as best I can.

github-actions[bot] commented 10 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.