Closed StellaAthena closed 9 months ago
Hey! Thanks for opening an issue 🤗
From a quick look (if I am wrong I'll deep dive, of course) it seems that this should be resolved easily.
I don't know how you added the tokens, but the main difference between `add_special_tokens` and `add_tokens` is the default for the `AddedToken` class. The default if you use `add_special_tokens(["Hey"])` is to add the token as `AddedToken("Hey", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True)`, while if you add it with `add_tokens(["Hey"])` it will be added as `AddedToken("Hey", rstrip=False, lstrip=False, single_word=False, normalized=True, special=False)`.
Two easy ways to check this: `tokenizer.get_added_tokens_decoder()` should show the content of the `added_vocab` with the specifics. If you are using `transformers`'s `PreTrainedTokenizerFast`, then `tokenizer.added_tokens_decoder` is a shortcut to access this.
I would suggest doing
>>> tokenizer.add_tokens(AddedToken("my_token", normalized=False, special=False))
instead of
>>> tokenizer.add_tokens("my_token")
Really sorry if that is already what you are doing. (If you are using Llama, this can make a huge difference, as the normalizer adds the prefix space.)
Can you explain why this will make it so that when I then train my BPE tokenizer, `my_token` will not be merged with other tokens? It's not obvious to me that `normalized=False` does that.
It depends on the tokenizer that you are using 😉 Could you share this with me?
For Llama, `normalized=True` transforms the content of the tokens. So instead of `1` or `2`, the tokens that are not going to be split are first normalized; thus `▁1` and `▁2` will not be merged, but `1` and `2` will be, because when you split the input sequence, the normalizer is applied on each split, but `▁` is only added at the beginning.
So: `Hey 123` -> `▁Hey▁123` -> `[▁He, y, ▁1, 2, 3]`. In this list there is only one special token.
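The prefix-space behaviour can be reproduced with the normalizer components themselves. This is a sketch approximating Llama's normalizer, which (roughly) prepends `▁` and replaces spaces with `▁`:

```python
from tokenizers.normalizers import Prepend, Replace, Sequence

# Approximation of the Llama normalizer: prepend "▁", then map spaces to "▁"
norm = Sequence([Prepend("▁"), Replace(" ", "▁")])

print(norm.normalize_str("Hey 123"))  # "▁Hey▁123"
```

Since `▁` is only prepended once and then substituted for spaces, only the first digit in a run like `123` ever carries the `▁` prefix, which is why a token stored as `▁1` behaves differently from a bare `1`.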
It's going to be largely similar to the GPT-2 tokenizer.
Then feel free to ping me again if this doesn't work; I'll try to help as best I can.
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
I have a corpus I want to tokenize, and I know a priori certain tokens that should be in my vocabulary. It's important to me not only that they get tokenized as a single token but also that that token isn't later merged with other tokens. A good example of this is single-digit numerical tokens. I want to seed the training process with `1`, `2`, `3`, etc., but I don't want the trainer to look at the large number of consecutive `1` and `2` tokens and combine them into a `12` token.

It looks like this behavior is supported, but only if my custom tokens are added with `add_special_tokens`. This will cause them to be ignored in some decoding contexts though, which I very much do not want. But if I just use `add_tokens`, then the tokenizer may merge the tokens with other symbols while training.

It seems like the best approximation right now would be to do:
However this isn't identical to the behavior I describe above.
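To make the decoding concern concrete, here is a small sketch (toy character-level vocab and hypothetical token choices) showing how tokens added via `add_special_tokens` get dropped by `skip_special_tokens` while regular added tokens survive:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import BPE

# Toy character-level BPE vocab with no merges
tok = Tokenizer(BPE({"h": 0, "e": 1, "y": 2}, []))

tok.add_special_tokens(["1"])                                       # special=True
tok.add_tokens([AddedToken("2", normalized=False, special=False)])  # special=False

ids = tok.encode("hey12").ids
kept = tok.decode(ids, skip_special_tokens=True)
both = tok.decode(ids, skip_special_tokens=False)
# "1" is dropped from `kept` but present in `both`; "2" appears in both
print(kept)
print(both)
```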
I thought about just adding the tokens after training, but that also doesn't seem to work. In fact, that's how the GPT-NeoX tokenizer obtained the now-famous bug in which it tokenizes numbers as pairs of digits: the tokenizer has tokens for all singleton digits and all pairs of digits, and apparently the way the tiebreaking works causes it to turn `12345` into `12`, `34`, `5`.
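That tiebreaking can be reproduced with a toy BPE (hypothetical vocab and merge ranks, just to illustrate the mechanism): because the merge `("1", "2")` has a lower rank than `("2", "3")`, it wins, and `12345` comes out as `12`, `34`, `5`.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE

# Toy vocab: all single digits plus some digit pairs, as in GPT-NeoX
vocab = {str(d): d for d in range(10)}
vocab.update({"12": 10, "34": 11, "23": 12, "45": 13})
# Merge rank order decides tiebreaks: ("1", "2") is applied before ("2", "3")
merges = [("1", "2"), ("3", "4"), ("2", "3"), ("4", "5")]

tok = Tokenizer(BPE(vocab, merges))
print(tok.encode("12345").tokens)  # ['12', '34', '5']
```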