Closed jneuff closed 7 months ago
Mmm, the token when added should be added at the end of the vocab, not the beginning. In that case, since the vocab is none, it is probably erroring out. Is the actual use case also with an empty vocab?
The actual use-case is not with an empty vocab. But the result is the same: the token is added to the `added_tokens` and not to `model.vocab`, and we get the warning from above.
For a token that is just there to pad the vocab size, that makes sense. But that is probably not the most common reason to use `add_special_tokens`. So the result seems really strange to me.
I'd like to understand: are added tokens supposed to end up in `model.vocab` or not? Either way, when using the canonical methods to do this, I'd like to not end up with a tokenizer that produces warnings.
I tried to find out if there is any difference in tokenization/detokenization behavior depending on whether the added tokens are part of `model.vocab` or not. As far as I can see, it does not make any difference at all.
So my conclusion would be that `add_special_tokens` should definitely add the tokens to `model.vocab`.
Oh sorry, I think I might understand what is going on here. Because the index in the saved file is 0, and the length of the vocab is probably bigger, it's gonna look for vocab[0] and probably see that it is different (a different token). When adding tokens, they should usually be added at the end of the vocab. If you use `tokenizer.add_tokens(["hey"])`, the only way for the index of this token to be < vocab size is if it's already part of the vocab.
How did you add the tokens?
For the above example, I used this code:
```rust
use tokenizers::{AddedToken, Tokenizer};

fn main() {
    // Load, add the placeholder as a special token, and save back in place.
    let mut t = Tokenizer::from_file("./my-tokenizer.json").unwrap();
    t.add_special_tokens(&[AddedToken::from("my-placeholder", true)]);
    std::fs::write("./my-tokenizer.json", t.to_string(true).unwrap()).unwrap();
}
```
The vocab was empty before, so the token was not already present. With the tokenizer that actually caused this problem, the vocab was not empty before, but the token was definitely not already present.
So is this a bug in `add_special_tokens`? Should it not add the token to the vocab as well?
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.
We have a use-case where we have a trained tokenizer and later want to add some tokens to get the vocab size up to a certain number. If we use `add_tokens` or `add_special_tokens`, we end up with something like this:

When loading that tokenizer, e.g., we get a warning:

What is your recommendation to get rid of that warning?
One thing I can do is to also add the token to the `model.vocab`. But 1. I don't know how to do this aside from manually editing the JSON file (which feels awkward), and 2. conceptually, not having an ID for that token makes sense for our use-case, as it is only in there to get to a certain vocab size.