huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Warnings for added tokens not present in the vocab #1366

Closed jneuff closed 7 months ago

jneuff commented 9 months ago

We have a use-case where we have a trained tokenizer and later want to add some tokens to get the vocab size up to a certain number. If we use add_tokens or add_special_tokens we end up with something like this:

{
  "version": "1.0",
  "truncation": null,
  "padding": null,
  "added_tokens": [
    {
      "id": 0,
      "content": "my-place-holder",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    }
  ],
  "normalizer": null,
  "pre_tokenizer": null,
  "post_processor": null,
  "decoder": null,
  "model": {
    "type": "BPE",
    "dropout": null,
    "unk_token": "[UNK]",
    "continuing_subword_prefix": null,
    "end_of_word_suffix": null,
    "fuse_unk": false,
    "byte_fallback": false,
    "vocab": {},
    "merges": []
  }
}

When loading that tokenizer, e.g.

use tokenizers::Tokenizer;

fn main() {
    // env_logger makes the warning emitted during deserialization visible.
    env_logger::init();
    let _t = Tokenizer::from_file("./my-tokenizer.json").unwrap();
}

we get a warning:

[2023-10-13T13:34:00Z WARN  tokenizers::tokenizer::serialization] Warning: Token 'my-place-holder' was expected to have ID '0' but was given ID 'None'

What is your recommendation to get rid of that warning?

One thing I can do is also add the token to model.vocab. But 1. I don't know how to do this aside from manually editing the json file (which feels awkward) and 2. conceptually, not having an ID for that token makes sense for our use-case, as it is only in there to get to a certain vocab size.

ArthurZucker commented 9 months ago

Mmm, the token, when added, should be added at the end of the vocab, not the beginning. In that case, since the vocab is empty, it is probably erroring out. Is the actual use case also with an empty vocab?
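
For example, with a toy model that already has a few entries in its vocab (a rough sketch, not your setup; the vocab contents here are made up), the added special token should get the next free id, i.e. it lands at the end:

use std::collections::HashMap;

use tokenizers::models::bpe::BPE;
use tokenizers::{AddedToken, Tokenizer};

fn main() {
    // Toy vocab just for illustration.
    let vocab: HashMap<String, u32> = [("[UNK]", 0), ("a", 1), ("b", 2)]
        .into_iter()
        .map(|(token, id)| (token.to_string(), id))
        .collect();
    let bpe = BPE::builder()
        .vocab_and_merges(vocab, vec![])
        .unk_token("[UNK]".to_string())
        .build()
        .unwrap();

    let mut tokenizer = Tokenizer::new(bpe);
    tokenizer.add_special_tokens(&[AddedToken::from("my-place-holder", true)]);

    // The model vocab has 3 entries, so the added token should get id 3.
    println!("{:?}", tokenizer.token_to_id("my-place-holder"));
}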

jneuff commented 8 months ago

The actual use-case is not with an empty vocab. But the result is the same: the token is added to added_tokens and not to model.vocab, and we get the warning from above.

For a token that is just there to pad the vocab size, that makes sense. But that is probably not the most common reason to use add_special_tokens. So the result seems really strange to me.

I'd like to understand:

  1. Why are added special tokens not added to model.vocab ?
  2. Is there a way to add tokens just for padding? For this it would be fine or even intended that they don't get added to model.vocab.

Either way, when using the canonical methods to do this, I'd like to not end up with a tokenizer that produces warnings.

jneuff commented 8 months ago

I tried to find out if there is any difference in tokenization/detokenization behavior depending on the added tokens being part of the model.vocab or not. As far as I can see, it does not make any difference at all.

So my conclusion would be that add_special_tokens should definitely add the tokens to the model.vocab.
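
For reference, the kind of check I have in mind on the encoding side looks roughly like this (a sketch; the two file names are placeholders for a copy with the token duplicated into model.vocab and a copy where it only appears under added_tokens):

use tokenizers::Tokenizer;

fn main() {
    // Placeholder file names for the two variants being compared.
    let with_vocab_entry = Tokenizer::from_file("./tokenizer-with-vocab-entry.json").unwrap();
    let added_only = Tokenizer::from_file("./tokenizer-added-only.json").unwrap();

    let sample = "some text containing my-place-holder somewhere";
    let a = with_vocab_entry.encode(sample, false).unwrap();
    let b = added_only.encode(sample, false).unwrap();

    // As far as I can tell, ids and tokens come out identical either way.
    assert_eq!(a.get_ids(), b.get_ids());
    assert_eq!(a.get_tokens(), b.get_tokens());
}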

ArthurZucker commented 8 months ago

Oh sorry, I think I might understand what is going on here. Because the index in the saved file is 0, and the length of the vocab is probably bigger, it's gonna look up vocab[0] and probably see that it is a different token. When adding tokens, they should usually be added at the end of the vocab. If you use tokenizer.add_tokens(["hey"]), the only way for the index of this token to be < vocab size is if it's already part of the vocab.
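
Concretely, you can see the mismatch the warning is complaining about by querying the model vocab without the added tokens (a sketch against the file from the first post):

use tokenizers::Tokenizer;

fn main() {
    let tokenizer = Tokenizer::from_file("./my-tokenizer.json").unwrap();

    // The saved added token claims id 0, but the model vocab itself (queried
    // without added tokens) has no entry for it -- hence "was given ID 'None'".
    let model_vocab = tokenizer.get_vocab(false);
    println!("{:?}", model_vocab.get("my-place-holder"));
}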

How did you add the tokens?

jneuff commented 8 months ago

For the above example, I used this code:

use tokenizers::{AddedToken, Tokenizer};

fn main() {
    let mut t = Tokenizer::from_file("./my-tokenizer.json").unwrap();
    // Register the placeholder as a special token; it ends up in added_tokens only.
    t.add_special_tokens(&[AddedToken::from("my-place-holder", true)]);
    std::fs::write("./my-tokenizer.json", t.to_string(true).unwrap()).unwrap();
}

The vocab was empty before, so the token was not already present. With the tokenizer that actually caused this problem, the vocab was not empty before, but the token was definitely not already present.

So is this a bug in add_special_tokens? Should it not add the token to the vocab as well?

jneuff commented 8 months ago

This gist is how I fixed our tokenizer to get rid of the warning.
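
In essence (this is a rough sketch of the approach rather than the exact gist code, and it assumes serde_json as a dependency), it rewrites the saved file so that every entry from added_tokens also shows up in model.vocab under its saved id:

use serde_json::Value;

fn main() {
    let raw = std::fs::read_to_string("./my-tokenizer.json").unwrap();
    let mut tokenizer: Value = serde_json::from_str(&raw).unwrap();

    // Copy each added token into model.vocab so that deserialization finds the
    // id it expects (assumes the saved ids do not collide with existing ones).
    let added = tokenizer["added_tokens"].as_array().cloned().unwrap_or_default();
    let vocab = tokenizer["model"]["vocab"].as_object_mut().unwrap();
    for token in added {
        let content = token["content"].as_str().unwrap().to_string();
        vocab.entry(content).or_insert(token["id"].clone());
    }

    std::fs::write(
        "./my-tokenizer.json",
        serde_json::to_string_pretty(&tokenizer).unwrap(),
    )
    .unwrap();
}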

github-actions[bot] commented 7 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.