huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0
8.68k stars 746 forks source link

loading `added_tokens.json` #1422

Closed kczimm closed 6 months ago

kczimm commented 6 months ago

Given a Tokenizer what is the appropriate way to add tokens from an added_tokens.json file of the format:

{
  "<\|im_end\|>": 32000,
  "<\|im_start\|>": 32001
}

I see the Tokenizer.add_tokens method. Should the user just create AddedTokens from this file? Could we make something like Tokenizer.add_tokens_from_file?

ArthurZucker commented 6 months ago

This file is not meant to be used by the tokenizers library but only the transformers library. On top of this, it was deprecated! You should add tokens to a tokenizer using the add_tokens

kczimm commented 6 months ago

Well that's good to know! Do you happen to have a link to the deprecation? I'm interested in learning what is supposed to replace it. I'll close in the meantime since as you say this does not pertain to tokenizers. Thanks!

ArthurZucker commented 6 months ago

The replacement is introduced by https://github.com/huggingface/transformers/pull/23909, the tokenizer_config.json includes the added_tokens_decoder argument!