huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Question: what is the add_special_tokens parameter of Tokenizer::encode? #1375

Closed by EricLBuehler 8 months ago

EricLBuehler commented 8 months ago

As stated above, what does the parameter add_special_tokens do? Does it add bos/eos tokens? Thanks!

ArthurZucker commented 8 months ago

It uses the template processing saved in the tokenizer file to add special tokens (such as BOS/EOS) at the beginning and/or end of the sequence, depending on the function used. A good example is in codellama.
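To make the effect of the flag concrete, here is a minimal toy sketch in plain Python. It is not the library's implementation: the helper `apply_template`, the `special_ids` mapping, the template string, and the token ids are all hypothetical, chosen only to illustrate how a single-sequence template could wrap the encoded ids when add_special_tokens is true and leave them untouched when it is false.

```python
# Toy model of template-based post-processing (NOT the tokenizers library code).
# All names and ids below are hypothetical and chosen for illustration only.

def apply_template(template, ids_a, special_ids, add_special_tokens=True):
    """Build the final id list from a whitespace-separated template.

    "$A" stands for the encoded input sequence; every other piece is
    treated as a special token and looked up in `special_ids`.
    """
    out = []
    for piece in template.split():
        if piece == "$A":
            # The input ids are always kept.
            out.extend(ids_a)
        elif add_special_tokens:
            # Special tokens are only emitted when the flag is set.
            out.append(special_ids[piece])
    return out


special_ids = {"<s>": 1, "</s>": 2}   # assumed BOS/EOS ids
ids = [15043, 3186]                   # pretend ids for the input text

with_special = apply_template("<s> $A </s>", ids, special_ids, True)
without_special = apply_template("<s> $A </s>", ids, special_ids, False)

print(with_special)     # BOS and EOS ids wrap the input ids
print(without_special)  # only the input ids
```

In the library itself the flag is simply passed to encode, for example `tokenizer.encode(text, add_special_tokens=True)` in Python. Which tokens are added, and where, depends on the post-processor stored in the tokenizer file.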

EricLBuehler commented 8 months ago

Ok, thanks! Does this apply to the Rust API, too? I am developing candle_llm_dataset for Candle, and so I need to know this for a from_iter method.

ArthurZucker commented 8 months ago

Pretty sure it does, yes 😉 See this flag. It's just not implemented the same way for the slow tokenizers in transformers, but that should not be a problem. See the docs on template processors.

EricLBuehler commented 8 months ago

Ok, great. Thanks!