huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Encode special tokens #1437

Closed: ArthurZucker closed this 8 months ago

ArthurZucker commented 8 months ago

Allow skipping special-token handling when encoding, so that special tokens can be split like regular text instead of being kept whole.

Fixes #1347, fixes #1391, fixes #1368

Snippet:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> tokenizer.tokenize("<|endoftext|>")
['<|endoftext|>']

>>> tokenizer._tokenizer.encode_special_tokens = True
>>> tokenizer.tokenize("<|endoftext|>")
['<', '|', 'end', 'of', 'text', '|', '>']
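
The same switch lives on the underlying tokenizers.Tokenizer object itself (the `_tokenizer` attribute used above). As a minimal sketch of library-level usage, assuming `Tokenizer.from_pretrained` can fetch the gpt2 checkpoint from the Hub:

>>> from tokenizers import Tokenizer
>>> tok = Tokenizer.from_pretrained("gpt2")
>>> tok.encode("<|endoftext|>").tokens  # default: matched as a single special token
['<|endoftext|>']

>>> tok.encode_special_tokens = True    # split special tokens like regular text
>>> tok.encode("<|endoftext|>").tokens
['<', '|', 'end', 'of', 'text', '|', '>']
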
HuggingFaceDocBuilderDev commented 8 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.