huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

add option to skip special tokens #1419

Closed · ArthurZucker closed 8 months ago

ArthurZucker commented 9 months ago

Allow skipping special tokens when encoding

Fixes #1347, fixes #1391, fixes #1368

HuggingFaceDocBuilderDev commented 9 months ago

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker commented 9 months ago

This works as expected for now:

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> tokenizer.tokenize("<|endoftext|>")  # special token kept intact by default
['<|endoftext|>']

>>> tokenizer._tokenizer.encode_special_tokens = True
>>> tokenizer.tokenize("<|endoftext|>")  # now split like regular text
['<', '|', 'end', 'of', 'text', '|', '>']

The goal is to support passing this as a kwarg, similarly to the slow tokenizers. This way you can both save it and activate it in a __call__.
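A sketch of what that usage could look like, assuming the kwarg ends up with the same name as on the slow tokenizers (split_special_tokens; the exact name here is an assumption, not final API):

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> # opt in per call; the special token is then split like regular text
>>> tokenizer.tokenize("<|endoftext|>", split_special_tokens=True)
['<', '|', 'end', 'of', 'text', '|', '>']
>>> # or set it at load time so it is saved along with the tokenizer config
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2", split_special_tokens=True)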

ArthurZucker commented 9 months ago

TODO:

ArthurZucker commented 8 months ago

Before merging I just want to add a getter, and make sure it can be set directly with tokenizer.encode_special_tokens = True.
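A minimal sketch of that getter/setter, assuming the property is exposed on the tokenizers Tokenizer object as described (the False default is an assumption, matching the current behavior of keeping special tokens intact):

>>> from tokenizers import Tokenizer
>>> tok = Tokenizer.from_pretrained("gpt2")
>>> tok.encode_special_tokens  # getter; assumed to default to False
False
>>> tok.encode_special_tokens = True  # setter, no helper method needed
>>> tok.encode("<|endoftext|>").tokens
['<', '|', 'end', 'of', 'text', '|', '>']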

ArthurZucker commented 8 months ago

PR is not on the correct branch lol