Closed: ArthurZucker closed this pull request 8 months ago
Setting the flag directly on the backend tokenizer already works as expected:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> tokenizer.tokenize("<|endoftext|>")
['<|endoftext|>']
>>> tokenizer._tokenizer.encode_special_tokens = True
>>> tokenizer.tokenize("<|endoftext|>")
['<', '|', 'end', 'of', 'text', '|', '>']
The goal is to support passing this as a kwarg, similarly to the slow tokenizers. This way you can both save it and activate it in a __call__ (see the sketch below).
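To make the intent concrete, here is a rough sketch of what the user-facing API could look like once the kwarg is exposed. The kwarg name split_special_tokens is an assumption (mirroring the slow-tokenizer side) and may differ from whatever the PR finally ships; the split output shown is the one from the example above.

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("gpt2")
>>> # call-time activation (hypothetical kwarg name)
>>> enc = tokenizer("<|endoftext|>", split_special_tokens=True)
>>> tokenizer.convert_ids_to_tokens(enc["input_ids"])
['<', '|', 'end', 'of', 'text', '|', '>']
>>> # persistent activation: flip the flag once and save it with the tokenizer
>>> tokenizer.split_special_tokens = True
>>> tokenizer.save_pretrained("./gpt2-split-special-tokens")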
TODO: Before merging I just want to add a getter, and make sure we can just set it with tokenizer.encode_special_tokens = True.
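A minimal sketch of what that getter/setter could look like, assuming it becomes a property on the fast tokenizer class that simply proxies to the backend tokenizers.Tokenizer attribute. The FastTokenizerSketch class below is purely illustrative; the real change would live in PreTrainedTokenizerFast.

# Hypothetical illustration of the getter/setter described in the TODO:
# a property that forwards to the backend Rust tokenizer stored in _tokenizer,
# so users never have to reach into the private attribute.
class FastTokenizerSketch:
    def __init__(self, backend_tokenizer):
        # backend_tokenizer stands in for the tokenizers.Tokenizer instance
        self._tokenizer = backend_tokenizer

    @property
    def encode_special_tokens(self) -> bool:
        # getter: read the flag from the backend tokenizer
        return self._tokenizer.encode_special_tokens

    @encode_special_tokens.setter
    def encode_special_tokens(self, value: bool):
        # setter: forward the flag so `tokenizer.encode_special_tokens = True` just works
        self._tokenizer.encode_special_tokens = value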
The PR is not on the correct branch lol
Allow skipping special tokens when encoding
Fixes #1347, fixes #1391, fixes #1368