This is not supported yet. An encode_special_tokens option will be added soon!
Hey @DOGEwbx, could you share some scripts/examples here? I would love to try working on a solution if @ArthurZucker approves.
Hi @charchit7, thanks for your kind help. Here is a simple example.
from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()
trainer = trainers.UnigramTrainer(
    vocab_size=200,
    special_tokens=["<PAD>", "<BOS>", "<EOS>"],
)

# First few lines of the "Zen of Python" https://www.python.org/dev/peps/pep-0020/
data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
    "Flat is better than nested.",
    "Sparse is better than dense.",
    "Readability counts.",
    "<<<<<>>>>>>",
]
tokenizer.train_from_iterator(data, trainer=trainer)

# "<is>" is encoded as regular text before it is registered as a special token...
print(tokenizer.encode("<is>").ids)
# ...and collapses to a single id once it is.
tokenizer.add_special_tokens(["<is>"])
print(tokenizer.encode("<is>").ids)
It will output:
[45, 11, 12, 48]
[51]
What I want is a flag that can be passed to tokenizer.encode so that special tokens are ignored during encoding:
print(tokenizer.encode("<is>", ignore_special=True).ids)
# output [45, 11, 12, 48] instead of [51]
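For anyone landing here later: recent tokenizers releases appear to expose essentially this switch as an attribute on the Tokenizer itself. This is an assumption on my part (roughly tokenizers >= 0.14; check your installed version), so treat the following as a minimal sketch continuing from the example above:
# Assumption: Tokenizer.encode_special_tokens exists (tokenizers >= 0.14).
# When True, registered special tokens are split and encoded like plain text.
tokenizer.encode_special_tokens = True
print(tokenizer.encode("<is>").ids)  # expected: the pre-registration ids, e.g. [45, 11, 12, 48]
tokenizer.encode_special_tokens = False  # default behavior: "<is>" -> [51]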
Ahh, thanks @ArthurZucker! Let me know if I can contribute!
Sure, when ready I'll ping you for some testing if you want! 😉
Sure, @ArthurZucker happy to help!
Hi, I'm using the tokenizer for pretraining these days, and I added some special tokens that I want to disable when encoding the pretraining text. I tried
add_special_tokens=False
with tokenizer.encode, but it didn't work. Any ideas or solutions would be appreciated.
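One thing worth noting: in tokenizers, the add_special_tokens argument of Tokenizer.encode only controls the post-processing step (e.g. a template that inserts <BOS>/<EOS> around the sequence); it does not stop special tokens that already appear inside the input text from being matched before the model runs. A small sketch, reusing the tokenizer trained in the example above:
# add_special_tokens=False skips the post-processor, which would otherwise
# insert tokens such as <BOS>/<EOS> around the encoded sequence.
# It does not change how "<is>" inside the input is matched, so the
# registered special token still collapses to a single id.
print(tokenizer.encode("<is>", add_special_tokens=True).ids)   # [51]
print(tokenizer.encode("<is>", add_special_tokens=False).ids)  # still [51]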