huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

How can we ignore special tokens when encoding text? #1368

Closed DOGEwbx closed 5 months ago

DOGEwbx commented 8 months ago

Hi, I'm using the tokenizer for pretraining these days, and I added some special tokens that I want to disable when encoding the pretraining text. I tried add_special_tokens=False on tokenizer.encode, but it didn't work. Any ideas or solutions would be appreciated.
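For context, add_special_tokens=False on Tokenizer.encode only controls whether the post-processor inserts its template tokens (for example a leading [CLS]); special tokens that already appear in the input text are still matched whole. A minimal sketch with a toy word-level setup, where the vocab and token names are made up for illustration:

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Toy word-level tokenizer, just to show what add_special_tokens controls.
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)
tok.add_special_tokens(["[CLS]", "[SEP]"])

print(tok.encode("hello world").ids)                            # [1, 3, 4, 2]
print(tok.encode("hello world", add_special_tokens=False).ids)  # [3, 4]
# A special token typed inside the input text is still matched as one token:
print(tok.encode("[CLS] hello", add_special_tokens=False).ids)  # [1, 3]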

ArthurZucker commented 7 months ago

This is not supported yet, but an encode_special_tokens option will be added soon!

charchit7 commented 7 months ago

Hey @DOGEwbx, could you share some scripts/examples here? I would love to try for a solution if @ArthurZucker approves.

DOGEwbx commented 6 months ago

Hi @charchit7, thanks for your kind help, here is a simple example.

from tokenizers import Tokenizer, decoders, models, normalizers, pre_tokenizers, trainers

# Train a small Unigram tokenizer with a few special tokens.
tokenizer = Tokenizer(models.Unigram())
tokenizer.normalizer = normalizers.NFKC()
trainer = trainers.UnigramTrainer(
    vocab_size=200,
    special_tokens=["<PAD>", "<BOS>", "<EOS>"],
)
# First few lines of the "Zen of Python" https://www.python.org/dev/peps/pep-0020/
data = [
    "Beautiful is better than ugly.",
    "Explicit is better than implicit.",
    "Simple is better than complex.",
    "Complex is better than complicated.",
    "Flat is better than nested.",
    "Sparse is better than dense.",
    "Readability counts.",
    "<<<<<>>>>>>",
]
tokenizer.train_from_iterator(data, trainer=trainer)

# Before "<is>" is registered as a special token, it gets split into regular pieces.
print(tokenizer.encode("<is>").ids)
# After registering it, the whole string is matched as a single special token.
tokenizer.add_special_tokens(["<is>"])
print(tokenizer.encode("<is>").ids)

It will output:

[45, 11, 12, 48]
[51]

What I would like is a flag that can be passed to tokenizer.encode so that special tokens are ignored during encoding, e.g.:

print(tokenizer.encode("<is>", ignore_special=True).ids)
# output [45, 11, 12, 48] instead of [51]
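In the meantime, one possible workaround (just a sketch reusing the trained tokenizer from the example above; the file name is arbitrary) is to save the tokenizer right after training, before calling add_special_tokens, and keep that plain copy around for encoding the pretraining text:

tokenizer.save("plain_tokenizer.json")  # saved before add_special_tokens(["<is>"]) is called
plain = Tokenizer.from_file("plain_tokenizer.json")

tokenizer.add_special_tokens(["<is>"])
print(plain.encode("<is>").ids)      # "<is>" split into regular pieces, e.g. [45, 11, 12, 48]
print(tokenizer.encode("<is>").ids)  # "<is>" matched as one special token, e.g. [51]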

ArthurZucker commented 6 months ago

#1419 will fix this 😉 it's long overdue, sorry for that!
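For reference, if the linked PR lands as described, the option looks like a tokenizer-level setting rather than a per-call flag. A rough sketch, continuing the Unigram example above and assuming the encode_special_tokens property name mentioned earlier (the exact name and semantics should be checked against the PR once merged):

tokenizer.encode_special_tokens = True   # assumption: True means special tokens are encoded like regular text
print(tokenizer.encode("<is>").ids)      # "<is>" split into regular pieces again
tokenizer.encode_special_tokens = False  # restore the default behaviour
print(tokenizer.encode("<is>").ids)      # back to the single special-token id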

charchit7 commented 6 months ago

Ahh, thanks @ArthurZucker! Let me know if I can contribute!

ArthurZucker commented 6 months ago

Sure, when ready I'll ping you for some testing if you want! 😉

charchit7 commented 6 months ago

Sure, @ArthurZucker happy to help!

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.