Closed velocityCavalry closed 7 months ago
It's expected, but we can and should fix it. I'll see what I can do, since those tokens aren't actually being accessed!
(It was fixed by disabling verbose in the PreTrainedTokenizerBase class.)
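A minimal sketch of that fix, using a toy word-level tokenizer with a hypothetical vocabulary as a stand-in for the real one: passing verbose=False when constructing the fast tokenizer tells PreTrainedTokenizerBase not to log the "Using sep_token, but it is not set yet" style messages.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Toy word-level tokenizer with a hypothetical vocab, standing in
# for the real tokenizer trained on wikitext103.
tok = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# verbose=False suppresses the "Using X_token, but it is not set yet"
# warnings that are otherwise logged when unset special tokens are accessed.
fast = PreTrainedTokenizerFast(tokenizer_object=tok, verbose=False)
ids = fast.encode("hello world")
```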
Hi, I trained a tokenizer from scratch for raw wikitext103 using the code:
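(The original code snippet is missing from this copy of the issue. A sketch of what training a tokenizer from scratch with the tokenizers library typically looks like is below; the BPE model, the in-memory sample lines, and the vocab size are assumptions, not the poster's actual setup.)

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical in-memory stand-in for the raw wikitext103 lines.
lines = ["the quick brown fox", "jumps over the lazy dog"]

# Train a small BPE tokenizer from scratch and save it to tokenizer.json.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=200)
tokenizer.train_from_iterator(lines, trainer)
tokenizer.save("tokenizer.json")
```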
and it was saved to tokenizer.json. However, when I was trying to follow the tutorial to load the tokenizer by doing
It gives me errors saying
Using sep_token, but it is not set yet.
Using cls_token, but it is not set yet.
Using mask_token, but it is not set yet.
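(The loading snippet is also missing from this copy. A self-contained sketch that reproduces these messages might look like the following; the tiny tokenizer built at the top is only there so the snippet runs on its own, and the sample text is an assumption.)

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace
from transformers import PreTrainedTokenizerFast

# Build a tiny tokenizer.json first so the snippet is self-contained
# (a stand-in for the file trained on wikitext103).
tok = Tokenizer(BPE(unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()
tok.train_from_iterator(["some raw text"], BpeTrainer(special_tokens=["[UNK]"]))
tok.save("tokenizer.json")

# The tutorial-style loading step. With verbose left at its default,
# accessing unset special tokens logs the "Using X_token, but it is not
# set yet" messages; they are warnings, not hard errors.
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
ids = fast_tokenizer.encode("some raw text")
```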
However, no cls, sep, or mask token was involved when I trained the tokenizer or ran the post-processing. I wonder whether this is expected behavior or whether something is wrong with my code; has anyone encountered a similar problem before?
That said, eyeballing the encoded and decoded results, everything looks fine.
Thank you so much; any help is appreciated!