huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Error when loading tokenizer from a file: data did not match any variant of untagged enum ModelWrapper #1297

Closed · delgermurun closed 1 year ago

delgermurun commented 1 year ago

Here is the reproducible script:

from tokenizers.trainers import BpeTrainer
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Split

# https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
t = """First Citizen:
Before we proceed any further, hear me speak.

..."""

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000, min_frequency=2)
tokenizer.pre_tokenizer = Split(r"\w+|[^\w\s]+", behavior="isolated")  # keep each word or punctuation run as its own token

tokenizer.train_from_iterator(
    iterator=[t],
    trainer=trainer,
)

tokenizer.save("tokenizer.json")

It works fine if I use the trained tokenizer directly (without loading it from the file):

print(tokenizer.encode("""especially       against Caius Marcius?

All:
Against""").tokens)

Output: ['es', 'p', 'ec', 'i', 'all', 'y ', ' ', ' ', ' ', ' ', ' ', ' a', 'gainst ', 'Caius Marc', 'i', 'us', '?\n\nAll:\n', 'A', 'gain', 'st']

But loading the tokenizer from the file fails.

tokenizer = Tokenizer.from_file("tokenizer.json")
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[88], line 1
----> 1 tokenizer = Tokenizer.from_file("tokenizer.json")

Exception: data did not match any variant of untagged enum ModelWrapper at line 382 column 3

Version: tokenizers==0.13.3
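
One way to see that the saved file is structurally valid JSON, and that the failure happens only when deserializing it back into a Tokenizer, is a quick check like this (an illustrative sketch; it assumes tokenizer.json was produced by the script above):

import json
from tokenizers import Tokenizer

# The saved file parses fine as plain JSON...
with open("tokenizer.json") as f:
    data = json.load(f)
print(data["model"]["type"])  # prints "BPE"

# ...but rebuilding a Tokenizer from it raises the ModelWrapper error.
try:
    Tokenizer.from_file("tokenizer.json")
except Exception as e:
    print(e)  # data did not match any variant of untagged enum ModelWrapper ...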

delgermurun commented 1 year ago

#909 works for me! I'll go with this PR for now. Thank you @Narsil.

Narsil commented 1 year ago

Perfect, closing this for now.

Once the awesome model you're building gets merged into transformers, we'll merge #909 to get it included!

jiaohuix commented 7 months ago
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = Whitespace()

It works for me.
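
For reference, a minimal end-to-end sketch of this workaround (the corpus is just a placeholder snippet); with the Whitespace pre-tokenizer in place of the custom Split, the saved file loads back without the ModelWrapper error:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Placeholder corpus; any text works here.
t = """First Citizen:
Before we proceed any further, hear me speak."""

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(special_tokens=["[UNK]"], vocab_size=1000, min_frequency=2)
tokenizer.train_from_iterator(iterator=[t], trainer=trainer)

tokenizer.save("tokenizer.json")

# Reloading now succeeds.
reloaded = Tokenizer.from_file("tokenizer.json")
print(reloaded.encode("especially against Caius Marcius?").tokens)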