huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Outputting many different tokenizer vocab sizes for comparisons #1445

Closed: pierrj closed this issue 4 months ago

pierrj commented 5 months ago

Hello,

I would like to generate vocabularies with different vocab_sizes so that I can compare their effects on downstream predictions. My dataset is quite large, so training the tokenizer takes a long time (reasonable for a single vocab size, but too long for testing many). From my understanding, BPE merges are learned sequentially, so retraining the tokenizer from scratch at each vocab size repeats the exact same merge steps. Has anyone tried something like this before, and do you have any suggestions?

Here's what I have so far:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# vocab_sizes: list of target vocabulary sizes; lst: iterable of training texts
for vocab_size in vocab_sizes:
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        vocab_size=vocab_size,
        min_frequency=0,
        show_progress=True,
    )
    tokenizer.train_from_iterator(iterator=lst, trainer=trainer)
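
Since every BPE merge adds exactly one token, one rough sketch of what I mean (untested, and not an official tokenizers API) would be to train a single tokenizer at the largest vocab_size, save it, and derive the smaller vocabularies by truncating the saved file. This assumes token ids in the saved JSON follow creation order (special tokens and alphabet first, then merged tokens), and truncate_bpe is just a made-up helper name for illustration:

import json
from tokenizers import Tokenizer

def truncate_bpe(path_in, path_out, target_size):
    """Derive a smaller BPE tokenizer by truncating a larger saved one."""
    with open(path_in, "r", encoding="utf-8") as f:
        data = json.load(f)

    model = data["model"]
    # Token ids are assigned in creation order, so dropping every entry with
    # id >= target_size removes exactly the tokens produced by the last merges.
    model["vocab"] = {tok: i for tok, i in model["vocab"].items() if i < target_size}
    kept = set(model["vocab"])

    # Keep only merges whose result still exists in the truncated vocab.
    # Depending on the tokenizers version, a merge is stored as "a b" or ["a", "b"];
    # this assumes individual tokens contain no literal spaces.
    def merged(m):
        a, b = m.split(" ", 1) if isinstance(m, str) else m
        return a + b

    model["merges"] = [m for m in model["merges"] if merged(m) in kept]

    with open(path_out, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
    return Tokenizer.from_file(path_out)

# Usage sketch: train once at the largest size, then truncate for the others.
tokenizer.save("tokenizer_max.json")
for vocab_size in sorted(vocab_sizes)[:-1]:
    truncate_bpe("tokenizer_max.json", f"tokenizer_{vocab_size}.json", vocab_size)
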
github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ArthurZucker commented 3 months ago

I think you can use the approach shown in https://github.com/huggingface/transformers/blob/76a33a10923ccc1074917f6b6a1e719e626b7dc9/tests/tokenization/test_tokenization_fast.py#L121

The API is more or less defined in transformers.
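
For reference, here is a minimal sketch of that transformers-side API, assuming the linked test exercises PreTrainedTokenizerFast.train_new_from_iterator (the checkpoint name is only a placeholder; lst and vocab_sizes are the same variables as in the snippet above):

from transformers import AutoTokenizer

# Any fast (Rust-backed) tokenizer works as the starting point.
base = AutoTokenizer.from_pretrained("bert-base-uncased")

for vocab_size in vocab_sizes:
    # Retrains the underlying tokenizers model on the new corpus while keeping
    # the original pipeline (normalizer, pre-tokenizer, special tokens) intact.
    new_tok = base.train_new_from_iterator(lst, vocab_size=vocab_size)
    new_tok.save_pretrained(f"tokenizer_{vocab_size}")

Note that this still retrains from scratch for each vocab size, so it answers the API question rather than the runtime concern raised above.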