huggingface / tokenizers

💥 Fast State-of-the-Art Tokenizers optimized for Research and Production
https://huggingface.co/docs/tokenizers
Apache License 2.0

Outputting many different tokenizer vocab sizes for comparisons #1445

Closed: pierrj closed this issue 4 months ago

pierrj commented 5 months ago

Hello,

I would like to generate vocabularies with different vocab_sizes so that I can compare their effects on downstream predictions. My dataset is quite large, so training the tokenizer takes a long time (reasonable for a single vocab size, but too long for testing many). From my understanding, BPE merges are learned sequentially, so retraining the tokenizer from scratch at each vocab size repeats the exact same merge steps. Has anyone tried something like this before, and do you have any suggestions?

Here's what I have so far:

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

# vocab_sizes: list of target vocabulary sizes; lst: iterable of training texts
for vocab_size in vocab_sizes:
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    trainer = BpeTrainer(
        special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
        vocab_size=vocab_size,
        min_frequency=0,
        show_progress=True,
    )
    tokenizer.train_from_iterator(iterator=lst, trainer=trainer)
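
Since every BPE merge adds exactly one token, one rough sketch of what I mean (untested, and not an official tokenizers API) would be to train a single tokenizer at the largest vocab_size, save it, and derive the smaller vocabularies by truncating the saved file. This assumes token ids in the saved JSON follow creation order (special tokens and alphabet first, then merged tokens), and truncate_bpe is just a made-up helper name for illustration:

import json
from tokenizers import Tokenizer

def truncate_bpe(path_in, path_out, target_size):
    """Derive a smaller BPE tokenizer by truncating a larger saved one."""
    with open(path_in, "r", encoding="utf-8") as f:
        data = json.load(f)

    model = data["model"]
    # Token ids are assigned in creation order, so dropping every entry with
    # id >= target_size removes exactly the tokens produced by the last merges.
    model["vocab"] = {tok: i for tok, i in model["vocab"].items() if i < target_size}
    kept = set(model["vocab"])

    # Keep only merges whose result still exists in the truncated vocab.
    # Depending on the tokenizers version, a merge is stored as "a b" or ["a", "b"];
    # this assumes individual tokens contain no literal spaces.
    def merged(m):
        a, b = m.split(" ", 1) if isinstance(m, str) else m
        return a + b

    model["merges"] = [m for m in model["merges"] if merged(m) in kept]

    with open(path_out, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False)
    return Tokenizer.from_file(path_out)

# Usage sketch: train once at the largest size, then truncate for the others.
tokenizer.save("tokenizer_max.json")
for vocab_size in sorted(vocab_sizes)[:-1]:
    truncate_bpe("tokenizer_max.json", f"tokenizer_{vocab_size}.json", vocab_size)
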
github-actions[bot] commented 4 months ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

ArthurZucker commented 3 months ago

I think you can use the approach shown in https://github.com/huggingface/transformers/blob/76a33a10923ccc1074917f6b6a1e719e626b7dc9/tests/tokenization/test_tokenization_fast.py#L121

The API is more or less defined in transformers.
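
For reference, here is a minimal sketch of that transformers-side API, assuming the linked test exercises PreTrainedTokenizerFast.train_new_from_iterator (the checkpoint name is only a placeholder; lst and vocab_sizes are the same variables as in the snippet above):

from transformers import AutoTokenizer

# Any fast (Rust-backed) tokenizer works as the starting point.
base = AutoTokenizer.from_pretrained("bert-base-uncased")

for vocab_size in vocab_sizes:
    # Retrains the underlying tokenizers model on the new corpus while keeping
    # the original pipeline (normalizer, pre-tokenizer, special tokens) intact.
    new_tok = base.train_new_from_iterator(lst, vocab_size=vocab_size)
    new_tok.save_pretrained(f"tokenizer_{vocab_size}")

Note that this still retrains from scratch for each vocab size, so it answers the API question rather than the runtime concern raised above.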