I think you can use the approach shown here: https://github.com/huggingface/transformers/blob/76a33a10923ccc1074917f6b6a1e719e626b7dc9/tests/tokenization/test_tokenization_fast.py#L121
The API is more or less defined in transformers.
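One relevant entry point is `train_new_from_iterator` on a fast tokenizer. A minimal sketch is below; the base checkpoint and the corpus iterator are placeholders, and note that it still retrains from scratch for each vocab_size:

```python
# Minimal sketch of transformers' train_new_from_iterator: retrain a fast
# tokenizer on your corpus at a chosen vocab_size. The base checkpoint and
# the corpus below are placeholders.
from transformers import AutoTokenizer

corpus = ["replace with an iterator over your training text"]  # placeholder

base = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer works
for size in (8000, 16000, 32000):
    new_tok = base.train_new_from_iterator(corpus, vocab_size=size)
    new_tok.save_pretrained(f"tokenizer_{size}")
```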
Hello,
I would like to generate vocabularies with different vocab_sizes so that I can compare their effects on downstream predictions. My dataset is quite large, so training the tokenizer takes a long time (reasonable for a single vocab size, but too long for testing many). From my understanding, BPE runs sequentially, so I believe I am wasting a lot of time repeating the exact same token-merging steps if I simply retrain the tokenizer from scratch at each vocab size. Has anyone tried something like this before, and do you have any suggestions?
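One idea along these lines: train a single BPE tokenizer at the largest vocab size, then derive the smaller vocabularies by truncating the saved vocab and merge list, since BPE learns merges in a fixed greedy order. Below is a minimal sketch using the HuggingFace `tokenizers` library; the corpus path, the sizes, and the JSON truncation details are assumptions and may need adjustment for other models or special-token setups.

```python
# Sketch: train BPE once at the largest vocab size, then cut smaller
# vocabularies out of the saved tokenizer instead of retraining.
# Assumptions: HuggingFace `tokenizers`, a plain-text corpus.txt
# (placeholder path), and whitespace pre-tokenization.
import copy
import json

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

MAX_VOCAB = 32000                      # train once at the largest size needed
SMALLER_SIZES = [8000, 16000, 24000]   # derived afterwards without retraining

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=MAX_VOCAB, special_tokens=["[UNK]"])
tokenizer.train(["corpus.txt"], trainer)
tokenizer.save(f"tokenizer_{MAX_VOCAB}.json")

with open(f"tokenizer_{MAX_VOCAB}.json") as f:
    full_state = json.load(f)


def merge_result(pair):
    """Token produced by a merge entry (string "a b" or list ["a", "b"])."""
    parts = pair.split(" ") if isinstance(pair, str) else pair
    return "".join(parts)


for size in SMALLER_SIZES:
    state = copy.deepcopy(full_state)
    vocab = state["model"]["vocab"]
    # Keep the `size` lowest-id tokens (special tokens and the initial
    # alphabet come first, then merge results in the order they were learned).
    kept = {tok: idx for tok, idx in vocab.items() if idx < size}
    state["model"]["vocab"] = kept
    # Drop merges whose result was cut; earlier merges only depend on
    # earlier tokens, so the prefix of the merge list stays consistent.
    state["model"]["merges"] = [
        m for m in state["model"]["merges"] if merge_result(m) in kept
    ]
    with open(f"tokenizer_{size}.json", "w") as f:
        json.dump(state, f)
    # The truncated file loads like any other tokenizer:
    # Tokenizer.from_file(f"tokenizer_{size}.json")
```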
Here's what I have so far: