sagorbrur opened this issue 6 months ago
Hi. I did a little research. I have a 41 MB dataset of texts, the "large dataset". It was collected from various sources (fiction, wiki, blogs): small pieces on different topics. From it I extracted a smaller 854 KB dataset, the "small dataset", which also covers different topics.
I trained a tokenizer with a vocabulary of 6000 tokens on each of these two datasets. Comparison results: the "large dataset" trained for 6-8 hours on an old CPU; let's take its tokenization result as the "standard". The "small dataset" trained in about 12 minutes, and its result matches the "standard" by 61.8%. By that I mean `set1.intersection(set2)` over the resulting tokens (only the tokens themselves, without their indices).
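In code, the comparison is roughly this (a sketch; the vocab file names are placeholders for however your trainer dumps its tokens, one per line):

```python
# Sketch: measure how much of the "standard" vocabulary the small-data
# vocabulary reproduces. Assumes each vocab file has one token per line.
def load_vocab(path: str) -> set[str]:
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f if line.strip()}

standard = load_vocab("vocab_large_6000.txt")  # trained on the 41 MB corpus
small = load_vocab("vocab_small_6000.txt")     # trained on the 854 KB sample

overlap = standard.intersection(small)
print(f"overlap: {len(overlap) / len(standard):.1%}")  # ~61.8% in my run
```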
More than half of the vocabulary coincides. Not a lot, but not a little either. The conclusion seems obvious: you can train a tokenizer on a small sample of the data. I think the best result will be achieved if the sample is built so that it covers all the topics present in the large dataset.
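A sketch of what I mean by covering all topics (the `corpus` structure and the 2% share are made up for illustration):

```python
import random

# Hypothetical corpus: topic label -> list of documents (strings).
corpus = {
    "fiction": ["...", "..."],
    "wiki": ["...", "..."],
    "blogs": ["...", "..."],
}

def stratified_sample(corpus: dict[str, list[str]], share: float = 0.02) -> str:
    """Take the same share of documents from every topic so the small
    training set covers all topics, not just the most frequent one."""
    parts = []
    for topic, docs in corpus.items():
        k = max(1, int(len(docs) * share))
        parts.extend(random.sample(docs, k))
    random.shuffle(parts)
    return "\n".join(parts)

small_training_text = stratified_sample(corpus)
```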
I am not an expert in this field; my conclusions are based on my own attempts to understand how tokenizers work.
Hi @vladimirzenin, thanks for your input. You're right, a small subset will be enough for the tokens in many cases. But for our case, Bengali is a diverse language: we have set aside ~20 GB of data to train the tokenizer so it captures the actual sub-word structure, and it seems hard to train on that much with this module. With respect to your conclusion, for Bengali, if we separate out a small portion it won't even be close to the original distribution of the words. But there might be an efficient way. Thanks again.
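A rough way to quantify that, as a sketch (the file paths are placeholders; it only checks how many of the full corpus's most frequent words appear in the sample at all):

```python
from collections import Counter

def word_counts(path: str) -> Counter:
    counts: Counter = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    return counts

full = word_counts("bengali_full.txt")      # the full corpus (placeholder path)
sample = word_counts("bengali_sample.txt")  # a candidate small subset

top_k = [w for w, _ in full.most_common(30_000)]
covered = sum(1 for w in top_k if w in sample)
print(f"sample covers {covered / len(top_k):.1%} of the top-30k words")
```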
Hi, I am trying to train tiktoken on a custom dataset (15 GB) with a 30k vocab size. It looks like it will take a very long time to finish: one vocab update took almost 8 hours. Any suggestions to make it faster? Thanks in advance.
https://github.com/openai/tiktoken/blob/c0ba74c238d18b4824c25f3c27fc8698055b9a76/tiktoken/_educational.py#L117
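For context, that link goes into the training loop of tiktoken's `_educational.py`, which, as far as I can tell, recounts pair frequencies over the whole corpus on every merge, so it is intended for teaching rather than for multi-GB data. A minimal sketch of how that educational trainer is invoked (assuming the `SimpleBytePairEncoding.train` API in that module and borrowing a split pattern from an existing encoding's private `_pat_str` attribute):

```python
# Sketch of training with tiktoken's educational BPE (slow by design;
# fine for small data, not for a 15 GB corpus).
import tiktoken
from tiktoken._educational import SimpleBytePairEncoding

# Reuse an existing encoding's split regex (private attribute, assumption).
gpt4_pattern = tiktoken.get_encoding("cl100k_base")._pat_str

with open("small_corpus.txt", encoding="utf-8") as f:  # placeholder path
    data = f.read()

enc = SimpleBytePairEncoding.train(data, vocab_size=600, pat_str=gpt4_pattern)
print(enc.encode("hello world"))
```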