openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

How much training data is used for the tokenizer? #34

Closed DinhLuan14 closed 1 year ago

DinhLuan14 commented 1 year ago

I have a question that I hope will be addressed:

Were the LLaMA and OpenLLaMA tokenizers trained on the entire dataset or only on a small portion of it? I couldn't find any supporting documentation on this.

young-geng commented 1 year ago

We used the 1B-token sample of the RedPajama dataset located here.

DinhLuan14 commented 1 year ago

@young-geng Why didn't you use the entire 1T-token dataset for training the tokenizer, instead of only a 1B-token sample? I noticed that tokenizer training time is relatively insignificant compared to model training.

young-geng commented 1 year ago

I remember that training on 1B tokens took at least an hour, so training on 1T tokens would take at least 40 days, which is much longer than the actual model training. People usually don't train tokenizers on that many tokens, since tokenizer training just runs byte pair encoding rather than learning an actual model.
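
For reference, this kind of tokenizer is typically trained with SentencePiece BPE on a text sample. Below is a minimal sketch of what such a training run looks like; the file names and parameter values are illustrative assumptions, not the exact OpenLLaMA configuration:

```python
import sentencepiece as spm

# Train a BPE tokenizer on a text sample.
# "redpajama_1b_sample.txt" is a hypothetical path to the sampled corpus;
# the settings below are illustrative, not OpenLLaMA's actual ones.
spm.SentencePieceTrainer.train(
    input="redpajama_1b_sample.txt",
    model_prefix="open_llama_tokenizer",
    model_type="bpe",
    vocab_size=32000,                 # LLaMA-style 32k vocabulary
    character_coverage=0.99995,
    byte_fallback=True,               # fall back to raw bytes for unseen characters
    input_sentence_size=10_000_000,   # subsample lines to bound training time
    shuffle_input_sentence=True,
)

# This produces open_llama_tokenizer.model / .vocab, which can be loaded with:
# sp = spm.SentencePieceProcessor(model_file="open_llama_tokenizer.model")
```

Because BPE only builds a merge table from token co-occurrence statistics, a representative sample of the corpus is generally enough; adding more data mostly increases training time without meaningfully changing the resulting vocabulary.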