We used the 1B token sample of the RedPajama dataset located here
@young-geng Why didn't you use the entire 1T-token dataset to train the tokenizer, instead of only a 1B-token sample? I noticed that tokenizer training time is relatively insignificant compared to model training.
I remember that training on 1B tokens took at least an hour, so training on 1T tokens would take at least 40 days, which is much longer than the actual model training. People usually don't train tokenizers on that many tokens, since tokenizer training only performs byte pair encoding rather than learning an actual model.
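For illustration, here is a minimal sketch of how one might train a BPE tokenizer on a sampled text file with SentencePiece (the library LLaMA's tokenizer is based on). The file paths and parameter values below are assumptions for demonstration, not the exact OpenLLaMA training configuration.

```python
# Minimal sketch: train a BPE tokenizer on a sampled subset of a corpus.
# Paths and hyperparameters are illustrative assumptions only.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="redpajama_1b_sample.txt",    # hypothetical path to the 1B-token sample
    model_prefix="open_llama_tokenizer",
    model_type="bpe",                   # byte pair encoding, as discussed above
    vocab_size=32000,                   # LLaMA-style vocabulary size (assumed)
    character_coverage=0.99995,
    byte_fallback=True,                 # fall back to raw bytes for rare characters
)

# The resulting .model file can then be loaded for encoding:
sp = spm.SentencePieceProcessor(model_file="open_llama_tokenizer.model")
print(sp.encode("Hello, world!", out_type=str))
```

Because BPE only counts and merges frequent token pairs, a representative sample of the corpus is generally sufficient; adding more text mostly increases training time without substantially changing the learned merges.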
I have a question that I hope can be addressed: do the LLaMA tokenizer and the OpenLLaMA tokenizer train on the entire dataset or only on a small portion of the data? I couldn't find any documentation on this.