openlm-research / open_llama

OpenLLaMA, a permissively licensed open source reproduction of Meta AI’s LLaMA 7B trained on the RedPajama dataset
Apache License 2.0

Question about data ratio #84

Closed siriusctrl closed 11 months ago

siriusctrl commented 11 months ago

Hi there, thanks for the amazing work. Do you mind sharing the data ratio (like XGen) for reproducing the pre-training results?

[attached screenshot: data ratio table]
itsliupeng commented 11 months ago

I suppose it's similar to what is used in the LLaMA paper, isn't it?

[attached screenshot: pre-training data mixture table from the LLaMA paper]

siriusctrl commented 11 months ago

> I suppose it's similar to what is used in the LLaMA paper, isn't it?

Thanks for sharing. Quoting from the README:

> is trained on a mixture of Falcon refined-web dataset, mixed with the starcoder dataset, and the wikipedia, arxiv and books and stackexchange from RedPajama.

Just to be clear, we are using random samples from the Falcon refined-web dataset to replace CC+C4, and StarCoder to replace GitHub. But how did the sampling proportions change in this case, given that the released version of Falcon refined-web is actually smaller than what was used in LLaMA? Can I assume that OpenLLaMA goes through the Falcon dataset for more epochs?

young-geng commented 11 months ago

For the v1 model, we simply combined the partitions of RedPajama. We did not oversample any partitions. For the v2 model, we simply replaced the CC part of RedPajama with Falcon refined-web and the GitHub part with StarCoder data. We also didn't use any oversampling for the v2 model. So in short, for both the v1 and v2 models, we never trained for more than 1 epoch on any subset.
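
To make the no-oversampling setup concrete, below is a minimal sketch (not the authors' actual training pipeline) of a mixture loader in which every subset contributes each of its documents at most once, so no partition is seen for more than 1 epoch. The subset names and toy documents are placeholders.

```python
# Minimal sketch (not the authors' training code) of a no-oversampling mixture:
# every subset contributes each of its documents exactly once (<= 1 epoch),
# and documents are drawn in a globally shuffled order.
import random
from typing import Dict, Iterator, List

def one_epoch_mixture(subsets: Dict[str, List[str]], seed: int = 0) -> Iterator[str]:
    """Yield every document from every subset exactly once, shuffled globally."""
    rng = random.Random(seed)
    pool = [doc for docs in subsets.values() for doc in docs]
    rng.shuffle(pool)
    yield from pool

# Toy placeholder documents standing in for the v2 subsets described above.
subsets = {
    "falcon-refinedweb": ["doc_a", "doc_b"],  # replaces RedPajama CC/C4
    "starcoder":         ["doc_c"],           # replaces RedPajama GitHub
    "rp-wikipedia":      ["doc_d"],
    "rp-books":          ["doc_e"],
    "rp-arxiv":          ["doc_f"],
    "rp-stackexchange":  ["doc_g"],
}

for doc in one_epoch_mixture(subsets):
    pass  # tokenize and pack into training sequences here
```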

itsliupeng commented 11 months ago

Thank you for your response. However, after running the OpenLLaMA tokenizer on the public Falcon dataset, I only obtained 575B tokens. If I just train for one epoch, I can't reach a total of 1T tokens.
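
For reference, a token count like the one above could be reproduced along these lines with the Hugging Face datasets and transformers libraries; this is a hedged sketch, not the script used in this thread, and the "content" column name for Falcon RefinedWeb is an assumption.

```python
# Hedged sketch of how the ~575B figure could be reproduced with the Hugging Face
# libraries; not the exact script used in this thread. Streaming the full dataset
# this way is slow and is shown only to illustrate the counting.
from datasets import load_dataset
from transformers import AutoTokenizer

# use_fast=False because the OpenLLaMA README warns against the auto-converted fast tokenizer.
tokenizer = AutoTokenizer.from_pretrained("openlm-research/open_llama_7b_v2", use_fast=False)
# The text column is assumed to be "content" for tiiuae/falcon-refinedweb.
dataset = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

total_tokens = 0
for example in dataset:
    total_tokens += len(tokenizer(example["content"])["input_ids"])

print(f"~{total_tokens / 1e9:.1f}B tokens")
```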

young-geng commented 11 months ago

> Thank you for your response. However, after running the OpenLLaMA tokenizer on the public Falcon dataset, I only obtained 575B tokens. If I just train for one epoch, I can't reach a total of 1T tokens.

That's why we have the wiki, books, arxiv, stackexchange from RedPajama and StarCoder data in v2.

itsliupeng commented 11 months ago

> > Thank you for your response. However, after running the OpenLLaMA tokenizer on the public Falcon dataset, I only obtained 575B tokens. If I just train for one epoch, I can't reach a total of 1T tokens.
>
> That's why we have the wiki, books, arxiv, stackexchange from RedPajama and StarCoder data in v2.

In the RedPajama dataset, the content from Wikipedia, books, arXiv, and StackExchange is quite limited. Hence, did you utilize the entire StarCoder dataset, instead of just the 4.5% code proportion mentioned in the LLaMA paper?

young-geng commented 11 months ago

We used the entire StarCoder dataset. Everything was trained for 1 epoch.
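
Putting the numbers in this thread together gives a rough picture of the implied v2 mixture. Only the 575B Falcon count comes from this thread; the StarCoder and RedPajama subset sizes below are approximate public figures (assumptions), not official OpenLLaMA statistics, and exact counts depend on the tokenizer.

```python
# Back-of-envelope check of the implied v2 mixture. Only the Falcon figure comes
# from this thread; StarCoder and RedPajama subset sizes are rough public numbers
# (assumptions), not official OpenLLaMA counts.
approx_tokens_b = {
    "falcon-refinedweb": 575,  # reported above with the OpenLLaMA tokenizer
    "starcoder": 250,          # assumed rough size of the full StarCoder data
    "rp-wikipedia": 24,        # RedPajama subset sizes as commonly reported
    "rp-books": 26,
    "rp-arxiv": 28,
    "rp-stackexchange": 20,
}

total = sum(approx_tokens_b.values())
for name, tokens in approx_tokens_b.items():
    print(f"{name:20s} {tokens:4d}B  ~{100 * tokens / total:.1f}%")
print(f"{'total':20s} {total:4d}B")  # roughly the ~1T-token budget; code well above 4.5%
```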

itsliupeng commented 11 months ago

Thanks, I see.