hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[DOC]: Question about Colossal-LLaMA-2 #4985

Open fancyerii opened 1 year ago

fancyerii commented 1 year ago

📚 The doc issue

I want to continue pretraining LLaMA-2 on my own domain data, which is about 1 billion tokens. To avoid catastrophic forgetting, I should mix in some general pretraining data, but Meta has not released LLaMA-2's pretraining data. In the Colossal-LLaMA-2 README, there is a knowledge replay stage:

Knowledge is replayed through a question-answering (QA) mechanism, encompassing both the Chinese and English domains.

Could you share the details of this step and which dataset you used?
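For reference, here is a rough sketch of the kind of replay mixing I have in mind, assuming Hugging Face `datasets`; the file names and the 90/10 ratio are placeholders of mine, not anything from the Colossal-LLaMA-2 docs:

```python
# Sketch: mix a domain corpus with general "replay" data for continual
# pretraining. File names and the mixing ratio are placeholders.
from datasets import load_dataset, interleave_datasets

# My ~1B-token domain corpus (placeholder path; one {"text": ...} per line).
domain = load_dataset("json", data_files="domain_corpus.jsonl", split="train")

# General-domain replay corpus to guard against catastrophic forgetting
# (placeholder path; assumed to share the same "text" schema).
replay = load_dataset("json", data_files="replay_corpus.jsonl", split="train")

# Sample mostly domain data with some replay data mixed in; the 90/10
# split is my own guess, not a documented recommendation.
mixed = interleave_datasets(
    [domain, replay],
    probabilities=[0.9, 0.1],
    seed=42,
)
```

Thank you.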

fancyerii commented 1 year ago

Could anyone help?