📚 The doc issue

I want to continue pretraining LLaMA-2 with my own domain data, which is about 1 billion tokens. To avoid catastrophic forgetting, I should mix in some general pretraining data, but Meta has not released its pretraining corpus. The Colossal-LLaMA-2 README describes a knowledge replay stage:

> Knowledge is replayed through a question-answering (QA) mechanism, encompassing both the Chinese and English domains.

Could you share the details of this step and which dataset you used? Thank you.
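For context, the data-mixing side of what I have in mind looks roughly like the sketch below. Everything in it is my own assumption, not your documented pipeline: the file names are placeholders, the 0.3 replay ratio is arbitrary, and an open corpus such as SlimPajama would only be a stand-in for Meta's unreleased pretraining data.

```python
import json
import random

def load_jsonl(path):
    """Yield the 'text' field from a JSONL file, one document per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]

def mixed_stream(domain_texts, replay_texts, replay_ratio=0.3, seed=0):
    """Interleave domain documents with general-domain replay documents.

    At each step, draw a replay document with probability `replay_ratio`,
    otherwise draw a domain document; stop when either source runs out.
    """
    rng = random.Random(seed)
    domain_it, replay_it = iter(domain_texts), iter(replay_texts)
    while True:
        source = replay_it if rng.random() < replay_ratio else domain_it
        try:
            yield next(source)
        except StopIteration:
            return

if __name__ == "__main__":
    # Placeholder paths: "domain.jsonl" is my 1B-token corpus;
    # "replay.jsonl" would come from an open general corpus.
    stream = mixed_stream(load_jsonl("domain.jsonl"), load_jsonl("replay.jsonl"))
    for text in stream:
        pass  # feed `text` into the tokenization / packing pipeline
```

Is something like this ratio-based mixing what the knowledge replay stage does, or is the QA mechanism a different process entirely?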