📚 The doc issue

I want to continue pretraining LLaMA-2 with my own domain data, which is about 1 billion tokens. To avoid catastrophic forgetting, I should mix in some general pretraining data, but Meta has not released its pretraining corpus. The Colossal-LLaMA-2 README describes a knowledge replay stage:

> Knowledge is replayed through a question-answering (QA) mechanism, encompassing both the Chinese and English domains.

Could you share the details of this step and which dataset you used? Thank you.
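For context, the data-mixing side of what I have in mind looks roughly like the sketch below. Everything in it is my own assumption, not your documented pipeline: the file names are placeholders, the 0.3 replay ratio is arbitrary, and an open corpus such as SlimPajama would only be a stand-in for Meta's unreleased pretraining data.

```python
import json
import random

def load_jsonl(path):
    """Yield the 'text' field from a JSONL file, one document per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)["text"]

def mixed_stream(domain_texts, replay_texts, replay_ratio=0.3, seed=0):
    """Interleave domain documents with general-domain replay documents.

    At each step, draw a replay document with probability `replay_ratio`,
    otherwise draw a domain document; stop when either source runs out.
    """
    rng = random.Random(seed)
    domain_it, replay_it = iter(domain_texts), iter(replay_texts)
    while True:
        source = replay_it if rng.random() < replay_ratio else domain_it
        try:
            yield next(source)
        except StopIteration:
            return

if __name__ == "__main__":
    # Placeholder paths: "domain.jsonl" is my 1B-token corpus;
    # "replay.jsonl" would come from an open general corpus.
    stream = mixed_stream(load_jsonl("domain.jsonl"), load_jsonl("replay.jsonl"))
    for text in stream:
        pass  # feed `text` into the tokenization / packing pipeline
```

Is something like this ratio-based mixing what the knowledge replay stage does, or is the QA mechanism a different process entirely?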