Reminder
[X] I have read the README and searched the existing issues.
System Info
I'm working on the LLaMA-Factory project with a private, company-specific document (~1 million tokens) and need the model to answer questions based solely on this data, without relying on context injection, RAG, or any external tools. To achieve this, I ran continued pretraining on a base model to incorporate the new knowledge. After training for over 50 epochs, the model began to show knowledge from my dataset, e.g. it correctly answered "What is the title of the document?" with the expected title. However, it struggled with other questions, such as "What does this terminal command do?", and it always generated garbage text at the end of its responses until the context window was exhausted (possibly because I used a base model for pretraining).
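For reference, the continued-pretraining data preparation looked roughly like this (the file names, chunking scheme, and the config/CLI details in the comments are placeholders rather than my exact values):

```python
import json

# Split the ~1M-token document into plain-text chunks; the pretraining stage
# only needs a "text" field per record (mirroring the bundled c4_demo data).
with open("company_doc.txt", encoding="utf-8") as f:
    raw = f.read()

chunk_size = 4000  # characters per chunk; placeholder value
records = [{"text": raw[i:i + chunk_size]} for i in range(0, len(raw), chunk_size)]

with open("data/company_doc_pt.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# Training was then launched along the lines of:
#   llamafactory-cli train company_pt.yaml
# where company_pt.yaml sets stage: pt, the base model path, and ~50 epochs.
```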
To address this, I tried converting the continued-pretrained model into an instruct/chat model through supervised fine-tuning. While this stopped the garbage text generation, it also caused the model to forget some of the dataset-specific knowledge it had acquired during pretraining.
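The SFT stage was configured roughly as follows (the key names follow the LLaMA-Factory example configs; the model path, template, and hyperparameters are placeholders, not my exact values):

```python
import yaml

# Sketch of the SFT config used to turn the continued-pretrained checkpoint
# into a chat model. All values below are placeholders.
sft_config = {
    "model_name_or_path": "output/company_pt",  # checkpoint from the pretraining stage
    "stage": "sft",
    "do_train": True,
    "finetuning_type": "full",
    "dataset": "company_doc_sft",   # Alpaca-format Q&A pairs about the document
    "template": "llama3",           # placeholder chat template
    "cutoff_len": 2048,
    "num_train_epochs": 3,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 8,
    "learning_rate": 1.0e-5,
    "output_dir": "output/company_sft",
    "bf16": True,
}

with open("company_sft.yaml", "w") as f:
    yaml.safe_dump(sft_config, f, sort_keys=False)

# Launched with: llamafactory-cli train company_sft.yaml
```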
My question is: How can I ensure the model retains knowledge from my private dataset?
I used the C4 format for pretraining and the Alpaca format for fine-tuning.
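Concretely, the data records looked roughly like this (the contents are invented placeholders, and the dataset_info.json column mapping is my reading of the bundled demo datasets, so treat it as an assumption):

```python
import json

# C4-style pretraining record: one "text" field per sample.
pt_record = {"text": "Example Company Handbook, revision 3. Section 1 ..."}

# Alpaca-style fine-tuning record.
sft_record = {
    "instruction": "What is the title of the document?",
    "input": "",
    "output": "Example Company Handbook, revision 3.",
}

# How I registered both files in data/dataset_info.json (column mapping assumed
# from the c4_demo / alpaca_en_demo entries shipped with LLaMA-Factory).
dataset_info = {
    "company_doc_pt": {
        "file_name": "company_doc_pt.json",
        "columns": {"prompt": "text"},
    },
    "company_doc_sft": {
        "file_name": "company_doc_sft.json",
    },
}
print(json.dumps(dataset_info, indent=2))
```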
Any guidance on resolving these issues would be greatly appreciated.
Reproduction
Expected behavior
No response
Others
No response