hiyouga / LLaMA-Factory

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
https://arxiv.org/abs/2403.13372
Apache License 2.0

Issue with Adding Company Private Dataset Knowledge to LLM Without Context or External Tools #6034

Closed AlgoAIBoss closed 6 hours ago

AlgoAIBoss commented 6 hours ago

System Info

I'm working with the LLaMA-Factory project on a private, company-specific document (~1 million tokens) and need the model to answer questions based solely on this data, without relying on context injection, RAG, or any external tools. To achieve this, I ran continued pretraining on a base model to incorporate the new knowledge. After training for over 50 epochs, the model began displaying knowledge from my dataset, e.g. correctly responding to "What is the title of the document?" with the expected title. However, it struggled with other questions, such as "What does this terminal command do?", and it always generated garbage text at the end of responses until the context was exhausted (perhaps because I used a base model for pretraining).

To address this, I attempted to convert the pretrained base model into an instruct/chat model through fine-tuning. While this stopped the garbage text generation, it also caused the model to forget some of the dataset-specific knowledge acquired during pretraining.

My question is: how can I ensure the model retains the knowledge from my private dataset? I used the C4 (plain-text) format for pretraining and the Alpaca format for fine-tuning.
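For reference, a minimal sketch of how the two dataset formats might be prepared and registered for LLaMA-Factory. The file names, dataset names, and sample contents are illustrative placeholders, and the column mappings follow the plain-text and Alpaca conventions described in the repo's `data/README.md`; please check that file for the exact schema of your version.

```python
import json
import os

# Continued pretraining data: plain-text (C4-style) records with a single "text" field.
# The document would be split into chunks; the contents here are placeholders.
pretrain_samples = [
    {"text": "Chunk 1 of the company document ..."},
    {"text": "Chunk 2 of the company document ..."},
]

# SFT data in Alpaca format: instruction / input / output triples built from the same document.
sft_samples = [
    {
        "instruction": "What is the title of the document?",
        "input": "",
        "output": "<expected title>",
    },
    {
        "instruction": "What does this terminal command do?",
        "input": "<command from the document>",
        "output": "<explanation taken from the document>",
    },
]

os.makedirs("data", exist_ok=True)
with open("data/company_pretrain.json", "w", encoding="utf-8") as f:
    json.dump(pretrain_samples, f, ensure_ascii=False, indent=2)
with open("data/company_sft.json", "w", encoding="utf-8") as f:
    json.dump(sft_samples, f, ensure_ascii=False, indent=2)

# Entries to add to data/dataset_info.json so LLaMA-Factory can load the files.
dataset_info = {
    "company_pretrain": {
        "file_name": "company_pretrain.json",
        "columns": {"prompt": "text"},  # plain-text pretraining corpus
    },
    "company_sft": {
        "file_name": "company_sft.json",
        # Alpaca-style column mapping, shown explicitly for clarity.
        "columns": {"prompt": "instruction", "query": "input", "response": "output"},
    },
}
print(json.dumps(dataset_info, indent=2))
```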

Any guidance on resolving these issues would be greatly appreciated.

Reproduction


Expected behavior

No response

Others

No response

hiyouga commented 6 hours ago

You should try augmenting your data and using SFT only to fine-tune the model.
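A minimal sketch of what such augmentation could look like: each fact extracted from the document is expanded into several differently phrased Alpaca-style QA pairs, so the same knowledge is seen in many contexts during SFT. The facts, question templates, script name, and file names below are hypothetical; in practice the QA pairs are often generated with a stronger LLM rather than fixed templates.

```python
import json
import os

# Hypothetical (subject, answer) facts extracted from the private document.
facts = [
    ("the title of the document", "<document title>"),
    ("the purpose of the `deploy.sh` script", "<what the script does>"),
]

# Several phrasings per fact so the model sees the knowledge in varied forms.
question_templates = [
    "What is {subject}?",
    "Could you tell me {subject}?",
    "According to the company documentation, what is {subject}?",
]

augmented = []
for subject, answer in facts:
    for template in question_templates:
        augmented.append(
            {
                "instruction": template.format(subject=subject),
                "input": "",
                "output": answer,
            }
        )

os.makedirs("data", exist_ok=True)
with open("data/company_sft_augmented.json", "w", encoding="utf-8") as f:
    json.dump(augmented, f, ensure_ascii=False, indent=2)

print(f"Wrote {len(augmented)} Alpaca-format samples")
```

The resulting file would then be registered in `data/dataset_info.json` and used as the `dataset` for a standard SFT run (e.g. `llamafactory-cli train` with an SFT config).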