Closed: zhmzm closed this issue 3 months ago
Hi @zhmzm,
Thanks for your attention! I recently used LLaMA-Factory to train LLaMA3-8B with gradient ascent (GA), and I also encountered an OOM error even with batch_size=1. This issue might be caused by the training framework itself, since I made only minimal changes to LLaMA-Factory for the GA method.
I recommend using two A100 GPUs for training GA and RT, and three to four A100 GPUs for training DPO and NPO.
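For context, the GA objective is essentially the standard causal-LM loss with its sign flipped on the forget set. Here is a minimal sketch, assuming a plain Hugging Face setup; the model name, optimizer, and forget_batch below are placeholders, not the exact code in this repo:

```python
# Minimal sketch of a gradient-ascent (GA) step on the forget set, assuming a
# standard Hugging Face causal-LM setup; model name, optimizer, and forget_batch
# are placeholders, not the exact code used in this repo.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", torch_dtype=torch.bfloat16
).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def ga_step(forget_batch):
    # forget_batch holds input_ids, attention_mask, labels for forget-set texts
    outputs = model(**forget_batch)
    loss = -outputs.loss  # flip the sign of the LM loss to ascend on the forget set
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```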
Thanks for your reply. The problem I encountered is that I can SFT a larger model with LLaMA-Factory, but I fail to train it with GA in this repo. I will check the code.
What is the average length of your SFT data?
In LLaMA-Factory I set the maximum length to 1024 with batch size 1, while in this repo I set the maximum length to 128 with the same batch size. Are there other hyper-parameters I should consider as well? Also, do you know whether the repo includes retain sets for training, or does it only compute the loss on the forget sets?
That's strange; I didn't change the original LLaMA-Factory codebase. I just uploaded a script that injects the knowledge via the pretraining loss in the original LLaMA-Factory; you can try it and see if it works.
I didn't provide a retain corpus for training, but I plan to add a retain loss and corpus in the future. You can choose other famous people outside RWKU as retain targets, or sample some texts from WikiText.
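Roughly, the retain term would just be a plain LM loss on the retain texts added to the negated forget loss. The sketch below shows one common formulation; lambda_retain is an illustrative weight, not something this repo ships:

```python
# Sketch of a GA objective with an added retain term, assuming forget_batch and
# retain_batch are tokenized samples from the forget set and a retain corpus
# (e.g. other celebrities or WikiText); lambda_retain is only an illustrative weight.
def ga_with_retain_step(model, optimizer, forget_batch, retain_batch, lambda_retain=1.0):
    forget_loss = model(**forget_batch).loss  # LM loss on texts to forget
    retain_loss = model(**retain_batch).loss  # LM loss on texts to keep
    loss = -forget_loss + lambda_retain * retain_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return forget_loss.item(), retain_loss.item()
```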
Thanks for your assistance!
Hi, thanks for sharing the impressive code!
The computation cost of this repo is higher than expected. As LLaMA-Factory suggests, a 7B model should only require about 60 GB of GPU memory for a full fine-tune. However, it takes about 160 GB when I run full/run_ga.sh.
Which step increases the cost?
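For reference, here is my back-of-the-envelope estimate of the static memory for a full fine-tune with AdamW in bf16 mixed precision (weights, gradients, fp32 master copy, and the two optimizer moments; activations excluded):

```python
# Back-of-the-envelope memory estimate for fully fine-tuning an ~8B model with
# AdamW in bf16 mixed precision (activations and CUDA overhead excluded).
params = 8e9
bytes_total = params * (2      # bf16 weights
                        + 2    # bf16 gradients
                        + 4    # fp32 master weights
                        + 4    # Adam first moment (fp32)
                        + 4)   # Adam second moment (fp32)
print(f"~{bytes_total / 1024**3:.0f} GB before activations")  # ~119 GB
```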