skerit opened 1 week ago
@skerit Apologies on the delay - it's best to ask on our Discord server :)
No worries!
I actually did, but nobody there seems to know either :sweat_smile:
I fine-tuned Llama 3.1 8B for 1 epoch on 36,000 samples, with sample token lengths ranging from 1,000 to 20,000. The average sample length is only around 2,000 tokens, though, and 1,600 samples are over 5,000 tokens long.
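(Something like this is how such length stats can be computed; the `train.jsonl` file and `text` column below are placeholders, not my actual setup:)

```python
# Quick sketch for checking the token-length distribution of a dataset.
# "train.jsonl" and the "text" column are placeholders.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
dataset = load_dataset("json", data_files="train.jsonl", split="train")

lengths = [len(tokenizer(row["text"]).input_ids) for row in dataset]
print(f"samples:        {len(lengths)}")
print(f"average length: {sum(lengths) / len(lengths):.0f} tokens")
print(f"> 5000 tokens:  {sum(l > 5000 for l in lengths)}")
```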
I'm training on completions only, and I'm teaching it my own custom prompt format. Over 10,000 samples have completions longer than 1,000 tokens.
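To make "completions only" concrete, this is roughly the mechanism I mean, sketched with TRL's `DataCollatorForCompletionOnlyLM` (the `### Response:` marker is a stand-in for my actual format, and TRL is just one way to do the masking):

```python
from transformers import AutoTokenizer
from trl import DataCollatorForCompletionOnlyLM

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama has no pad token by default

# "### Response:" stands in for whatever marker the custom format uses.
# Passing token ids (not the string) avoids tokenizer boundary mismatches.
response_ids = tokenizer.encode("### Response:", add_special_tokens=False)
collator = DataCollatorForCompletionOnlyLM(response_ids, tokenizer=tokenizer)

# Tokenize the parts separately so the marker's token ids match exactly.
prompt_ids = tokenizer.encode("### Prompt: write a story\n", add_special_tokens=False)
completion_ids = tokenizer.encode(" Once upon a time, ...", add_special_tokens=False)

batch = collator([prompt_ids + response_ids + completion_ids])
# Everything up to and including the marker is labeled -100, so only
# the completion tokens contribute to the loss:
print(batch["labels"])
```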
I'm using LoRA rank 128 with alpha 256. My batch size is 1 with gradient accumulation of 8, so an effective batch size of 8.
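In PEFT/transformers terms, that setup looks roughly like this (the target modules and other hyperparameters here are assumptions, not necessarily what I used):

```python
from peft import LoraConfig
from transformers import TrainingArguments

lora_config = LoraConfig(
    r=128,
    lora_alpha=256,  # alpha = 2 * rank
    lora_dropout=0.0,
    # Assumed: adapters on all attention and MLP projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

training_args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,  # effective batch size of 8
    num_train_epochs=1,
)
```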
Loss
The train loss and eval loss seemed to do OK: on average, train loss went from over 1.4 down to 1.23, and eval loss went from 1.18 to 0.96.
Testing it
But when I actually run inference on something (even a sample that was in the training data), it starts to repeat itself very, very quickly:
For example:
And it goes on and on. I can easily get it to write other stories that seem fine for a few sentences, but they start repeating themselves in some way after a while.
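For reference, these are the kinds of decoding knobs I can tweak (standard `transformers` `generate()` parameters; the model path and prompt are placeholders, and I realize these mask repetition rather than fix undertrained behavior):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("path/to/finetuned-model")  # placeholder
tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-model")

inputs = tokenizer("<prompt in the custom format>", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,           # greedy decoding tends to make loops worse
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,   # mild penalty; large values degrade quality
    no_repeat_ngram_size=4,   # hard-blocks exact 4-gram repeats
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```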
So is something wrong with finetuning on longer outputs? Or do I still not have enough data? Or does finetuning a base model just require a lot more data?