parap1uie-s opened 9 months ago

Describe the issue

Issue:
I am finetuning LLaVA 1.5 13b with scripts/v1.5/finetune_task_lora.sh on my custom dataset, which consists of single-round Chinese conversations. Training looks normal (loss ~0.4) until some iteration (seemingly at random; no pattern found yet), at which point the loss flies away and does not descend again until the epoch ends. After the finetune procedure, I use serve.cli to check the output of my model, but it produces garbled UTF-8 text like:

Any suggestions? Thanks in advance.

Command:

Log:

Screenshots:
Hi, have you figured out the reason? I have met the same problem when finetuning LoRA, both on the official data and on our downstream task data.
Not yet. However, I guess the cause of the error is a NaN or Inf appearing during training. A larger per_device_train_batch_size or gradient_accumulation_steps might help, and shorter training runs may also reduce the probability of such occurrences.
Still looking forward to a reply from the maintainers.
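In case it helps others confirm the NaN/Inf guess, here is a minimal sketch of a callback that stops the run as soon as the logged loss becomes non-finite. It assumes training goes through transformers.Trainer (as the LLaVA scripts do); the class name NanLossGuard is my own, not part of the repo.

```python
import math

from transformers import TrainerCallback


class NanLossGuard(TrainerCallback):
    """Hypothetical helper: stop training once the logged loss is NaN/Inf."""

    def on_log(self, args, state, control, logs=None, **kwargs):
        loss = (logs or {}).get("loss")
        if loss is not None and not math.isfinite(loss):
            # Flag the exact step where the spike starts, then halt the run.
            print(f"Non-finite loss ({loss}) at step {state.global_step}; stopping.")
            control.should_training_stop = True
        return control
```

Passing callbacks=[NanLossGuard()] when constructing the Trainer would at least tell you the exact step where the divergence begins, instead of letting the run burn through the rest of the epoch.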
Please try lowering the learning rate to 1e-4 or lower, which can remedy the training instability. Thanks.
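For reference, here is a minimal sketch of what that change looks like if you configure the run through transformers.TrainingArguments (editing the corresponding --learning_rate flag in scripts/v1.5/finetune_task_lora.sh does the same thing); every value other than learning_rate is illustrative, not taken from the repo.

```python
from transformers import TrainingArguments

# Illustrative settings; learning_rate is the only deliberate change here.
training_args = TrainingArguments(
    output_dir="./checkpoints/llava-lora-debug",  # hypothetical path
    learning_rate=1e-4,               # lowered as suggested above
    warmup_ratio=0.03,
    max_grad_norm=1.0,                # gradient clipping can also dampen spikes
    per_device_train_batch_size=16,
    gradient_accumulation_steps=1,
    num_train_epochs=1,
)
```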
Thanks for the reply, I will try reducing the learning rate. Also, is this anomaly related to the loss spikes reported in "A Theory on Adam Instability in Large-Scale Machine Learning"?
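For context, the quantity that paper studies is Adam's parameter update; a standard statement of the rule (notation mine, with bias-corrected moments) is:

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2, \qquad
\theta_t = \theta_{t-1} - \eta\,\frac{m_t/(1-\beta_1^t)}{\sqrt{v_t/(1-\beta_2^t)} + \epsilon}
$$

As I understand it, the paper attributes spikes to this update ratio becoming ill-behaved when some gradient components stay near zero for many steps, which is also why a smaller $\eta$ limits how much damage a single bad update can do.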
@parap1uie-s have you had a chance to solve this problem? I am having the same problem using the default data from the README.
Here are my training loss and epoch curves. I don't understand how the loss spikes after being stable across multiple epochs, while the learning rate is also slowly decreasing.