huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

DPO loss #38

Open JiuhaiChen opened 8 months ago

JiuhaiChen commented 8 months ago

I am training DPO with LoRA, and the loss has weird behavior: it decreases sharply at the beginning of each epoch. I wonder if you have seen the same issue before?

[screenshot, 2023-11-17: training loss curve]
ChenDRAG commented 8 months ago

It seems that full fine-tuning has this problem, while LoRA doesn't. Could you share your YAML training configuration? Also, how many GPUs are you using?


JiuhaiChen commented 8 months ago

Thanks for your reply. I haven't tried full-model fine-tuning. For the LoRA run, I only changed: `gradient_accumulation_steps: 1`, `per_device_train_batch_size: 16`, `per_device_eval_batch_size: 4`, `save_strategy: "epoch"`. I am using 8 A6000s. Also, I am not sure whether you observed the eval loss increasing during training. [screenshot: eval loss curve]
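As a minimal sketch (assuming everything else stays at the repo's default `config_lora.yaml` values), the overrides look like this:

```yaml
# Overrides on top of the recipe's default config_lora.yaml;
# everything not listed here is assumed to keep the repo default.
gradient_accumulation_steps: 1
per_device_train_batch_size: 16
per_device_eval_batch_size: 4
save_strategy: "epoch"
# On 8x A6000 with no accumulation, the effective batch size per
# optimizer step is 8 * 16 = 128 preference pairs.
```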

ChenDRAG commented 8 months ago

Sorry, I did not encounter this problem. Are you using the official binarized dataset? What is your base model? Though I don't think those matter that much.

JiuhaiChen commented 8 months ago

Yeah, I agree the eval loss doesn't matter much. For the LoRA run, how many cards are you using?

ChenDRAG commented 8 months ago

8 A40 cards. My new experiments also run into this problem.

Difference between the two configurations:

| Run      | Batch size | Accumulation | Cards | LR   |
|----------|------------|--------------|-------|------|
| previous | 4          | 2            | 8     | 1e-7 |
| new      | 8          | 1            | 8     | 1e-4 |

I think the main change is that I increased the lr a lot. Are you sure you used lr=1e-7 in your experiments?
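As a sketch, here are just the fields that differ between the two runs (everything else is assumed identical and left at the handbook defaults):

```yaml
# previous run
per_device_train_batch_size: 4
gradient_accumulation_steps: 2
learning_rate: 1.0e-7   # effective batch = 4 * 2 * 8 GPUs = 64
---
# new run
per_device_train_batch_size: 8
gradient_accumulation_steps: 1
learning_rate: 1.0e-4   # effective batch = 8 * 1 * 8 GPUs = 64
```

With the effective batch size unchanged at 64, the learning rate really is the only substantive difference between the two runs.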

NicolasMejiaPetit commented 8 months ago

I'm currently training a LoRA across all Mistral modules with the standard settings, except with no eval and a batch size of 1, on a 3090. My loss is hitting 0.29 and it's only been training for 180 steps (0.4 epochs).

Edit: at epoch 0.52 (210 steps in), the loss is at 0.18 and rewards/accuracies is 1.0.

fblgit commented 8 months ago

Quite weird. I just ran the DPO training and my loss is normal across epochs, pretty much in line with the results shared on the HF model card. How about rebasing and trying again? A loss of 0.29 or lower that early almost certainly means the model is somehow seeing the right prediction token.
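For context, under the standard sigmoid DPO objective (the default `loss_type` in TRL's `DPOTrainer`), the per-pair loss is

$$
\mathcal{L}_\mathrm{DPO} = -\log \sigma\!\left(\beta\left[\log\frac{\pi_\theta(y_w\mid x)}{\pi_\mathrm{ref}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_\mathrm{ref}(y_l\mid x)}\right]\right),
$$

so a model with zero margin sits at $\log 2 \approx 0.693$. A loss of 0.29 implies $\sigma(\beta\Delta) \approx e^{-0.29} \approx 0.75$, i.e. $\beta\Delta \approx 1.1$; with a typical $\beta = 0.1$ that is already an implicit reward margin of roughly 11 nats of log-probability, which is why such a low loss this early in training looks suspicious.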