hengjiUSTC opened this issue 2 months ago
Training loss seems normal.
Hi @hengjiUSTC!
I am no LoRA or DPO expert, but I believe you are correct that there is length truncation going on. Take a look at this function in the DPO trainer: https://github.com/huggingface/trl/blob/850ddcf598984013007d384c6b3e311def2a616e/trl/trainer/dpo_trainer.py#L149
Here we have the following code:
```python
c_len = len(c_tokens["prompt_input_ids"])
r_len = len(r_tokens["prompt_input_ids"])
min_len = min(c_len, r_len)
for k, v in p_tokens.items():
    p_tokens[k] = v[:min_len]
```
This essentially finds the shorter of the two tokenized sequences and truncates the longer one so that both have the same length. As a result, only the "common" token length is trained on, and your training does not have the intended effect.
So you will probably need a different technique to make the model prefer shorter responses. One option is to build a penalty for longer responses directly into the loss function; another is to adjust the reward calculation so that response length is factored in explicitly and shorter responses receive higher rewards. A sketch of the first option follows below.
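To make that concrete, here is a minimal sketch of a sigmoid DPO loss with an explicit per-token length term, in the spirit of length-regularized DPO variants (exact formulations and sign conventions vary in the literature). It is written in plain PyTorch rather than against the TRL API; the function name `length_penalized_dpo_loss` and the `alpha` coefficient are illustrative assumptions, not anything TRL ships.

```python
import torch
import torch.nn.functional as F


def length_penalized_dpo_loss(
    policy_chosen_logps: torch.Tensor,      # log pi_theta(y_chosen | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,    # log pi_theta(y_rejected | x), shape (batch,)
    reference_chosen_logps: torch.Tensor,   # log pi_ref(y_chosen | x), shape (batch,)
    reference_rejected_logps: torch.Tensor, # log pi_ref(y_rejected | x), shape (batch,)
    chosen_lengths: torch.Tensor,           # response-token counts of the chosen answers
    rejected_lengths: torch.Tensor,         # response-token counts of the rejected answers
    beta: float = 0.2,
    alpha: float = 0.01,                    # per-token length penalty, needs tuning
) -> torch.Tensor:
    """Sigmoid DPO loss with an implicit per-token length penalty.

    Treating alpha tokens' worth of the preference as "explained by length"
    discounts the learned implicit reward beta * log(pi / pi_ref) for longer
    answers, so the resulting policy is biased toward brevity.
    """
    # Standard DPO margin between chosen and rejected.
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    margin = beta * (pi_logratios - ref_logratios)

    # Length term: negative whenever the chosen answer is shorter, so the
    # policy must build an even larger margin in favor of the short answer.
    margin = margin + alpha * (chosen_lengths.float() - rejected_lengths.float())

    return -F.logsigmoid(margin).mean()
```

In practice `alpha` has to be tuned against the typical length gap in your pairs; too large a value drowns out the actual preference signal. Wiring this into TRL would mean subclassing `DPOTrainer` and overriding its loss computation (`dpo_loss` in recent versions), whose surrounding signatures differ between releases, so the sketch is kept deliberately version-agnostic.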
I hope that this helps 😄
System Info
TRL official DPO examples. Fine-tuning Llama 3.1 with LoRA.

params:
```yaml
lora_rank: 32
lora_target: all
pref_beta: 0.2
pref_loss: sigmoid
```

dataset:
```yaml
dataset: train_data
template: chatml
cutoff_len: 4096
max_samples: 5000
overwrite_cache: true
```

train:
```yaml
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
optim: paged_adamw_32bit
learning_rate: 1.0e-6
num_train_epochs: 2.0
lr_scheduler_type: cosine
warmup_ratio: 0.05
bf16: true
ddp_timeout: 180000000
```
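For reference, a roughly equivalent run driven directly through TRL's `DPOTrainer` might look like the sketch below. The model id, dataset path, output directory, and LoRA hyperparameters other than the rank are placeholders, and keyword names (`DPOConfig` fields, `tokenizer` vs. `processing_class`) vary across TRL versions.

```python
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model id
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Expects columns "prompt", "chosen", "rejected"; path is a placeholder.
dataset = load_dataset("json", data_files="train_data.json", split="train")

peft_config = LoraConfig(r=32, lora_alpha=64, target_modules="all-linear", task_type="CAUSAL_LM")

args = DPOConfig(
    output_dir="llama31-dpo-short",
    beta=0.2,                       # pref_beta
    loss_type="sigmoid",            # pref_loss
    max_length=4096,                # cutoff_len
    per_device_train_batch_size=1,
    gradient_accumulation_steps=1,
    optim="paged_adamw_32bit",
    learning_rate=1.0e-6,
    num_train_epochs=2.0,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                 # with a PEFT adapter, the frozen base weights serve as the reference
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,            # `processing_class=tokenizer` in newer TRL releases
    peft_config=peft_config,
)
trainer.train()
```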
Reproduction
Recently, I've been trying to adjust the model's response length using DPO, but it hasn't had any effect. The training method involves using shorter answers as "chosen" and longer answers as "rejected." The training loss is normal, but when testing with the trained model, the response length hasn't changed. Could it be that DPO by default performs length normalization, making it unsuitable for optimizing the model's preference for response length?
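As an illustration of that construction, here is a minimal sketch of turning raw answer pairs into DPO preference rows where the shorter answer is always `chosen`. The field names (`prompt`, `answer_a`, `answer_b`) and the use of the tokenizer for length measurement are assumptions about the raw data, not a description of the actual dataset.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # placeholder


def to_preference_row(example: dict) -> dict:
    """Map a (prompt, answer_a, answer_b) record to a DPO row that prefers brevity."""
    len_a = len(tokenizer(example["answer_a"])["input_ids"])
    len_b = len(tokenizer(example["answer_b"])["input_ids"])
    shorter, longer = (
        (example["answer_a"], example["answer_b"])
        if len_a <= len_b
        else (example["answer_b"], example["answer_a"])
    )
    return {"prompt": example["prompt"], "chosen": shorter, "rejected": longer}


# With a datasets.Dataset of raw pairs:
# preference_dataset = raw_dataset.map(to_preference_row)
```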
Expected behavior
Expecting the model's responses to be shorter after DPO.