huggingface / trl

Train transformer language models with reinforcement learning.
http://hf.co/docs/trl

Can DPO be used to shorten the model response length preference? #2003

Open hengjiUSTC opened 2 weeks ago

hengjiUSTC commented 2 weeks ago

System Info

trl official DPO examples, fine-tuning Llama 3.1 with LoRA.

params:

    lora_rank: 32
    lora_target: all
    pref_beta: 0.2
    pref_loss: sigmoid

dataset

    dataset: train_data
    template: chatml
    cutoff_len: 4096
    max_samples: 5000
    overwrite_cache: true

train

    per_device_train_batch_size: 1
    gradient_accumulation_steps: 1
    optim: paged_adamw_32bit
    learning_rate: 1.0e-6
    num_train_epochs: 2.0
    lr_scheduler_type: cosine
    warmup_ratio: 0.05
    bf16: true
    ddp_timeout: 180000000
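For context, here is a rough, untested sketch of how roughly equivalent settings could be expressed directly with trl's DPOConfig / DPOTrainer. The model name, output path, and dataset contents are placeholders, and exact argument names vary between trl versions (newer releases take processing_class instead of tokenizer):

    from datasets import Dataset
    from peft import LoraConfig
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import DPOConfig, DPOTrainer

    model_name = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # placeholder
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Preference pairs: shorter answer as "chosen", longer one as "rejected".
    train_dataset = Dataset.from_list([
        {"prompt": "...", "chosen": "a short answer", "rejected": "a much longer answer ..."},
    ])

    # lora_rank: 32, lora_target: all ("all-linear" is the closest peft equivalent)
    peft_config = LoraConfig(r=32, target_modules="all-linear", task_type="CAUSAL_LM")

    args = DPOConfig(
        output_dir="llama31-dpo-length",   # placeholder
        beta=0.2,                          # pref_beta
        loss_type="sigmoid",               # pref_loss
        max_length=4096,                   # cutoff_len
        per_device_train_batch_size=1,
        gradient_accumulation_steps=1,
        optim="paged_adamw_32bit",
        learning_rate=1.0e-6,
        num_train_epochs=2.0,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        bf16=True,
    )

    trainer = DPOTrainer(model, args=args, train_dataset=train_dataset,
                         tokenizer=tokenizer, peft_config=peft_config)
    trainer.train()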

Reproduction

Recently, I've been trying to adjust the model's response length using DPO, but it hasn't had any effect. The training method involves using shorter answers as "chosen" and longer answers as "rejected." The training loss is normal, but when testing with the trained model, the response length hasn't changed. Could it be that DPO by default performs length normalization, making it unsuitable for optimizing the model's preference for response length?
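For concreteness, a minimal sketch of that pair construction, assuming the raw data contains two candidate answers per prompt (answer_a / answer_b are hypothetical field names; prompt / chosen / rejected are the columns trl's DPOTrainer expects):

    def to_length_preference_pair(example):
        """Use the shorter answer as 'chosen' and the longer one as 'rejected'."""
        a, b = example["answer_a"], example["answer_b"]  # hypothetical field names
        shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
        return {"prompt": example["prompt"], "chosen": shorter, "rejected": longer}

    # e.g. with a datasets.Dataset: pairs = raw_dataset.map(to_length_preference_pair)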

Expected behavior

The model's responses are expected to be shorter after DPO training.

hengjiUSTC commented 2 weeks ago

Training loss seems normal.

(Screenshots: training loss curves, 2024-09-01)
northern-64bit commented 2 weeks ago

Hi @hengjiUSTC!

I am no LoRA or DPO expert, but I believe you are correct that there is length truncation going on. Take a look at this function in the DPO trainer: https://github.com/huggingface/trl/blob/850ddcf598984013007d384c6b3e311def2a616e/trl/trainer/dpo_trainer.py#L149

Here we have the following code:

        c_len = len(c_tokens["prompt_input_ids"])
        r_len = len(r_tokens["prompt_input_ids"])
        min_len = min(c_len, r_len)

        for k, v in p_tokens.items():
            p_tokens[k] = v[:min_len]

This essentially finds the shorter of the two prompt tokenizations (the prompt portion of the chosen and of the rejected sequence) and truncates the prompt tokens to that length, so that both sides of the pair share the same prompt length. Therefore the pair is only trained on this "common" token length, and your training does not have the intended consequence.
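A toy illustration of that truncation (made-up token ids, not real tokenizer output):

    # c_tokens / r_tokens hold the prompt portion of the chosen / rejected
    # tokenization; p_tokens holds the prompt tensors that get truncated.
    c_tokens = {"prompt_input_ids": [5, 11, 42, 7]}   # 4 prompt ids
    r_tokens = {"prompt_input_ids": [5, 11, 42]}      # 3 prompt ids
    p_tokens = {"prompt_input_ids": [5, 11, 42, 7],
                "prompt_attention_mask": [1, 1, 1, 1]}

    min_len = min(len(c_tokens["prompt_input_ids"]), len(r_tokens["prompt_input_ids"]))
    for k, v in p_tokens.items():
        p_tokens[k] = v[:min_len]

    print(p_tokens)
    # {'prompt_input_ids': [5, 11, 42], 'prompt_attention_mask': [1, 1, 1]}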

So you probably have to try another technique to make the model prefer shorter responses. One option is to build a penalty for longer responses directly into the loss function; another is to modify the reward calculation so that response length is an explicit factor, with shorter responses receiving higher rewards. A rough sketch of the first idea follows below.
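A minimal, untested sketch of that first idea, in the spirit of length-regularized DPO variants (e.g. R-DPO): shift each response's implicit reward by -alpha * length, which adds a length-difference term to the usual sigmoid DPO margin. This is standalone PyTorch, not part of trl's API; alpha needs tuning, and whether it actually shortens generations has to be verified empirically (one way to try it is to subclass DPOTrainer and override its loss computation):

    import torch
    import torch.nn.functional as F

    def length_regularized_dpo_loss(
        policy_chosen_logps,    # sum of log pi(y_w | x) over the chosen response tokens
        policy_rejected_logps,  # sum of log pi(y_l | x) over the rejected response tokens
        ref_chosen_logps,       # the same quantities under the frozen reference model
        ref_rejected_logps,
        chosen_lengths,         # response length |y_w| in tokens
        rejected_lengths,       # response length |y_l| in tokens
        beta=0.2,
        alpha=0.01,             # strength of the length penalty (needs tuning)
    ):
        """Sigmoid DPO loss where each implicit reward is shifted by -alpha * length,
        so longer responses receive a lower reward."""
        pi_logratios = policy_chosen_logps - policy_rejected_logps
        ref_logratios = ref_chosen_logps - ref_rejected_logps
        logits = pi_logratios - ref_logratios

        # Penalizing length in the reward adds alpha * (|y_l| - |y_w|) to the margin.
        length_term = alpha * (rejected_lengths - chosen_lengths).float()

        return -F.logsigmoid(beta * logits + length_term).mean()

    # Dummy example: the chosen answer is both more likely and much shorter.
    loss = length_regularized_dpo_loss(
        torch.tensor([-50.0]), torch.tensor([-80.0]),
        torch.tensor([-55.0]), torch.tensor([-78.0]),
        torch.tensor([20]), torch.tensor([60]),
    )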

I hope that this helps 😄