huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

About DPO formatting before fine-tuning #116

Closed: alvarobartt closed this issue 4 months ago

alvarobartt commented 5 months ago

Description

Hi here! 🤗

I was wondering what the reason is behind the current approach for preparing the datasets.Dataset before DPO fine-tuning, since it now seems that the assistant token is included within the chosen and rejected samples rather than being kept as part of the prompt, i.e. the tokenizer.apply_chat_template call on the prompt does not set add_generation_prompt=True.

Shouldn't it be part of the prompt, so that chosen and rejected contain only the response itself?

Example for chosen

[
  {"content": "What's the capital of Spain?", "role": "user"},
  {"content": "Madrid", "role": "assistant"},
]

Before (add_generation_prompt=True: the generation prompt ends the prompt, and chosen is just the response)

prompt:
<|system|>
</s>
<|user|>
What's the capital of Spain?</s>
<|assistant|>

chosen:
Madrid

Now (add_generation_prompt=False: the prompt stops after the user turn, and the generation prompt is carried inside chosen)

prompt:
<|system|>
</s>
<|user|>
What's the capital of Spain?</s>

chosen:
<|assistant|>
Madrid
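
For reference, a minimal sketch of how the two splits could be produced with tokenizer.apply_chat_template. The helper names and the string-slicing are my own, not the handbook's actual code, and I'm assuming a Zephyr-style chat template for illustration:

# Sketch (not the handbook's code): two ways to split prompt vs. completion.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

example = {
    "prompt": [{"content": "What's the capital of Spain?", "role": "user"}],
    "chosen": [{"content": "Madrid", "role": "assistant"}],
}

def format_before(example):
    # "Before": add_generation_prompt=True, so the prompt ends with the
    # <|assistant|> token and chosen/rejected are only the response text.
    prompt = tokenizer.apply_chat_template(
        example["prompt"], tokenize=False, add_generation_prompt=True
    )
    return {"text_prompt": prompt, "text_chosen": example["chosen"][0]["content"]}

def format_now(example):
    # "Now": the prompt stops after the user turn, and the generation
    # prompt travels inside chosen/rejected instead.
    prompt = tokenizer.apply_chat_template(example["prompt"], tokenize=False)
    full = tokenizer.apply_chat_template(
        example["prompt"] + example["chosen"], tokenize=False
    )
    return {"text_prompt": prompt, "text_chosen": full[len(prompt):]}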

Thanks in advance!

eryk-mazus commented 5 months ago

I noticed the same thing today while studying their code. Whether that's deliberate or not, I don't think it will hurt performance if we don't add the response tokens to the prompt.

dctanner commented 5 months ago

It looks like this change was made in https://github.com/huggingface/alignment-handbook/commit/f0ffa0d7a6ab666b1f80f3f7dbb3c6364ac31967#diff-0668e2e3ee795fdc034f50182f4719a5f8574357831f2e4705fa730ed2db5831L76 by @lewtun, but I can't spot an explanation. It looks deliberate, so it's probably safe to assume it doesn't affect performance.

alvarobartt commented 4 months ago

Hi @lewtun, friendly pinging you here!

Did you see any performance issues when adding the generation prompt as part of the chosen and rejected pairs instead of keeping it within the prompt itself, as in the former version? I'll be comparing both approaches, but I'm just wondering whether there's an explanation backing the change, or whether it simply worked better in the experiments you ran.

Thanks in advance 🤗

alvarobartt commented 4 months ago

Ok, I've already run two full DPO fine-tunes (similar to the HuggingFaceH4/zephyr-7b-gemma-v0.1 recipe) and both approaches perform similarly, so I guess there are no issues with adding the generation prompt as part of the chosen and rejected pairs; see the wandb screenshot below:

[wandb screenshot comparing the two runs]

16bit is the full DPO fine-tune where add_generation_prompt=True and the generation prompt is then stripped from both chosen and rejected, while 16bit-no-gen-prompt is the full DPO fine-tune where add_generation_prompt=False and chosen and rejected are tokenized normally.
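
For clarity, here's a rough sketch of the stripping step described for the 16bit run. The helpers are hypothetical, not the code I actually ran, and they assume the template output with add_generation_prompt=True is a strict prefix extension of the output without it:

def generation_prompt_suffix(tokenizer, messages):
    # Recover the generation-prompt suffix (e.g. "<|assistant|>\n") by
    # diffing the template output with and without add_generation_prompt.
    with_gp = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    without_gp = tokenizer.apply_chat_template(messages, tokenize=False)
    return with_gp[len(without_gp):]

def strip_generation_prompt(response, gen_prompt):
    # Drop the leading generation prompt from a templated chosen/rejected
    # string so it pairs with a prompt that already ends with that suffix.
    if response.startswith(gen_prompt):
        return response[len(gen_prompt):]
    return response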