huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4
Apache License 2.0

Cannot reproduce zephyr-7b-gemma-v0.1 #148

Closed jasonyux closed 2 months ago

jasonyux commented 3 months ago

I tried to reproduce zephyr-7b-gemma-v0.1 using the exact code provided in this repository on 4xA100 GPUs. However, the resulting MT-bench score was much lower than reported: 6.63, versus the 7.81 reported on the Hugging Face model page.

I wonder if anyone else is encountering this issue?

Command run (the same as the one mentioned in the repo, but with gradient accumulation modified since I am using only 4xA100s):

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/deepspeed_zero3.yaml \
scripts/run_dpo.py recipes/zephyr-7b-gemma/dpo/config_full.yaml \
--output_dir=xxx/zephyr-7b-gemma-dpo-full_reprod \
--num_train_epochs=2 \
--gradient_accumulation_steps=16
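
(Assuming the recipe's default of 8 GPUs and gradient_accumulation_steps=8, halving the GPU count to 4 while doubling accumulation to 16 should keep the effective global batch size unchanged, since it is the product GPUs × per-device batch × accumulation steps.)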

and when generating model answers for MT-bench I used the default command:

python gen_model_answer.py --model-path [MODEL-PATH] --model-id [MODEL-ID]
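
The judging and scoring steps then follow FastChat's standard MT-bench workflow (commands as documented in FastChat's llm_judge README; exact flags may vary across versions):

# generate GPT-4 judgments for the answers, then display aggregated scores
python gen_judgment.py --model-list [MODEL-ID] --parallel 2
python show_result.py --model-list [MODEL-ID]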

Related library versions I used:

Training curves from wandb: [image]

Eval reward curves: [image]

jasonyux commented 2 months ago

It seems that the issue is with the chat template used by fastchat during evaluation. Registering the following template for testing H4's gemma models recovers the reported performance:

from fastchat.conversation import Conversation, SeparatorStyle, register_conv_template

# ChatML-style template matching the chat template the H4 gemma models
# were trained with
register_conv_template(
    Conversation(
        name="h4_gemma_chatml",
        system_template="<bos><|im_start|>system\n{system_message}",
        system_message="You are an AI assistant.",
        roles=("<|im_start|>user", "<|im_start|>assistant"),
        sep_style=SeparatorStyle.CHATML,
        sep="<|im_end|>",
        stop_str=["<|im_end|>", "<|endoftext|>"],
    )
)

# other init code omitted
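
As a quick sanity check, here is a minimal sketch of rendering a prompt with the registered template (get_conv_template, append_message, and get_prompt are standard fastchat APIs; the message content is just illustrative):

from fastchat.conversation import get_conv_template

# fetch a copy of the template registered above and build a sample prompt
conv = get_conv_template("h4_gemma_chatml")
conv.append_message(conv.roles[0], "What is the capital of France?")
conv.append_message(conv.roles[1], None)
print(conv.get_prompt())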
fanconic commented 1 month ago

May I ask where this template originates from?

jasonyux commented 1 month ago

This comes from how the model is trained by the run_dpo.py script: there, the chat data is first formatted with the tokenizer's chat template and then fed into the trainer. fschat, however, relies on its own hardcoded templates, so unless you use (maybe) the latest version that includes this one, it will not apply the same template during evaluation, which leads to the performance degradation.
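
To illustrate the training-side formatting (a minimal sketch, assuming the released HuggingFaceH4/zephyr-7b-gemma-v0.1 tokenizer carries the ChatML-style chat template):

from transformers import AutoTokenizer

# load the tokenizer whose chat template run_dpo.py applies to the data
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-gemma-v0.1")

messages = [
    {"role": "system", "content": "You are an AI assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# render the prompt exactly as it appears at training time; an evaluation
# harness must reproduce this format to avoid a train/eval mismatch
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))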