princeton-nlp / SimPO

SimPO: Simple Preference Optimization with a Reference-Free Reward

About evaluating SimPO-v0.2 with Arena-Hard #68

Open jimmy19991222 opened 4 hours ago

jimmy19991222 commented 4 hours ago

Hi, I tried to evaluate the Llama-3-Instruct-8B-SimPO-v0.2 checkpoint with arena-hard-auto, and I only got

Llama-3-Instruct-8B-SimPO-v0.2 | score: 35.4 | 95% CI: (-3.2, 2.0) | average #tokens: 530

while your paper reports 36.5.

So I am wondering whether my vLLM API server settings are correct:

python3 -m vllm.entrypoints.openai.api_server \
        --model path-to-SimPO-v0.2 \
        --host 0.0.0.0 --port 5001 --served-model-name SimPO-v0.2 \
        --chat-template templates/llama3.jinja
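
As a quick way to double-check that setting, a minimal probe of the endpoint could look like the sketch below. It assumes the server above is reachable at http://localhost:5001 and serves the name "SimPO-v0.2"; the prompt is just a placeholder.

    # Minimal probe of the vLLM OpenAI-compatible server started above.
    # Assumes it is reachable at http://localhost:5001 and serves the name "SimPO-v0.2".
    import requests

    resp = requests.post(
        "http://localhost:5001/v1/chat/completions",
        json={
            "model": "SimPO-v0.2",
            "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
            "temperature": 0.0,
            "max_tokens": 64,
        },
        timeout=60,
    )
    resp.raise_for_status()
    answer = resp.json()["choices"][0]["message"]["content"]
    print(repr(answer))
    # A trailing '<|eot_id|>' here would suggest the chat template or stop
    # tokens are not being applied as intended.
    assert not answer.rstrip().endswith("<|eot_id|>")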
jimmy19991222 commented 4 hours ago

I have checked that there is no '<|eot_id|>' at the end of the generated answers.
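
In case it helps to reproduce that check, a rough sketch is below; the answer-file path and the jsonl field layout are assumptions, so adjust them to the actual arena-hard-auto output.

    # Rough sketch: scan an arena-hard-auto answer file for responses ending in '<|eot_id|>'.
    # The path and the "choices"/"turns" jsonl layout are assumptions about the
    # answer file format and may need adjusting.
    import json

    path = "data/arena-hard-v0.1/model_answer/SimPO-v0.2.jsonl"  # placeholder path
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            for choice in record.get("choices", []):
                for turn in choice.get("turns", []):
                    text = turn.get("content", "") if isinstance(turn, dict) else turn
                    if text.rstrip().endswith("<|eot_id|>"):
                        print("trailing <|eot_id|> in", record.get("question_id"))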