Open xijiu9 opened 4 months ago
Hello, I observe that both the models I trained myself and the official checkpoints provided by HuggingFace fail to match the reported Zephyr-7b-beta results on TruthfulQA.
I evaluated with lm-evaluation-harness, using the mc2 metric.
The result for HuggingFaceH4/zephyr-7b-beta is 55.15, and the result for mistralai/Mistral-7B-v0.1 is 42.59. Both of these numbers match the reported values.
However, the result for alignment-handbook/zephyr-7b-dpo-full is only 45.07, and the result for alignment-handbook/zephyr-7b-sft-full is only 40.38.
Furthermore, the results from my own trained checkpoints are also off: sft-full scores 40.12 and dpo-full scores 47.40.
The version of lm-evaluation-harness I used is this
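For reference, here is a sketch of the kind of command I mean, using the lm-evaluation-harness CLI (flag names follow the v0.4.x `lm_eval` entry point; older releases instead use `python main.py` with the task named `truthfulqa_mc`, which reports mc2 as a sub-metric, so the exact task name depends on the harness version):

```shell
# Zero-shot TruthfulQA mc2 evaluation of the HF checkpoint (v0.4.x-style CLI).
# Swap `pretrained=` for alignment-handbook/zephyr-7b-dpo-full, zephyr-7b-sft-full,
# or a local checkpoint path to reproduce the other numbers.
lm_eval --model hf \
  --model_args pretrained=HuggingFaceH4/zephyr-7b-beta,dtype=bfloat16 \
  --tasks truthfulqa_mc2 \
  --num_fewshot 0 \
  --batch_size 8
```

Pinning the harness commit matters here, since task definitions and prompt formatting have changed between releases and can shift mc2 by a few points.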
I further do evaluation on some other datasets: the alignment-handbook/zephyr-7b-dpo-full model still performs worse than HuggingFaceH4/zephyr-7b-beta.