huggingface / alignment-handbook

Robust recipes to align language models with human and AI preferences
https://huggingface.co/HuggingFaceH4

Zephyr-dpo-full Checkpoints perform poorly on TruthfulQA. #122

Open xijiu9 opened 4 months ago

xijiu9 commented 4 months ago

Hello, I observe that neither the models I trained myself nor the official checkpoints provided by HuggingFace match the TruthfulQA results of Zephyr-7b-beta.

I used lm-evaluation-harness for the evaluation, with mc2 as the metric.

The result for HuggingFaceH4/zephyr-7b-beta is 55.15, and the result for the base model mistralai/Mistral-7B-v0.1 is 42.59. Both of these numbers match the reported values.

However, the result for alignment-handbook/zephyr-7b-dpo-full is only 45.07, and the result for alignment-handbook/zephyr-7b-sft-full is only 40.38.

Furthermore, the results from my own trained checkpoints also fall short: the sft-full result is 40.12, and the dpo-full result is 47.40.

The version of lm-evaluation-harness I used is this.
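
For reference, the evaluation was run roughly like the sketch below, using the harness's Python API. This is a minimal sketch, not the exact command: it assumes a v0.4-style harness where TruthfulQA mc2 is exposed as the task `truthfulqa_mc2` (older releases use `truthfulqa_mc` with mc2 as a sub-metric), and the `model_args` and batch size shown here are illustrative.

```python
# Minimal sketch of a TruthfulQA mc2 run with lm-evaluation-harness
# (v0.4-style API assumed; task and key names differ in older versions).
from lm_eval.evaluator import simple_evaluate

results = simple_evaluate(
    model="hf",  # HuggingFace causal-LM backend
    model_args="pretrained=alignment-handbook/zephyr-7b-dpo-full,dtype=bfloat16",
    tasks=["truthfulqa_mc2"],
    batch_size=8,
)

# Per-task metrics live under results["results"]; the exact metric keys
# (e.g. "acc,none") depend on the harness version.
print(results["results"]["truthfulqa_mc2"])
```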

xijiu9 commented 4 months ago

I further ran evaluations on some other datasets; the alignment-handbook/zephyr-7b-dpo-full model still performs worse than HuggingFaceH4/zephyr-7b-beta.

[screenshot of evaluation results attached]