imoneoi / openchat

OpenChat: Advancing Open-source Language Models with Imperfect Data
https://openchat.team
Apache License 2.0

Can not Reproduce benchmarks #186

Closed zhang7346 closed 8 months ago

zhang7346 commented 9 months ago

I use

python -m ochat.evaluation.run_eval --condition "GPT4 Correct" --model openchat/openchat-3.5-0106 --eval_sets coding fs_cothub/bbh fs_cothub/mmlu zs/agieval zs/bbh_mc_orca zs/truthfulqa_orca

and

python -m ochat.evaluation.run_eval --condition "Math Correct" --model openchat/openchat-3.5-0106 --eval_sets fs_cothub/gsm8k zs/math

to reproduce the coding, math, and other reasoning benchmarks, but I can't reproduce the scores listed in the README.

I'm on the newest commit 30da91b20f, with transformers 4.36.1/4.36.2, ochat 3.5.1, and vllm 0.2.1.

imoneoi commented 9 months ago

@zhang7346 Can you post the results or error message? There may be some fluctuations as the implementation of vLLM may change.

By the way, you can check out the independent evalplus leaderboard, which reports slightly higher results than ours.

imoneoi commented 8 months ago

It's likely a vLLM issue, as the latest vLLM produces a lot of empty answers during evaluation. We're actively fixing it up.

imoneoi commented 8 months ago

Closing as vllm>=0.3.3 will fix this issue, and we'll update the package requirements in the next release. Re-open if needed.
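For anyone checking their environment before re-running the evaluation, here is a minimal sketch of a version check against the `vllm>=0.3.3` threshold mentioned above. The helper names (`parse_version`, `meets_minimum`, `check_vllm`) are made up for illustration and are not part of ochat or vLLM; the parsing is deliberately simple and only compares the leading numeric components.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn '0.3.3' into (0, 3, 3); stops at the first non-numeric component."""
    parts = []
    # Drop any local suffix such as '+cu118' before splitting on dots.
    for piece in text.split("+")[0].split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if digits == "":
            break  # e.g. 'post1' or 'dev0': ignore it and everything after
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum="0.3.3"):
    """True if the installed version string is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def check_vllm(minimum="0.3.3"):
    """Return True/False for the installed vllm, or None if it is not installed."""
    try:
        installed = version("vllm")
    except PackageNotFoundError:
        return None
    return meets_minimum(installed, minimum)
```

With the setup from the original report (vllm 0.2.1), `meets_minimum("0.2.1")` is false, so the check would flag that environment as too old.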

zhang7346 commented 8 months ago

Thank you for your reply! Now that I use vllm==0.3.3, I can reproduce almost all the benchmarks (bbh_mc, bbh_cot, agieval, gsm8k, truthfulqa, mmlu) except HumanEval.

I ran the following commands:

python -m ochat.evaluation.run_eval --condition "GPT4 Correct" --model openchat/openchat-3.5-0106 --eval_sets coding
python ochat/evaluation/view_results.py
python ochat/evaluation/convert_to_evalplus.py

and then I ran the evaluation outside of Docker:

evalplus.evaluate --dataset humaneval --samples /ochat/evaluation/evalplus_codegen/openchat3.5-0106_vllm033_transformers4382.jsonl

I got:

Base {'pass@1': 0.25}
Base + Extra {'pass@1': 0.23780487804878048}
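Since empty answers were the symptom reported earlier in this thread, one quick diagnostic is to count empty generations in the converted samples file before handing it to evalplus. The sketch below is only an assumption about the file layout: it expects one JSON object per line with a `task_id` field and the generated code under either `solution` or `completion`. The function name `count_empty_samples` is hypothetical and not part of ochat or evalplus.

```python
import json


def count_empty_samples(lines):
    """Count records and collect the task_ids whose generated text is blank.

    `lines` is any iterable of JSONL strings, for example an open file handle.
    Assumed record layout: {"task_id": ..., "solution" or "completion": ...}.
    """
    total = 0
    empty_ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        record = json.loads(line)
        total += 1
        text = record.get("solution") or record.get("completion") or ""
        if not text.strip():
            empty_ids.append(record.get("task_id"))
    return total, empty_ids
```

To use it, open the samples file and pass the handle directly, e.g. `total, empty_ids = count_empty_samples(open(path))`. A large share of empty entries would point back at generation rather than at the evalplus scoring step.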