@zhang7346 Can you post the results or error message? There may be some fluctuations as the implementation of vLLM may change.
By the way, you can check out the independent evalplus leaderboard, which reports slightly higher results than ours.
It's likely a vLLM issue, as the latest vLLM produces a lot of empty answers during evaluation. We're actively fixing it up.
Closing, as vllm>=0.3.3 will fix this issue; we'll update the package requirements in the next release. Re-open if needed.
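For anyone checking their environment against the new pin, here is a minimal sketch (assuming the `packaging` library is available):

```python
# Minimal sketch: verify the installed vLLM meets the >=0.3.3 requirement
# mentioned above. Uses importlib.metadata (stdlib) plus packaging for the
# version comparison.
from importlib.metadata import version
from packaging.version import Version

installed = Version(version("vllm"))
print(f"vllm {installed} installed; vllm>=0.3.3 satisfied: {installed >= Version('0.3.3')}")
```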
Thank you for your reply! With vllm==0.3.3 I can now reproduce almost all of the benchmarks (bbh_mc, bbh_cot, agieval, gsm8k, truthfulqa, mmlu) except HumanEval.
I run the following commands:

```
python -m ochat.evaluation.run_eval --condition "GPT4 Correct" --model openchat/openchat-3.5-0106 --eval_sets coding
python ochat/evaluation/view_results.py
python ochat/evaluation/convert_to_evalplus.py
```

and then I run the evaluation outside of Docker:

```
evalplus.evaluate --dataset humaneval --samples /ochat/evaluation/evalplus_codegen/openchat3.5-0106_vllm033_transformers4382.jsonl
```

I got:

```
Base
{'pass@1': 0.25}
Base + Extra
{'pass@1': 0.23780487804878048}
```
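To rule out the empty-answer problem mentioned above on my side, I also ran a quick check over the converted samples. This is only a rough sketch: the `solution`/`completion` field names are a guess at what `convert_to_evalplus.py` writes out.

```python
# Rough sanity check: count records in the converted evalplus samples whose
# generated code is empty (the symptom of the vLLM bug discussed above).
# The "solution"/"completion" keys are assumptions about the jsonl schema.
import json

path = "/ochat/evaluation/evalplus_codegen/openchat3.5-0106_vllm033_transformers4382.jsonl"
empty = total = 0
with open(path) as f:
    for line in f:
        record = json.loads(line)
        code = record.get("solution") or record.get("completion") or ""
        total += 1
        if not code.strip():
            empty += 1
print(f"{empty}/{total} samples have an empty generation")
```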
I use
and
to reproduce the code, math, and other reasoning benchmarks, but I can't reproduce the scores listed in the README.
I use the newest commit 30da91b20f, transformers 4.36.1/4.36.2, ochat 3.5.1, vllm 0.2.1.
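For completeness, a small sketch to dump these versions directly from the environment (assuming the pip distribution names are `vllm`, `transformers`, and `ochat`):

```python
# Print the package versions relevant to reproducing the reported numbers,
# for comparison with the environment listed above.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("vllm", "transformers", "ochat"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")
```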