nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

How to reproduce the result of 59.8 on HumanEval with WizardCoder? The result I reproduced is only 47. #128

Open lmx760581375 opened 1 year ago

lmx760581375 commented 1 year ago

I downloaded the WizardCoder model weights from Hugging Face and tested with N=200 on HumanEval. The test script is your humaneval_gen.py, and temperature=0.2, top_p=0.95 are the same settings mentioned in your paper, but the result I measured was only 47.
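
For reference, my generation setup looks roughly like this. This is only a minimal sketch using the Hugging Face transformers API, not your humaneval_gen.py; the checkpoint name and prompt are placeholders, and the WizardCoder instruction template is omitted:

```python
# Minimal sketch of the sampling setup described above (not humaneval_gen.py).
# The checkpoint name and prompt below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "WizardLM/WizardCoder-15B-V1.0"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

prompt = 'def add(a, b):\n    """Return the sum of a and b."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling with the parameters from the paper; the N=200 samples per problem
# would be collected over multiple batches in practice.
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.2,
    top_p=0.95,
    max_new_tokens=256,
    num_return_sequences=20,
)
completions = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```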

[Screenshot: evaluation results]

ChiYeungLaw commented 1 year ago

From your figure, I see that you evaluated WizardCoder with HumanEval-X, which is different from OpenAI's HumanEval. To reproduce the results, you can follow this. Or you can try this great toolkit, code-eval.

The performance is 100% reproducible. If you still have problems, please contact us.

lmx760581375 commented 1 year ago

In fact, the prompts of HumanEval-X and HumanEval should be the same.

I checked again and found that my generation parameters were still different from yours: due to my own negligence, I had missed the parameter configuration in the README.

When I set the temperature to 0 and switched to greedy decoding, with N=1 and everything else aligned with your README, my result is close to yours.
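
For comparison, the greedy configuration looks roughly like this (a minimal sketch reusing `model`, `tokenizer`, and `inputs` from the sampling sketch above; with do_sample=False the decoding is deterministic, so temperature and top_p no longer apply):

```python
# Greedy decoding sketch (reuses `model`, `tokenizer`, `inputs` from above).
outputs = model.generate(
    **inputs,
    do_sample=False,          # greedy: always take the argmax token
    max_new_tokens=256,
    num_return_sequences=1,   # N=1 is enough, the output is deterministic
)
completion = tokenizer.decode(outputs[0], skip_special_tokens=True)
```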

But I have a big question: if the decoding parameters have such a large effect, does that mean the robustness is not good? This test makes me feel that I can gain 10 points just by adjusting the parameters, which seems exaggerated.

ChiYeungLaw commented 1 year ago

We have done three experiments:

  1. n=1, greedy decoding: pass@1 on HumanEval is 59.8
  2. n=20, temperature=0.2, top_p=0.95: pass@1 on HumanEval is 57.3
  3. n=200, temperature=0.2, top_p=0.95: pass@1 on HumanEval is 57.2

From our results, the robustness of WizardCoder is good. Since the 59.8 score comes from greedy decoding, it is not affected by any randomness. If you cannot get this score, your testing process most likely contains a problem. Since our models are evaluated with OpenAI's HumanEval, we suggest you follow our instructions to reproduce the results.
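
For reference, pass@1 in the sampled settings (n=20, n=200) is typically computed with the unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021). A minimal sketch of that estimator:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total generated samples per problem
    c: number of samples that pass the unit tests
    k: the k in pass@k
    """
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# For k=1 this reduces to c/n, e.g. 11 correct out of 20 samples -> 0.55.
print(pass_at_k(20, 11, 1))
```

With greedy decoding and n=1, pass@1 is simply the fraction of problems whose single completion passes the tests, which is why the 59.8 score involves no sampling variance.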