nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

Cannot reproduce WizardCoder-Python series #196

Closed by yingfhu 10 months ago

yingfhu commented 10 months ago

We use the Hugging Face model WizardLM/WizardCoder-Python-34B-V1.0 but cannot reproduce the reported HumanEval score (55 vs. 73) using the prompt from your evaluation script. (The low accuracy is caused by AssertionError failures, not by syntax errors.)

However, the WizardCoder-xB-V1.0 series achieves reasonably convincing results.

Is there any difference between these two series when evaluating? Thanks.
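For readers who want to check which failure type dominates in their own runs, here is a minimal sketch of separating syntax errors from assertion failures. It is not the repository's evaluation script: `classify` is a hypothetical helper, the three programs are made-up stand-ins for "prompt + completion + test" strings, and `exec` is used without any sandboxing, which a real HumanEval harness should not do.

```python
from collections import Counter


def classify(program: str) -> str:
    """Run one candidate program and label how it ended (hypothetical helper)."""
    try:
        # Compiling first separates code that does not parse from code that runs.
        code = compile(program, "<candidate>", "exec")
    except SyntaxError:
        return "syntax_error"
    try:
        # WARNING: no sandbox here; only run code you trust.
        exec(code, {})
    except AssertionError:
        return "assertion_error"  # the code ran but a test check failed
    except Exception as exc:
        return f"runtime_error:{type(exc).__name__}"
    return "passed"


# Made-up examples standing in for generated solutions plus their tests.
good = "def add(a, b):\n    return a + b\nassert add(1, 2) == 3\n"
wrong = "def add(a, b):\n    return a - b\nassert add(1, 2) == 3\n"
broken = "def add(a, b)\n    return a + b\n"

summary = Counter(classify(p) for p in (good, wrong, broken))
print(summary)
```

A tally dominated by `assertion_error` rather than `syntax_error` would match the observation above: the generated code parses and runs but returns wrong answers.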

ChiYeungLaw commented 10 months ago

You can follow this tutorial to reproduce the performance step by step. If you follow it, you should get exactly the same score.

ChiYeungLaw commented 10 months ago

One important point: do not use a transformers version later than 4.32.0.
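A quick way to guard against this before running the evaluation is to check the installed version up front. The snippet below is a minimal sketch using only the standard library; `parse_release`, `within_limit`, and `check_installed` are hypothetical helper names, and the 4.32.0 bound is taken from the comment above rather than from any official compatibility statement.

```python
from importlib.metadata import PackageNotFoundError, version

MAX_TRANSFORMERS = (4, 32, 0)  # upper bound quoted in this thread


def parse_release(version_str):
    """Turn '4.31.0' or '4.33.0.dev0' into a tuple of its leading integer parts."""
    parts = []
    for piece in version_str.split("."):
        if not piece.isdigit():
            break  # stop at suffixes such as 'dev0' or '0rc1'
        parts.append(int(piece))
    return tuple(parts)


def within_limit(version_str, limit=MAX_TRANSFORMERS):
    """Return True if version_str is not newer than limit."""
    release = parse_release(version_str)
    # Pad both tuples to the same length so '4.32' compares equal to '4.32.0'.
    width = max(len(release), len(limit))
    release = release + (0,) * (width - len(release))
    limit = tuple(limit) + (0,) * (width - len(limit))
    return release <= limit


def check_installed():
    """Describe whether the installed transformers package respects the bound."""
    try:
        installed = version("transformers")
    except PackageNotFoundError:
        return "transformers is not installed"
    if within_limit(installed):
        return f"transformers {installed} is within the limit"
    return f"transformers {installed} is newer than 4.32.0"


if __name__ == "__main__":
    print(check_installed())
```

If the check reports a newer version, pinning the package (for example `pip install "transformers<=4.32.0"`) before rerunning the evaluation is the straightforward fix.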

NinedayWang commented 10 months ago

@yingfhu Hey, I faced the same problem. Mind sharing how you managed to solve it?

ChiYeungLaw commented 10 months ago

@NinedayWang Check the tutorial mentioned in my earlier comment. Following it step by step should reproduce the score.

nicoladainese96 commented 10 months ago

Sorry, what's the deal with transformers > 4.32.0? I believe the performance discrepancy between the WizardCoder series based on StarCoder and the one based on Llama comes from how the base model handles padding. Could that be the case?