nlpxucan / WizardLM

LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath

pass@1 on mbpp #122

Open chenkehua opened 1 year ago

chenkehua commented 1 year ago

The reproduced pass@1 result of StarCoder on the MBPP dataset is 43.6*, which differs from the reported result of 52.7 in the paper. Can you explain that?

ChiYeungLaw commented 1 year ago


The 43.6 score was obtained on Google's MBPP with 500 problems. Our WizardCoder is also evaluated on the same data. The 52.7 score was obtained on MultiPL-E's MBPP (397 problems).
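
For readers comparing the two numbers: pass@1 is an average over the problems in the benchmark, so a score on a 500-problem set and a score on a 397-problem set are not directly comparable. Below is a minimal sketch using the standard unbiased pass@k estimator. The function names and the sample counts are illustrative assumptions, not the authors' evaluation code or real results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int = 1) -> float:
    """Average pass@k over all problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical toy data: 4 problems, 1 sample each, 2 solved.
toy_results = [(1, 1), (1, 0), (1, 1), (1, 0)]
print(benchmark_pass_at_k(toy_results, k=1))  # 0.5
```

With k = 1 the estimator reduces to c / n per problem, so the benchmark score is the mean fraction of correct samples over whichever problem set is used.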

weiliang-zeng commented 1 year ago

Thanks for the clarification here. Very helpful! @ChiYeungLaw, could you explain a bit more about how you got the 43.6 for StarCoder? Is that based on the Eval Harness or on mbpp_gen.py in your repo? Could you provide the command line for reproduction purposes?

ChiYeungLaw commented 1 year ago

We follow the same prompt as Eval Harness to evaluate StarCoder on MBPP.

ammuntasirrahman commented 1 year ago

Is it `prompt = f'"""\n{description}\n{test_example}\n"""\n'` or do you include the code_solution?
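
For concreteness, here is a small sketch of what the f-string quoted above produces. Whether this is the exact prompt the authors used is not confirmed in this thread. The helper name `build_prompt` and the sample record are hypothetical, and the field names `text` and `test_list` are assumed to follow the MBPP dataset layout.

```python
def build_prompt(description: str, test_example: str) -> str:
    """Wrap the task description and one test case in a docstring-style prompt."""
    return f'"""\n{description}\n{test_example}\n"""\n'


# Hypothetical MBPP-style record, for illustration only.
record = {
    "text": "Write a function to add two numbers.",
    "test_list": ["assert add(1, 2) == 3"],
}

prompt = build_prompt(record["text"], record["test_list"][0])
print(prompt)
```

In this form the model sees only the description and a single test case; the reference solution is not part of the prompt.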

haorannlp commented 1 year ago

@ChiYeungLaw May I ask why you replace 4 space characters with a tab before generation?
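
For reference, the preprocessing step I mean can be sketched as follows. This is only an illustration of the transformation, with a hypothetical helper name, not the repo's actual code.

```python
def spaces_to_tabs(prompt: str) -> str:
    """Replace every run of 4 consecutive spaces with a single tab character."""
    return prompt.replace("    ", "\t")


snippet = "def f():\n    return 1\n"
print(repr(spaces_to_tabs(snippet)))  # 'def f():\n\treturn 1\n'
```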