neulab / gemini-benchmark


Code Generation Evals should parse code from LM response #29

Closed: manishshettym closed this issue 3 months ago

manishshettym commented 9 months ago

Hi! Thanks for this great and extensive benchmarking work. I was looking at Section 6 in the paper and found this graph intriguing, since it shows that gpt3.5-turbo is better than gpt4-turbo on the ODEX benchmark. It was even more interesting because this is on slices of the dataset that include very popular libraries: pandas, numpy, etc.

[Figure: per-library ODEX results from Section 6, with gpt3.5-turbo scoring above gpt4-turbo]

Issue: Code not extracted from LM response

From the zenoml dashboard, it looks like the code from gpt4-turbo might be correct (in fact, the same as 3.5-turbo's), but it failed because there is an NL description after the code, causing a syntax error.

[Screenshot: Zeno dashboard showing a gpt4-turbo response where a natural-language explanation follows the code]

Question: Should we be prompting better (e.g., with code delimiters ```) and parsing the code from the response before execution and evaluation?
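For illustration, a minimal post-processing sketch, assuming the model wraps its answer in triple-backtick fences (the function name and regex here are just a sketch, not the benchmark's actual code):

```python
import re

def extract_code(response: str) -> str:
    """Pull the first fenced code block out of an LM response."""
    # Match ```python ... ``` or plain ``` ... ``` fences.
    match = re.search(r"```(?:python)?\n(.*?)```", response, re.DOTALL)
    if match:
        return match.group(1).strip()
    # Fall back to the raw text so plain-code replies still work.
    return response.strip()
```

Something like this would strip trailing natural-language explanations like the one in the screenshot, while leaving fence-free replies untouched.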

neubig commented 9 months ago

Hi @manishshettym , thanks for pointing this out! We did some amount of prompt engineering and post-processing work, but would love to do better. We'll take a look at this ourselves, but would also welcome a PR on the code.

@yuzc19 worked on that part, any comments?

yuzc19 commented 9 months ago

Hi @manishshettym, thanks for your insightful comments. We will try to parse the code more robustly and examine better prompts. Since we evaluate code generation in a zero-shot fashion, we cannot guarantee that every model will reply in the desired format, but we will try our best to mitigate this issue :). It is still interesting to see that gpt3.5-turbo can follow our instructions more faithfully in some cases.

Naman-ntc commented 9 months ago

From personal experience, generating code in the instruct format (the prompt used in this paper) works very well for most (API) models. On some internal evals on fresh LeetCode-style problems we do find similar performance trends, so I imagine only the absolute numbers would vary.

Alternatively, EvalPlus has also evaluated these models and provides the prompts it used for HumanEval/HumanEval+.
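For reference, an instruct-style setup usually boils down to something like the sketch below (the exact wording used in the paper and by EvalPlus differs; the field and function names here are illustrative):

```python
def build_instruct_prompt(problem: dict) -> list[dict]:
    """Wrap a HumanEval-style problem (whose `prompt` field holds the
    function signature and docstring) into chat messages."""
    system = (
        "You are a helpful programming assistant. "
        "Reply with a single Python code block and no extra explanation."
    )
    user = (
        "Complete the following function:\n\n"
        f"```python\n{problem['prompt']}\n```"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]
```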

yuzc19 commented 9 months ago

Perfect! Thanks, @Naman-ntc, I will try it.

DeepCreeper commented 3 months ago

> From personal experience, generating code in the instruct format (the prompt used in this paper) works very well for most (API) models. On some internal evals on fresh LeetCode-style problems we do find similar performance trends, so I imagine only the absolute numbers would vary.
>
> Alternatively, EvalPlus has also evaluated these models and provides the prompts it used for HumanEval/HumanEval+.

I'm a newbie, so I have some questions about the details of evaluating the code generation capabilities of LLMs on datasets like HumanEval. Are additional prompts allowed when we call the API? I haven't found this described anywhere. It seems that HumanEval/HumanEval+ requires us to implement a function like GEN_SOLUTION ourselves. Please let me know if my understanding is wrong. Looking forward to your reply.

yuzc19 commented 3 months ago

Hi @DeepCreeper,

Yes, we need to add both system and user prompts when we call the API for optimal performance.
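As a rough sketch of what a GEN_SOLUTION-style function could look like with the OpenAI Python SDK (the model name, prompt wording, and sampling settings here are placeholders, not what the benchmark uses):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def gen_solution(problem_prompt: str, model: str = "gpt-4-turbo") -> str:
    """Query the chat API with a system + user prompt and return the reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "You are a careful Python programmer. "
                        "Answer with a single fenced code block."},
            {"role": "user",
             "content": f"Complete the following function:\n\n{problem_prompt}"},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content
```

The returned text still needs the code-extraction step discussed earlier in this thread before it is executed against the tests.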

You can find more information in this repo (and also in Appendix G of their paper). Let me know if you have further questions.