Closed: manishshettym closed this issue 3 months ago.
Hi @manishshettym , thanks for pointing this out! We did some amount of prompt engineering and post-processing work, but would love to do better. We'll take a look at this ourselves, but would also welcome a PR on the code.
@yuzc19 worked on that part, any comments?
Hi @manishshettym, thanks for your insightful comments. We will try to parse the code better and experiment with better prompts. Since we evaluate code generation in a zero-shot fashion, we cannot guarantee that every model replies in the desired format, but we will do our best to mitigate this issue :). It is still interesting to see that gpt3.5-turbo follows our instructions more faithfully in some cases.
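One possible mitigation for the failure mode described in the issue (prose after the code causing a syntax error) would be to drop trailing lines until the remainder parses. This is just a sketch of one such heuristic, not the repo's actual post-processing:

```python
import ast

def strip_trailing_prose(text: str) -> str:
    """Drop trailing lines until the remainder parses as Python.

    Heuristic fallback for models that append a natural-language
    explanation after the code. Note: prose that happens to be valid
    Python (e.g. a bare word) would survive, so this is best combined
    with fence-based extraction.
    """
    lines = text.splitlines()
    while lines:
        candidate = "\n".join(lines)
        try:
            ast.parse(candidate)
            return candidate
        except SyntaxError:
            lines.pop()
    return ""
```

For example, `strip_trailing_prose("def f(x):\n    return x * 2\nThis doubles x.")` keeps only the function body.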
From personal experience, generating code in the instruct format (the prompt used in this paper) works very well for most (API) models. On some internal evals on fresh LeetCode-style problems we find similar performance trends, so I imagine only the absolute numbers would vary.
Alternatively, EvalPlus has also evaluated the models and provides the prompts used for HumanEval/HumanEval+.
Perfect! Thanks, @Naman-ntc I will try it.
I'm a newbie, so I have some questions about the details of how to evaluate the code generation capabilities of LLMs with datasets like HumanEval. Are additional prompts allowed when we call the API? I haven't found anyone describing this. It seems that HumanEval/HumanEval+ requires us to implement a function like GEN_SOLUTION ourselves. Please let me know if my understanding is wrong. Looking forward to your reply.
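Your understanding is right in spirit: EvalPlus hands you each problem's `prompt` (the function signature plus docstring) and expects you to supply the model call yourself. A minimal sketch of that shape, with the model call and the problem dict stubbed out (the real loop would use EvalPlus's `get_human_eval_plus()` and `write_jsonl`):

```python
# Sketch of the GEN_SOLUTION hook EvalPlus asks you to supply.
# The model call is a stand-in -- swap it for your actual API client.

def gen_solution(prompt: str) -> str:
    """Given a HumanEval-style prompt (signature + docstring),
    return the full solution source."""
    # In a real harness, send `prompt` to the model here, e.g. via
    # an OpenAI/Anthropic client, then extract the code from the reply.
    # We hard-code a completion purely for illustration.
    return prompt + "    return sorted(numbers)\n"

# Stand-in for evalplus.data.get_human_eval_plus():
problems = {"HumanEval/0": {"prompt": "def sort_numbers(numbers):\n"}}

samples = [
    {"task_id": tid, "solution": gen_solution(p["prompt"])}
    for tid, p in problems.items()
]
# In the real harness you would then call write_jsonl("samples.jsonl", samples)
# and run the evalplus evaluation on that file.
```

Any additional system prompt or instruction wrapping is up to you; just report it alongside your numbers so results stay comparable.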
Hi! Thanks for this great and extensive benchmarking work. I was looking at Section 6 in the paper and found this graph intriguing, since it states that `gpt3.5-turbo` is better than `gpt4-turbo` on the ODEX benchmark. It was even more interesting because this is on slices of the dataset that include very popular libraries: pandas, numpy, etc.

**Issue: Code not extracted from LM response**

From the zenoml dashboard, it looks like the code from `gpt4-turbo` might be correct (in fact the same as `gpt3.5-turbo`'s) but failed because there is an NL description after the code, causing a syntax error.

**Question:** Should we be prompting better (e.g., with code delimiters such as triple backticks) and parsing the code from the response before execution and evaluation?
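For what it's worth, a minimal version of the parsing step suggested above could pull the first fenced block out of the response before execution. This is only a sketch (the regex and the raw-text fallback are my assumptions, not the paper's actual pipeline):

```python
import re

FENCE = "`" * 3  # triple backtick, built programmatically for readability

def extract_code(response: str) -> str:
    """Return the first fenced code block in an LM response,
    or the raw text unchanged if no fences are found."""
    pattern = FENCE + r"(?:python)?\n(.*?)" + FENCE
    match = re.search(pattern, response, re.DOTALL)
    return match.group(1) if match else response
```

So a response like "Here is the solution: \<fenced code\> This adds two numbers." would be reduced to just the code, and the trailing NL description would no longer trigger a syntax error.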