Quick update on this. I independently generated text from code-davinci-002 with the same hyperparameters (`p=0.95`, `t=0.8`, `n=200`). The results I got were better than the data available in this repository and similar to those in the paper.
`{'pass@1': 0.20344512195121955, 'pass@10': 0.6839959515336559, 'pass@100': 0.91049547719406}`
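In case it helps anyone replicating this, here is a minimal sketch of the sampling setup using the legacy (pre-1.0) `openai` Python client; the `max_tokens` value, stop sequences, and batching below are my assumptions, not necessarily the exact configuration used in the paper:

```python
import openai  # legacy (<1.0) OpenAI Python client

def sample_completions(prompt: str, n_total: int = 200, batch: int = 20):
    """Draw n_total completions for one HumanEval prompt with
    top_p=0.95 and temperature=0.8, batching requests to stay
    under API limits. max_tokens and the stop sequences are
    assumptions, not a confirmed setup.
    """
    completions = []
    for _ in range(n_total // batch):
        response = openai.Completion.create(
            engine="code-davinci-002",
            prompt=prompt,
            temperature=0.8,
            top_p=0.95,
            n=batch,
            max_tokens=300,
            stop=["\nclass", "\ndef", "\n#", "\nprint"],
        )
        completions.extend(choice["text"] for choice in response["choices"])
    return completions
```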
If I use a simple heuristic to truncate incomplete lines and lines ending with a `:`, this rises to
`{'pass@1': 0.23917682926829273, 'pass@10': 0.7161137949932483, 'pass@100': 0.9262563823165132}`
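A minimal sketch of that kind of heuristic (the exact rules here are illustrative, not necessarily the precise ones used):

```python
def truncate_heuristic(completion: str) -> str:
    """Drop a trailing line cut off mid-generation, then drop trailing
    lines ending with ':' (which open a block with no body)."""
    # No trailing newline usually means the last line was truncated
    # by the token limit, so discard it.
    if completion and not completion.endswith("\n"):
        completion = completion[: completion.rfind("\n") + 1]
    lines = completion.rstrip("\n").split("\n")
    while lines and lines[-1].rstrip().endswith(":"):
        lines.pop()
    return "\n".join(lines)
```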
Finally, truncating the code until it no longer raises errors under `ast.parse` improves performance to
`{'pass@1': 0.3730487804878049, 'pass@10': 0.793568620027091, 'pass@100': 0.9432289234601097}`
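A rough sketch of the `ast.parse` truncation (this assumes the prompt and completion are concatenated before parsing, since completions alone are typically just function bodies):

```python
import ast

def truncate_until_parses(prompt: str, completion: str) -> str:
    """Drop trailing lines from the completion until prompt + completion
    is syntactically valid Python."""
    lines = completion.split("\n")
    while lines:
        candidate = "\n".join(lines)
        try:
            ast.parse(prompt + candidate)
            return candidate
        except SyntaxError:  # includes IndentationError
            lines.pop()
    return ""
```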
Dear authors,

Thank you for releasing the data accompanying CodeT. However, I am having some trouble reproducing the Codex numbers reported in the paper. The paper reports 47.0 pass@1, 74.9 pass@10, and 92.1 pass@100 for Codex. I tried to compute these scores using the data provided in this repository.
First, I converted it to a format compatible with the official OpenAI HumanEval script.
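A sketch of such a conversion (the input field name `samples` is an assumption about this repository's data format; the output records with `task_id` and `completion`, one per sample, are what the official script's `evaluate_functional_correctness` entry point expects):

```python
import json

def convert(codet_file: str, out_file: str) -> None:
    """Flatten per-task sample lists into one JSONL record per sample,
    the format consumed by the official human-eval script."""
    with open(codet_file) as f, open(out_file, "w") as out:
        for line in f:
            record = json.loads(line)
            for completion in record["samples"]:  # assumed field name
                out.write(json.dumps({
                    "task_id": record["task_id"],
                    "completion": completion,
                }) + "\n")
```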
After running this through the OpenAI script, I get `{'pass@1': 0.1653, 'pass@10': 0.6257, 'pass@100': 0.8537}`. What could I be doing incorrectly here?

Also, is there any reason to choose `n=100` instead of `n=200` like Chen et al. 2021 and the CodeGen paper? I'm guessing the pass@100 estimate will be more accurate with `n=200`?
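For reference, this is the unbiased pass@k estimator from Chen et al. 2021, which the official HumanEval script implements:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from Chen et al. 2021, where n is the
    number of samples drawn, c the number that pass, and k the cutoff.
    Computes 1 - C(n-c, k) / C(n, k) in a numerically stable way."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

With `n=100`, the pass@100 estimate degenerates to a per-problem 0/1 indicator (did any of the 100 samples pass), whereas `n=200` averages over many 100-sample subsets, so the estimate should indeed have lower variance.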
Thank you!
Kalpesh