openai / human-eval

Code for the paper "Evaluating Large Language Models Trained on Code"
MIT License

pass@k on filtered samples #13

Open · henryhungle opened this issue 2 years ago

henryhungle commented 2 years ago

Hi,

Thank you for the great work!

I have two questions about how the pass@k metric is computed after filtering samples on the APPS benchmark.

  1. Does the `total` array in the code snippet below contain, for each problem, the number of filtered samples, i.e. the count of samples that passed the example test cases given in the problem statement, so that each entry is <= N_original_samples (= 1000)? https://github.com/openai/human-eval/blob/312c5e5532f0e0470bf47f77a6243e02a61da530/human_eval/evaluation.py#L85

  2. When the number of filtered samples for a problem is smaller than k (k ∈ {1, 5}), how do you compute pass@k for that problem? For example, when N_filtered_samples = 1 and k = 5, can we treat the problem as having 4 failures plus 1 pass/failure (depending on whether the single filtered sample passes the final unit tests)? My current reading of the estimator is sketched below.
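
For context, here is my reading of the pass@k estimator in `human_eval/evaluation.py`, paraphrased from the repo (the linked file is authoritative, so please correct me if this sketch is off):

```python
import itertools
from typing import List, Union

import numpy as np


def estimate_pass_at_k(
    num_samples: Union[int, List[int], np.ndarray],
    num_correct: Union[List[int], np.ndarray],
    k: int,
) -> np.ndarray:
    """Estimate pass@k for each problem and return the estimates as an array."""

    def estimator(n: int, c: int, k: int) -> float:
        # Unbiased estimator 1 - C(n - c, k) / C(n, k): the probability that
        # a random size-k subset of the n samples contains at least one of
        # the c correct samples.
        if n - c < k:
            # Fewer than k incorrect samples: every size-k draw must
            # include at least one correct sample.
            return 1.0
        return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

    # Accept either a single sample count shared by all problems,
    # or one count per problem.
    if isinstance(num_samples, int):
        num_samples_it = itertools.repeat(num_samples, len(num_correct))
    else:
        assert len(num_samples) == len(num_correct)
        num_samples_it = iter(num_samples)

    return np.array(
        [estimator(int(n), int(c), k) for n, c in zip(num_samples_it, num_correct)]
    )
```

If I follow the surrounding code correctly, pass@k is only reported when `(total >= k).all()`, i.e. when every problem has at least k samples, so the N_filtered_samples < k case above would cause the metric to be skipped entirely rather than padded with assumed failures. Is that the intended behavior for the filtered setting?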