hendrycks / apps

APPS: Automated Programming Progress Standard (NeurIPS 2021)
MIT License

Problem in ground-truth solutions #29

Closed: nicoladainese96 closed this issue 5 months ago

nicoladainese96 commented 7 months ago

Hi, I'm encountering a problem when evaluating the ground-truth solutions. As a preliminary check for a pipeline in which I want to process the whole APPS benchmark with an LLM, I'm just taking one random solution from the available ones for each problem, or an empty solution if none are provided (a rough sketch of this selection step is at the end of this comment). For the competition problems in the test split, only 311 out of 1000 problems have solutions, so I would expect a strict accuracy of 31.1%, given that the solutions for the other 689 problems are left empty. However, I get the following results:

Test Case Average (average accuracy over problems) = 0.27318586602648753
Strict Accuracy (all test cases passed / total problems) = 0.263
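For context, this is how I understand the two metrics (a schematic I wrote for myself, not the actual evaluation code): a problem that passes only some of its test cases raises the test-case average but contributes nothing to strict accuracy.

```python
from typing import List

def summarize(results: List[List[bool]]):
    # results[i][j] is True if problem i passed test case j
    per_problem_acc = [sum(r) / len(r) if r else 0.0 for r in results]
    test_case_average = sum(per_problem_acc) / len(results)
    strict_accuracy = sum(len(r) > 0 and all(r) for r in results) / len(results)
    return test_case_average, strict_accuracy

# A problem passing 3 of 4 tests lifts the test-case average
# but does not count towards strict accuracy.
print(summarize([[True, True, True, False], [True, True]]))  # (0.875, 0.5)
```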

Here's a screenshot of the last part of the evaluation script. Is it possible that certain solutions are only partially correct?

Thank you in advance for any help!

[Screenshot of the last part of the evaluation script, 2024-04-29]
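For reference, here is a rough sketch of the selection step described above (schematic only; the dataset id, field names, and the difficulty filter assume the codeparrot/apps mirror on the Hugging Face Hub, and my actual pipeline differs in the details):

```python
import json
import random

from datasets import load_dataset

# Test split of APPS; "difficulty", "solutions", etc. are the field names
# used on the codeparrot/apps dataset card (an assumption on my side).
ds = load_dataset("codeparrot/apps", split="test")
competition = [p for p in ds if p["difficulty"] == "competition"]

selected = []
for problem in competition:
    # "solutions" is a JSON-encoded list and may be empty.
    solutions = json.loads(problem["solutions"]) if problem["solutions"] else []
    # One random ground-truth solution if any exist, otherwise an empty program.
    selected.append(random.choice(solutions) if solutions else "")

print(len(competition), sum(1 for s in selected if s))  # 1000 problems, 311 with a solution
```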

xksteven commented 5 months ago

> For the competition problems in the test split, only 311 out of 1000 problems have solutions,

All of the problems do have ground-truth test cases; those "answers" are stored in the (unfortunately named) input_output field rather than in solutions. So it is possible to get 100% on both the test and train splits even for problems whose solutions list is empty. Looking at the dataset on the Hugging Face Hub might provide some clarity.
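In case it helps, a quick way to check this (assuming the codeparrot/apps mirror on the Hub, where both fields are JSON-encoded strings):

```python
import json
from datasets import load_dataset

ds = load_dataset("codeparrot/apps", split="test")
problem = ds[0]

tests = json.loads(problem["input_output"]) if problem["input_output"] else {}
solutions = json.loads(problem["solutions"]) if problem["solutions"] else []

# The ground-truth test cases live in input_output even when solutions is empty.
print(sorted(tests.keys()))  # typically ['inputs', 'outputs'], plus 'fn_name' for call-based problems
print(len(solutions))        # can be 0 even though the problem is fully testable
```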

xksteven commented 5 months ago

Feel free to reopen the issue though. If I have time, I might reinvestigate the solution parser, as mentioned in issue #27.