Closed: nicoladainese96 closed this issue 5 months ago.
> For the competition problems, test split, out of 1000 problems, only 311 have solutions
So all of the test cases do have expected "answers"; they are just (unfortunately) stored in a field named input_output. It is therefore possible to get 100% on both the test and train splits. Browsing the dataset on the HF hub might provide some clarity.
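For example, here's a quick way to check this on the HF hub (a minimal sketch, assuming the codeparrot/apps mirror, where `solutions` and `input_output` are JSON-encoded string columns and `difficulty` is a plain string column; the `has_content` helper is illustrative):

```python
import json
from datasets import load_dataset

# Sketch assumptions: the codeparrot/apps mirror on the HF hub, where
# `solutions` and `input_output` are JSON-encoded strings (empty string
# when absent) and `difficulty` is a plain string column. Script-based
# datasets may additionally need trust_remote_code=True.
ds = load_dataset("codeparrot/apps", split="test")
competition = [row for row in ds if row["difficulty"] == "competition"]

def has_content(field):
    """True if a JSON-encoded field decodes to a non-empty list/dict."""
    return bool(json.loads(field)) if field else False

n_solutions = sum(has_content(row["solutions"]) for row in competition)
n_tests = sum(has_content(row["input_output"]) for row in competition)
print(f"{n_solutions}/{len(competition)} competition problems have reference solutions")
print(f"{n_tests}/{len(competition)} have input/output test cases")
```

This makes it easy to confirm that problems without reference solutions still carry input/output test cases.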
Feel free to reopen the issue, though. If I have time, I might reinvestigate the solution parser, as mentioned in issue #27.
Hi, I'm running into a problem when evaluating the solutions. For a preliminary pipeline in which I want to process the whole APPS benchmark with an LLM, I simply take one random solution from the available ones when any are present, and otherwise use an empty solution (a sketch of this selection step is included below). For the competition problems in the test split, only 311 out of 1000 problems have solutions, so I should get a strict accuracy of 31.1%, given that the solutions for the other 689 problems are left empty. However, I get the following results:
```
Test Case Average (average accuracy over problems) = 0.27318586602648753
Strict Accuracy (all test cases passed / total problems) = 0.263
```
Above is the last part of the evaluation script's output. Is it possible that certain solutions are only partially correct?
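For reference, here is a minimal sketch of the selection step I described (the `pick_solution` helper and the fixed seed are illustrative, not my exact pipeline; it assumes `solutions` is a JSON-encoded list of strings):

```python
import json
import random

rng = random.Random(0)  # fixed seed, purely illustrative

def pick_solution(row):
    """Return one random reference solution, or an empty program if none.

    Assumes `solutions` is a JSON-encoded list of strings, possibly empty.
    """
    solutions = json.loads(row["solutions"]) if row["solutions"] else []
    return rng.choice(solutions) if solutions else ""
```

If every sampled solution passed all of its tests, strict accuracy would be exactly 311/1000 = 0.311.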
Thank you in advance for any help!