Closed · loubnabnl closed this 2 years ago
Btw, the HF APPS metric is ready for use: https://huggingface.co/spaces/codeparrot/apps_metric 🚀 Below are the scores we get using it on your finetuned GPT-2 1.5B model with top-p sampling of 0.95 and temperature 0.2. You can now also compute the pass@k metrics on APPS when there are multiple generations per problem.
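For reference, pass@k is usually computed with the unbiased estimator from the Codex paper. A minimal sketch of that estimator (this is not necessarily the metric's exact implementation):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Codex paper):
    n = total generations per problem, c = generations that pass all tests."""
    if n - c < k:
        # Fewer failures than k samples: at least one correct sample is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations per problem, 3 of which pass all test cases:
print(round(pass_at_k(10, 3, 1), 4))  # 0.3
```

The per-problem values are then averaged over the dataset to get the final pass@k score.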
Thanks so much!
Hi, this fixes the memory leak issue mentioned in #13 and adds HumanEval's reliability guard to limit the harm that can be caused by running untrusted code.
I also fixed the issue raised in #14 where a NaN score is returned: when this exception is raised https://github.com/hendrycks/apps/blob/1b052764e10804ae79cf12c24801aaa818ea36ab/eval/testing_util.py#L229, the `results` list returned is empty, so the mean computed over it is NaN. We can treat this as a compilation error and fill the results list with the -2 value: https://github.com/hendrycks/apps/blob/1b052764e10804ae79cf12c24801aaa818ea36ab/eval/test_one_solution.py#L39

Please note that I've only run the tests with this new timeout setup on the Hugging Face APPS metric, and it didn't seem to affect performance. I currently have some software issues reading the standard data folder, so it would be great if you could test the new changes further with this repo, especially since I changed what's returned in the case of a global timeout. I think it should be
[-1] * number_test_cases
as in https://huggingface.co/spaces/codeparrot/apps_metric/blob/main/utils.py#L29, but I used 21, the average number of test cases for the test set, since it's not straightforward to access the number of tests in this setting.