hendrycks / apps

APPS: Automated Programming Progress Standard (NeurIPS 2021)

Fix memory leak and add reliability guard #16

loubnabnl closed this pull request 2 years ago

loubnabnl commented 2 years ago

Hi, this fixes the memory leak issue mentioned in #13 and adds HumanEval's reliability guard to limit the harm that can be caused by running untrusted code.
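For reference, a minimal sketch of what a reliability guard in the spirit of HumanEval's execution harness looks like (illustrative only, not the exact code added in this PR): cap the process's memory and disable a few destructive calls before executing untrusted generated code.

```python
import builtins
import os
import resource
import shutil


def reliability_guard(max_memory_bytes=4 * 1024**3):
    # Cap the address space so a runaway solution cannot exhaust host memory.
    resource.setrlimit(resource.RLIMIT_AS, (max_memory_bytes, max_memory_bytes))
    resource.setrlimit(resource.RLIMIT_DATA, (max_memory_bytes, max_memory_bytes))

    # Disable calls that could delete files or spawn/kill processes.
    os.system = None
    os.remove = None
    os.rmdir = None
    os.kill = None
    shutil.rmtree = None
    builtins.exit = None
    builtins.quit = None
```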

I also fixed the issue raised in #14 where a NaN score is returned. It happens because when this exception is raised https://github.com/hendrycks/apps/blob/1b052764e10804ae79cf12c24801aaa818ea36ab/eval/testing_util.py#L229 the returned results list is empty, so the mean computed over it is NaN. We can treat this case as a compilation error and fill the results list with the value -2: https://github.com/hendrycks/apps/blob/1b052764e10804ae79cf12c24801aaa818ea36ab/eval/test_one_solution.py#L39
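A minimal sketch of that behaviour, assuming a `run_test(problem, generation)` callable like the one in `testing_util.py` (the names here are illustrative, not the exact code in `test_one_solution.py`):

```python
def evaluate_generation(run_test, problem, generation):
    """Run one generation against a problem's test cases.

    An empty result list (e.g. the exception above was raised) is recorded
    as a compilation error, so the mean over the results is well defined
    instead of NaN.
    """
    try:
        curr_res = run_test(problem, generation)
    except Exception:
        curr_res = []
    if not curr_res:
        curr_res = [-2]  # -2 is the APPS sentinel for a compilation error
    return curr_res
```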

Please note that I've only run the tests with this new timeout setup on the Hugging Face APPS metric, and it didn't seem to affect performance. I currently have some software issues reading the standard data folder, so it would be great if you could test the new changes more thoroughly with this repo, especially since I changed what's returned in case of a global timeout. I think it should be [-1] * number_test_cases, as done here https://huggingface.co/spaces/codeparrot/apps_metric/blob/main/utils.py#L29, but I used 21, the average number of test cases on the test set, since it's not straightforward to access the number of tests in this setting.
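A hedged sketch of the global-timeout handling described above (not the exact repo code, and the timeout value is an assumption): if evaluating one solution exceeds the global timeout, return one -1 per test case, falling back to the test-set average of 21 when the real count isn't available.

```python
import signal

GLOBAL_TIMEOUT = 10      # seconds; illustrative value
AVG_NUMBER_TESTS = 21    # average number of test cases on the APPS test set


class GlobalTimeout(Exception):
    pass


def _alarm_handler(signum, frame):
    raise GlobalTimeout()


def run_with_global_timeout(run_test, problem, generation, n_test_cases=None):
    """Run the tests under a wall-clock limit; on timeout, mark every
    test case as failed with -1."""
    signal.signal(signal.SIGALRM, _alarm_handler)
    signal.alarm(GLOBAL_TIMEOUT)
    try:
        return run_test(problem, generation)
    except GlobalTimeout:
        n = n_test_cases if n_test_cases is not None else AVG_NUMBER_TESTS
        return [-1] * n
    finally:
        signal.alarm(0)  # always cancel the pending alarm
```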

loubnabnl commented 2 years ago

By the way, the HF APPS metric is ready for use: https://huggingface.co/spaces/codeparrot/apps_metric 🚀 Below are the scores we get using it on your fine-tuned GPT-2 1.5B model with top-p sampling of 0.95 and temperature 0.2.

[image: table of APPS scores]

You can now also compute the pass@k metrics on APPS when there are multiple generations per problem.
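A hedged usage sketch, assuming the metric follows the standard `evaluate` interface; the keyword arguments below (`k_list`, `level`) are assumptions based on the space, not verified signatures.

```python
import evaluate

# Load the APPS metric from the Hugging Face Hub space.
apps_metric = evaluate.load("codeparrot/apps_metric")

# One list of candidate solutions (code strings) per APPS problem.
generations = [["def solve():\n    print('hello')"]]

# Compute pass@k (here pass@1) on the chosen difficulty level.
results = apps_metric.compute(predictions=generations, k_list=[1], level="all")
print(results)
```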

xksteven commented 2 years ago

Thanks so much!