hendrycks / apps

APPS: Automated Programming Progress Standard (NeurIPS 2021)
MIT License

check5 in function "run_test" seems to produce some wrong results #27

Open Zyq-scut opened 1 year ago

Zyq-scut commented 1 year ago

Hi, thanks for your work. I don't quite understand the role of check5 in the evaluation process; it seems to produce some wrong results. Here is an example from test problem 4496.

The question is: [image]

My program is: [image]

When I pass 22 into the program, the expected output is “Christmas Eve Eve Eve”, but my program returns “Christmas Eve”. This is clearly a wrong answer, yet check5 in the “run_test” function judges the result as correct. [image]

Is this a bug? Looking forward to your reply.

henryhungle commented 1 year ago

I also noticed a few samples with false-positive results due to check5. For example, a generated program that outputs ['NO', 'YES', 'YES'] against a ground-truth output of 'YES\nNO\nYES\n' still passes after check5 (before check5, it was correctly identified as a failure).

The issue seems to be the transformation to set data type: https://github.com/hendrycks/apps/blob/23e98569b871e67752422717d1e28e9193248194/eval/testing_util.py#L448

check6 might have a similar problem, but I haven't seen any such false-positive cases yet.
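
A minimal sketch of how a set-based comparison could produce both false positives reported in this thread. It is not the code from testing_util.py: the helper names `compare_as_sets` and `compare_exact` are hypothetical, and the sketch assumes each output line is split on whitespace before being converted to a set.

```python
def compare_as_sets(output_lines, expected_lines):
    """Hypothetical stand-in for a set-based check.

    Each line becomes a frozenset of its whitespace-separated tokens, and
    the lines are collected into a set, so both ordering and repetition
    are discarded before the comparison.
    """
    out = set(frozenset(line.split()) for line in output_lines)
    exp = set(frozenset(line.split()) for line in expected_lines)
    return out == exp


def compare_exact(output_lines, expected_lines):
    """Order- and count-preserving comparison of the same tokens."""
    out = [line.split() for line in output_lines]
    exp = [line.split() for line in expected_lines]
    return out == exp


# Case from this thread: same answers, wrong order.
generated = ["NO", "YES", "YES"]
ground_truth = "YES\nNO\nYES\n".splitlines()
print(compare_as_sets(generated, ground_truth))  # True (false positive)
print(compare_exact(generated, ground_truth))    # False

# Case from problem 4496: repeated words collapse inside a set.
print(compare_as_sets(["Christmas Eve"], ["Christmas Eve Eve Eve"]))  # True
print(compare_exact(["Christmas Eve"], ["Christmas Eve Eve Eve"]))    # False
```

Under these assumptions, any check that converts outputs to sets will accept answers that differ only in order or in how many times a token repeats, which matches both reports above.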

xksteven commented 1 year ago

I'm not sure when I'll have time to actively help debug this. I can certainly review PRs or assist others though.