nus-apr / auto-code-rover

A project-structure-aware autonomous software engineer aiming for autonomous program improvement. Resolved 30.67% of tasks (pass@1) in SWE-bench lite and 38.40% of tasks (pass@1) in SWE-bench verified, with each task costing less than $0.70.

Inconsistency between `eval_results` and log files #68

Closed: intellipy closed this issue 1 month ago

intellipy commented 1 month ago

Hi team,

I've noticed an inconsistency in the evaluation results and logs for resolved bugs.

In results/acr-run-1/new_eval_results, the resolved bugs listed in report.json and README.md don’t seem to match up with the logs in the logs/ folder. For example, the bug astropy__astropy-6938 is marked as resolved in report.json. However, when I checked the corresponding log file (results/acr-run-1/new_eval_results/logs/astropy__astropy-6938.gpt-4-0125-preview.eval.log), which is linked in README.md, it shows that some tests actually failed during evaluation.

This issue seems to occur across multiple settings, with similar inconsistencies showing up quite frequently.
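A minimal sketch of the kind of cross-check described above, in Python. The `resolved` key in report.json, the log filename pattern, and the assumption that a failing test leaves a `FAILED` line in the eval log are all guesses about the result layout, not verified against the actual files:

```python
# Hypothetical cross-check sketch -- field names and log format are assumptions.
import json
from pathlib import Path

results_dir = Path("results/acr-run-1/new_eval_results")
report = json.loads((results_dir / "report.json").read_text())

# Assumption: report.json exposes resolved instance IDs under a "resolved" key.
for instance_id in report.get("resolved", []):
    log_path = results_dir / "logs" / f"{instance_id}.gpt-4-0125-preview.eval.log"
    if not log_path.exists():
        continue
    # Assumption: a failing test shows up as a line containing "FAILED" in the log.
    if "FAILED" in log_path.read_text():
        print(f"{instance_id}: marked resolved, but the eval log contains FAILED lines")
```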

Could you take a look into this? Thanks!

yuntongzhang commented 1 month ago

Hi,

The logs/ and report.json are auto-generated by running the SWE-bench evaluation on the patches produced by AutoCodeRover. The SWE-bench evaluation checks whether a predefined list of tests passes; if all of the predefined tests pass, the issue is marked as resolved. The predefined tests may not include all the tests being executed, so it is normal for an issue to be considered resolved even when some tests are shown as failed in the test execution log, as long as those failing tests are not in the predefined list.

For example, for astropy-6938, the predefined list of tests consists of the FAIL_TO_PASS and PASS_TO_PASS tests in the metadata here. If all of these tests pass, the evaluation harness will consider the issue resolved.
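To illustrate the criterion in code, here is a small sketch of the resolution check described above. It is not the actual SWE-bench harness code; the function name, the status strings, and the test names are hypothetical:

```python
# Illustrative sketch of the "resolved" criterion -- names and statuses are
# hypothetical, not taken from the SWE-bench harness.
from typing import Dict, List

def is_resolved(test_status: Dict[str, str],
                fail_to_pass: List[str],
                pass_to_pass: List[str]) -> bool:
    """Return True if every predefined test passed, regardless of how any
    other test executed in the same run fared."""
    predefined = fail_to_pass + pass_to_pass
    return all(test_status.get(test) == "PASSED" for test in predefined)

# Example: a failure in a test outside the predefined lists is ignored.
statuses = {
    "test_a": "PASSED",   # in FAIL_TO_PASS
    "test_b": "PASSED",   # in PASS_TO_PASS
    "test_c": "FAILED",   # not predefined -> does not affect the verdict
}
print(is_resolved(statuses, ["test_a"], ["test_b"]))  # True
```

Under these assumptions, a log that shows `test_c` failing would still correspond to a "resolved" entry in report.json, which matches the behavior you observed.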

intellipy commented 1 month ago

Thanks, I have no more questions.