Hi,
The `logs/` and `report.json` are auto-generated by running the SWE-bench evaluation on the patches produced by AutoCodeRover. The SWE-bench evaluation checks whether a predefined list of tests passes, and if all of the predefined tests pass, the issue is marked as resolved. The predefined tests may not include all of the tests being executed, so it is normal for an issue to be considered resolved even when some tests are shown as failed in the test execution log.
For example, for astropy-6938, the predefined list of tests consists of the FAIL_TO_PASS and PASS_TO_PASS tests in the metadata here. If all of these tests pass, the evaluation harness considers the issue resolved.
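For illustration, here is a minimal sketch (not the actual SWE-bench harness code) of that resolution rule: only the predefined FAIL_TO_PASS and PASS_TO_PASS tests are checked, so failures in other tests do not affect the verdict. The test names and statuses below are hypothetical.

```python
def is_resolved(test_statuses: dict[str, str],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True if every predefined test passed after applying the patch.

    test_statuses maps a test id (as it appears in the eval log) to a status
    string such as "PASSED" or "FAILED". Tests outside the two predefined
    lists are simply ignored, even if they failed.
    """
    required = fail_to_pass + pass_to_pass
    return all(test_statuses.get(test) == "PASSED" for test in required)


# Hypothetical usage: an unrelated failing test does not change the verdict.
statuses = {
    "astropy/io/fits/tests/test_checksum.py::test_example": "PASSED",  # predefined
    "astropy/io/fits/tests/test_table.py::test_unrelated": "FAILED",   # not predefined
}
print(is_resolved(
    statuses,
    fail_to_pass=["astropy/io/fits/tests/test_checksum.py::test_example"],
    pass_to_pass=[],
))  # True
```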
Thanks, I have no more questions.
Hi team,
I've noticed an inconsistency in the evaluation results and logs for resolved bugs.
In `results/acr-run-1/new_eval_results`, the resolved bugs listed in `report.json` and `README.md` don't seem to match the logs in the `logs/` folder. For example, the bug `astropy__astropy-6938` is marked as resolved in `report.json`. However, when I checked the corresponding log file (`results/acr-run-1/new_eval_results/logs/astropy__astropy-6938.gpt-4-0125-preview.eval.log`), which is linked in `README.md`, it shows that some tests actually failed during evaluation. This seems to occur across multiple settings as well, with similar inconsistencies showing up quite frequently.
Could you take a look into this? Thanks!