Failing benchmark instances

aorwall commented 2 days ago

Great job with the new containerized evaluation tool! I've run it a couple of times on the golden patches on SWE-bench Lite and overall it gives a more stable result than my swe-bench-docker setup. There are a few instances that fail intermittently, though. Some I recognize from tests in swe-bench-docker, and some are new. None of them are failing in 100% of the runs.

Django instances

In all the failing Django instances I've checked, the tests seem to pass but are marked as failed because other log output is printed in between the test name and its result.

Here's an example of a test that is marked as failed:

test_annotation_with_nested_outerref (expressions.tests.BasicExpressionsTests) ... System check identified no issues (0 silenced).
ok

The same test in a successful test output log:

test_annotation_with_nested_outerref (expressions.tests.BasicExpressionsTests) ... ok
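To illustrate the failure mode, here is a minimal sketch of a parser that tolerates a log line landing between the test name and its status. It is not the SWE-bench log parser; `parse_django_log` and both regexes are hypothetical names made up for this example, and it only handles the case where the status ends up alone on a later line.

```python
import re

# Hypothetical helper, NOT the actual SWE-bench parser.
# Start of a verbose Django test line: "test_name (module.Class) ... <rest>"
TEST_START = re.compile(r"^(\S+ \([\w.]+\)) \.\.\. ?(.*)$")
# A status on its own: ok, FAIL, ERROR, or "skipped ..." with a reason.
STATUS = re.compile(r"^(ok|FAIL|ERROR|skipped\b.*)$")


def parse_django_log(log):
    """Map each test name to its status, tolerating interleaved log lines."""
    results = {}
    pending = None  # test whose status has not been seen yet
    for line in log.splitlines():
        start = TEST_START.match(line)
        if start:
            pending, rest = start.group(1), start.group(2).strip()
        elif pending is not None:
            # Continuation line: may carry the status of the pending test.
            rest = line.strip()
        else:
            continue
        status = STATUS.match(rest)
        if status:
            results[pending] = status.group(1).split()[0]
            pending = None
    return results
```

With this approach, the interleaved example above and the clean one both resolve to `ok` for `test_annotation_with_nested_outerref`.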

Other instances

In the following instances, different tests fail intermittently and I haven't found the root cause. I got the same issues in swe-bench-docker with the matplotlib and sympy instances. I didn't have issues with psf__requests there, though.

matplotlib__matplotlib-23987
psf__requests-1963
psf__requests-2317
psf__requests-2674
sympy__sympy-13177
sympy__sympy-13146

Have you experienced the same issues? Would it also be possible for you to share your run_instance_logs somewhere, so I can compare them to your successful evaluation runs? Would be nice to nail this once and for all :)

I've run the benchmarks on Ubuntu 22 VMs with 16 cores on Azure (max_workers = 14).

ofirpress commented 2 days ago

thanks for making an issue about this

john-b-yang commented 2 days ago

Hi @aorwall! Thanks so much for the kind words, swe-bench-docker was a huge inspiration for our release. Really appreciate all the past and ongoing work w/ Moatless + SWE-bench evals 😄

Ok so regarding the issue...

1. Django test parsing issues: Yeah agreed, we've seen this too. The Django log parsing was most recently updated in #166, with a couple of changes before that too. I think it should cover the case you're talking about? I'll check.
2. Intermittently failing tests: We've been noticing this too. The approach we feel is reasonable is to just remove flaky tests if they're P2P (PASS_TO_PASS) ones, which has been the large majority so far. I will take a look at these instances, run them 5x, and apply the aforementioned clean-up.
3. run_instance_logs for lite: Linked here!

I'm actively working on 2, have gotten some help from Stanford folks as well on this - I think you can expect a dataset update that addresses these problems by end of next week at the latest!
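For readers following along, the flaky-test clean-up described above could look roughly like the sketch below. It is an illustration only, not the maintainers' actual script: `find_flaky_tests` and `prune_pass_to_pass` are hypothetical names, and it assumes each repeated run has already been parsed into a dict of test name to status.

```python
def find_flaky_tests(runs):
    """Return tests whose status differs across runs or that are missing from some run.

    `runs` is a list of dicts (test name -> status), one per repeated run
    of the same instance. This input shape is an assumption of the sketch.
    """
    seen = {}
    for run in runs:
        for name, status in run.items():
            seen.setdefault(name, set()).add(status)
    flaky = set()
    for name, statuses in seen.items():
        missing_somewhere = any(name not in run for run in runs)
        if len(statuses) > 1 or missing_somewhere:
            flaky.add(name)
    return flaky


def prune_pass_to_pass(pass_to_pass, flaky):
    """Drop flaky tests from a PASS_TO_PASS list, keeping the original order."""
    return [test for test in pass_to_pass if test not in flaky]
```

A usage example under the same assumptions: parse the logs of five runs on the gold patch, pass the five dicts to `find_flaky_tests`, then filter the instance's PASS_TO_PASS list with `prune_pass_to_pass`.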

aorwall commented 2 days ago

Looks like #166 fixed the Django issues :+1:

princeton-nlp / SWE-bench

Failing benchmark instances #167
