pschanely / CrossHair

An analysis tool for Python that blurs the line between testing and type systems.

Diagnose slow hypothesis tests #301

Open · pschanely opened this issue 4 weeks ago

pschanely commented 4 weeks ago

We're actually running low on important semantic issues, so I'd probably turn next to performance issues - there are some very slow tests that I've just skipped because they take several minutes, or hours, or don't finish at all within the 6-hour Actions time limit. The current commit with those skips is https://github.com/HypothesisWorks/hypothesis/pull/4034/commits/bb432101541c618762b1024b10e277658105baf8

Originally posted by @Zac-HD in https://github.com/pschanely/CrossHair/issues/292#issuecomment-2289529225

pschanely commented 3 weeks ago

@Zac-HD Done with one round of investigations and fixes! (hypothesis-crosshair 0.0.13 and crosshair-tool 0.0.68) Breaking up the tests into categories:

minimize() tests - should we never be running these with CrossHair? I see we try to avoid running it here, although some of the tests that are timing out appear to have _current_build_context.value == None under the crosshair testing profile. Affected tests:

Nested tests (used to time out, but now CrossHair crashes fast):

Just plain better, due to recent updates:

This gets a solver unsat here; ideally this will raise hypothesis.errors.BackendCannotProceed once that's available, so that Hypothesis would keep trying.

For me to continue to investigate:

Zac-HD commented 3 weeks ago

Sweet! I'll update shortly and see how it goes 😁

> minimize() tests - should we never be running these with CrossHair? I see we try to avoid running it here, although some of the tests that are timing out appear to have _current_build_context.value == None under the crosshair testing profile.

That skip is specific to running minimal(...) inside @given(), i.e. instead of crashing with the nested-tests error we just skip that test. On investigation there's a very obvious guess as to why it's super slow, too: minimal() has to find a counterexample before it can shrink it, and if CrossHair happens not to find one we can spend an awfully long time going through the max_examples=50_000 (!!!) iterations.
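
For anyone skimming, here's a minimal sketch of that failure mode using the public find() API rather than the test suite's internal minimal() helper (the strategy, the predicate, and the exact budget are illustrative):

```python
# Sketch only: find() must first discover a value satisfying the predicate
# before it can shrink it.  If the active backend never produces one, it works
# through the whole max_examples budget and then raises
# hypothesis.errors.NoSuchExample.
from hypothesis import find, settings, strategies as st

def search_with_big_budget(predicate):
    return find(
        st.integers(),
        predicate,
        settings=settings(max_examples=50_000),  # the "(!!!)" budget above
    )

# Fast when a witness turns up quickly; find() then shrinks it to the minimal
# satisfying value.
assert search_with_big_budget(lambda n: n > 100) == 101
```

Under the crosshair profile the same call can crawl through all 50,000 iterations if the solver never produces a witness.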

My action items:

pschanely commented 3 weeks ago

> That skip is specific to running minimal(...) inside @given(), i.e. instead of crashing with the nested-tests error we just skip that test. On investigation there's a very obvious guess as to why it's super slow, too: minimal() has to find a counterexample before it can shrink it, and if CrossHair happens not to find one we can spend an awfully long time going through the max_examples=50_000 (!!!) iterations.

Ah! I get it now. So I'll continue to look at those too.

Zac-HD commented 3 weeks ago

Very hacky, but... done. Well, maybe, I'll see what CI thinks after I sleep 😄

Zac-HD commented 3 weeks ago

OK, there are quite a few failures but I did mess around with the internals a bit and overall it's looking like an improvement. Quick triage:

pschanely commented 3 weeks ago

I got sidetracked on a few subtle-but-important issues this week, but this is still a big priority! I'll be looking at these in the next few days.

pschanely commented 3 weeks ago

0.0.70 fixes that ugly regression with unhashables!

The following two issues are going to take me a while, but, honestly, might legitimately be the next things to look at.

- [ ] (Phillip) Unsatisfiable: Could not find any examples from floats() that satisfied lambda x: x + 1 == x and not math.isinf(x)

I'll need to tackle the issue of real float semantics for this one. High value, ~high effort.
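
For context on why that predicate is satisfiable at all: over the reals, x + 1 == x has no finite solution, but with IEEE-754 doubles the 1 gets absorbed by rounding once x is large enough. A quick illustration (the constant is arbitrary; any finite double of magnitude 2**54 or more works the same way):

```python
# At large magnitudes the gap between adjacent doubles exceeds 1, so adding
# 1.0 rounds back to the same value, and the result is still finite.
import math

x = 1e308
assert x + 1 == x and not math.isinf(x)

# A solver reasoning with real-number semantics concludes the predicate is
# unsatisfiable, which is why modelling true float rounding ("real float
# semantics" above) is needed before examples like this can be found.
```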

- [ ] (???) strategy reprs fail to omit default values under crosshair, e.g. here; unclear where that should be fixed. Might be me in LazyStrategy.__repr__ due to the check if k not in sig.parameters or v is not sig.parameters[k].default?

Your guess is right - the failing int bound is a symbolic. I have a branch on the back burner for more accurate identity semantics. I hadn't thought about faithful integer interning as part of that (the biggest wins relate to avoiding path forks on enums and optional values), but I think I could make it happen.

As for whether the identity check on the hypothesis side is objectively correct, I'm unsure. Integer interning isn't guaranteed across implementations. But in order for your test to fail, I think integer interning would have to switch up mid-execution, which seems pretty unlikely.
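
To make the identity-vs-equality point concrete, here's a toy sketch (the lists() stand-in and SymbolicInt are illustrative placeholders, not Hypothesis's or CrossHair's actual classes):

```python
import inspect

def lists(elements, *, min_size=0, max_size=None):
    """Toy stand-in for a strategy constructor with an integer default."""

class SymbolicInt(int):
    """Toy stand-in for a symbolic integer proxying the value 0."""

sig = inspect.signature(lists)
default = sig.parameters["min_size"].default    # the plain int 0

native, symbolic = 0, SymbolicInt(0)

print(native is default)    # True on CPython (small ints are cached), so a
                            # concrete min_size=0 gets omitted from the repr
print(symbolic == default)  # True: the symbolic compares equal to the default
print(symbolic is default)  # False: a distinct object, so the repr keeps
                            # min_size=0 and the expected-repr check fails
```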

Zac-HD commented 3 weeks ago

For the identity check, I'm not (deliberately) checking interned integers, but rather whether the value is the default argument - so that we still show an explicitly passed value even when it equals the default. Integers usually defeat that, but after a quick investigation I've classified this as "not expected to pass under crosshair", so we're done.