pschanely / CrossHair

An analysis tool for Python that blurs the line between testing and type systems.

Diagnose slow hypothesis tests #301

Open · pschanely opened this issue 4 weeks ago

pschanely commented 4 weeks ago

We're actually running low on important semantic issues, so I'd probably turn next to performance issues - there are some very slow tests that I've just skipped because they take several minutes, or hours, or don't finish at all within the 6-hour Actions time limit. The current commit with those skips is https://github.com/HypothesisWorks/hypothesis/pull/4034/commits/bb432101541c618762b1024b10e277658105baf8

Originally posted by @Zac-HD in https://github.com/pschanely/CrossHair/issues/292#issuecomment-2289529225

pschanely commented 3 weeks ago

@Zac-HD Done with one round of investigations and fixes! (hypothesis-crosshair 0.0.13 and crosshair-tool 0.0.68) Breaking up the tests into categories:

minimize() tests - should we never be running these with CrossHair? I see we try to avoid running it here, although some of the tests that are timing out appear to have _current_build_context.value == None under the crosshair testing profile. Affected tests:

Nested tests (used to time out, but now CrossHair crashes fast):

Just plain better, due to recent updates:

This gets a solver unsat here; ideally this will raise hypothesis.errors.BackendCannotProceed once that's available, so that Hypothesis would keep trying.

For me to continue to investigate:

Zac-HD commented 3 weeks ago

Sweet! I'll update shortly and see how it goes 😁

> minimize() tests - should we never be running these with CrossHair? I see we try to avoid running it here, although some of the tests that are timing out appear to have _current_build_context.value == None under the crosshair testing profile.

That skip is specific to running minimal(...) inside @given(), i.e. instead of crashing with the nested-tests error we just skip that test. On investigation there's a very obvious guess as to why it's super slow, too: minimal() has to find a counterexample before it can shrink it, and if CrossHair happens not to find one we can spend an awfully long time going through the max_examples=50_000 (!!!) iterations.
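
For anyone skimming, here's a minimal sketch of that failure mode using the public find() API rather than the test suite's internal minimal() helper (the strategy, the predicate, and the exact budget are illustrative):

```python
# Sketch only: find() must first discover a value satisfying the predicate
# before it can shrink it.  If the active backend never produces one, it works
# through the whole max_examples budget and then raises
# hypothesis.errors.NoSuchExample.
from hypothesis import find, settings, strategies as st

def search_with_big_budget(predicate):
    return find(
        st.integers(),
        predicate,
        settings=settings(max_examples=50_000),  # the "(!!!)" budget above
    )

# Fast when a witness turns up quickly; find() then shrinks it to the minimal
# satisfying value.
assert search_with_big_budget(lambda n: n > 100) == 101
```

Under the crosshair profile the same call can crawl through all 50,000 iterations if the solver never produces a witness.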

My action items:

pschanely commented 3 weeks ago

> That skip is specific to running minimal(...) inside @given(), i.e. instead of crashing with the nested-tests error we just skip that test. On investigation there's a very obvious guess as to why it's super slow, too: minimal() has to find a counterexample before it can shrink it, and if CrossHair happens not to find one we can spend an awfully long time going through the max_examples=50_000 (!!!) iterations.

Ah! I get it now. So I'll continue to look at those too.

Zac-HD commented 3 weeks ago

Very hacky, but... done. Well, maybe, I'll see what CI thinks after I sleep 😄

Zac-HD commented 3 weeks ago

OK, there are quite a few failures but I did mess around with the internals a bit and overall it's looking like an improvement. Quick triage:

pschanely commented 3 weeks ago

I got sidetracked on a few subtle-but-important issues this week, but this is still a big priority! I'll be looking at these in the next few days.

pschanely commented 3 weeks ago

0.0.70 fixes that ugly regression with unhashables!

The following two issues are going to take me a while, but, honestly, might legitimately be the next things to look at.

- [ ] (Phillip) Unsatisfiable: Could not find any examples from floats() that satisfied lambda x: x + 1 == x and not math.isinf(x)

I'll need to tackle the issue of real float semantics for this one. High value, ~high effort.
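
For context on why that predicate is satisfiable at all: over the reals, x + 1 == x has no finite solution, but with IEEE-754 doubles the 1 gets absorbed by rounding once x is large enough. A quick illustration (the constant is arbitrary; any finite double of magnitude 2**54 or more works the same way):

```python
# At large magnitudes the gap between adjacent doubles exceeds 1, so adding
# 1.0 rounds back to the same value, and the result is still finite.
import math

x = 1e308
assert x + 1 == x and not math.isinf(x)

# A solver reasoning with real-number semantics concludes the predicate is
# unsatisfiable, which is why modelling true float rounding ("real float
# semantics" above) is needed before examples like this can be found.
```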

- [ ] (???) strategy reprs fail to omit default values under crosshair, e.g. here; unclear where that should be fixed. Might be me in LazyStrategy.__repr__ due to the check if k not in sig.parameters or v is not sig.parameters[k].default?

Your guess is right - the failing int bound is a symbolic. I have a branch on the back burner for more accurate identity semantics. I hadn't thought about faithful integer interning as part of that (the biggest wins relate to avoiding path forks on enums and optional values), but I think I could make it happen.

As for whether the identity check on the hypothesis side is objectively correct, I'm unsure. Integer interning isn't guaranteed across implementations. But in order for your test to fail, I think integer interning would have to switch up mid-execution, which seems pretty unlikely.
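
To make the identity-vs-equality point concrete, here's a toy sketch (the lists() stand-in and SymbolicInt are illustrative placeholders, not Hypothesis's or CrossHair's actual classes):

```python
import inspect

def lists(elements, *, min_size=0, max_size=None):
    """Toy stand-in for a strategy constructor with an integer default."""

class SymbolicInt(int):
    """Toy stand-in for a symbolic integer proxying the value 0."""

sig = inspect.signature(lists)
default = sig.parameters["min_size"].default    # the plain int 0

native, symbolic = 0, SymbolicInt(0)

print(native is default)    # True on CPython (small ints are cached), so a
                            # concrete min_size=0 gets omitted from the repr
print(symbolic == default)  # True: the symbolic compares equal to the default
print(symbolic is default)  # False: a distinct object, so the repr keeps
                            # min_size=0 and the expected-repr check fails
```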

Zac-HD commented 3 weeks ago

For the identity check, I'm not (deliberately) checking interned integers, but rather whether the value is the default argument - so that we still show an explicitly passed value even when it equals the default. Integers usually defeat that, but after a quick investigation I've classified this as "not expected to pass under crosshair", so we're done.