haf / expecto

A smooth testing lib for F#. APIs made for humans! Strong testing methodologies for everyone!
Apache License 2.0
663 stars 96 forks source link

add Expect.isSameTimeAs to benchmark #310

Open jackfoxy opened 5 years ago

jackfoxy commented 5 years ago

This complements isFasterThan. The 2 tests expect to execute in similar runtimes. An additional parameter of withinTimePercent triggers failure if one test is faster than the other by more than that percentage.

I believe there was a discussion about this on Twitter a couple months ago, but I do not recall the resolution. I thought there was already an issue here, but I could not find it.

I'm willing to work on this if maintainers agree on API and spec. (Someone may have a better name for this.)

jzabroski commented 5 years ago

This is a hot button topic for me.

I think Expect.isFasterThan is incorrectly named. While it's often how people think of performance, when trying to shave time off a method with micro-optimization via microbenchmarking, what we're really doing is what the medical community does when evaluating New Drug Applications or Biologics License Applications: Determine superiority, or determine non-inferiority. This is the very spirit of what Mann-Whitney U-tests seek to prove or disprove.

AnthonyLloyd commented 5 years ago

I tried the Mann-Whitney U-test but it came out to be unreliable. There was no obvious stopping criteria for equal. (see #302). Feel free to resurrect and see if was just me thinking about it incorrectly.

I'm happy for suggestions of better naming or new methods.

One thing I would say is it has to hold as a statistical test such that you can run it many times in units tests and not get false positives. 0.01% conf level was chosen so very roughly if you have 100 isFasterThan tests you'd get a false positive once in every 100 runs. (In theory anyway)

We'd have to think what that test is and stopping criteria for isSameTimeAs. Sounds like there should be one.

jzabroski commented 5 years ago

Yes, I've been learning more and more about how healthcare trials are performed. See for example: https://sph.uth.edu/courses/biometry/Lmoye/PH1820-21/PH1821/b1a.htm

In our world, we probably want multiple stopping rules just as they do in clinical trials: Our time dimension would be how long we are willing to let a test run for (Wall clock time) before we kill it. Our test subject dimension is simply test runs. The additional hidden variable that BenchmarkDotNet does not yet account for is hashing the DLL it benchmarks. I know this sounds obvious, but if you were to benchmark a DLL against itself at two different times, that would give you a baseline amount of noise in your test environment.

In general, I don't think we (developers) tend to think too hard about these things because usually when we have a human insight into performance that a compiler does not have, it's usually superior rather than non-inferior.

jackfoxy commented 5 years ago

Naively I just expect to re-use the Performance.timeCompare infrastructure used by isFasterThan, checking the resulting statistics execution times to see if they are within the range.

What is wrong with this approach?

AnthonyLloyd commented 5 years ago

I think you may need to code an example. I'm not clear at the moment how it would look.

jackfoxy commented 5 years ago

Will do. Give me a couple weeks. I'm deep in a lot of other things right now.