ocramz / xeno

Fast Haskell XML parser

Merging #19 into master - a case study #22

Closed (ocramz closed this issue 3 years ago)

ocramz commented 6 years ago

I had this lingering question: how do we decide whether a PR introduces a significant performance regression? Here are my notes, using #19 as a case study (@unhammer might be interested, too).

On my work laptop, a 2015 15" MBP with a 2.2 GHz i7 and 16 GB of RAM, I get these figures for the xeno benchmarks on the largest dataset:

master:

benchmarking 211KB/xeno-sax
time                 218.7 μs   (214.9 μs .. 221.8 μs)
                     0.998 R²   (0.998 R² .. 0.999 R²)
mean                 215.8 μs   (213.5 μs .. 218.1 μs)
std dev              7.276 μs   (6.088 μs .. 8.812 μs)
variance introduced by outliers: 29% (moderately inflated)

benchmarking 211KB/xeno-dom
time                 535.4 μs   (525.7 μs .. 544.6 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 539.8 μs   (534.0 μs .. 547.0 μs)
std dev              21.60 μs   (19.07 μs .. 24.03 μs)
variance introduced by outliers: 33% (moderately inflated)

PR #19:

benchmarking 211KB/xeno-sax
time                 225.9 μs   (223.6 μs .. 228.2 μs)
                     0.999 R²   (0.998 R² .. 0.999 R²)
mean                 228.0 μs   (225.7 μs .. 231.7 μs)
std dev              9.275 μs   (6.884 μs .. 12.33 μs)
variance introduced by outliers: 38% (moderately inflated)

benchmarking 211KB/xeno-dom
time                 542.3 μs   (533.4 μs .. 551.1 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 551.6 μs   (544.6 μs .. 557.6 μs)
std dev              22.42 μs   (19.93 μs .. 24.23 μs)
variance introduced by outliers: 34% (moderately inflated)

The sample size is the criterion default (since these benchmarks are run with defaultMain): n = 1000.
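
For reference, here is a minimal sketch of the kind of criterion harness these figures come from; the fixture path below is illustrative, and the exact Xeno entry points exercised by the repository's actual bench suite may differ slightly:

import           Criterion.Main
import qualified Data.ByteString as S
import qualified Xeno.DOM
import qualified Xeno.SAX

main :: IO ()
main =
  defaultMain
    [ -- load the 211KB fixture once, outside the timed region
      env (S.readFile "data/211kb-sample.xml") $ \bs ->
        bgroup "211KB"
          [ bench "xeno-sax" $ whnf Xeno.SAX.validate bs
          , bench "xeno-dom" $ whnf Xeno.DOM.parse bs
          ]
    ]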

Using the Z-test (which assumes the samples are approximately Gaussian) to assess whether the timing difference is significant, I get this result:

xeno-dom:

-- sample size (criterion default)
n = 1000

-- mean benchmark time before the patch
mu0d = 539.8e-6

-- standard deviation before the patch
sig0d = 21.6e-6

-- mean benchmark time after patch #19
mu1d = 551.6e-6

-- Z score
z_d = sqrt n * (mu1d - mu0d) / sig0d

z_d = 17.27, i.e. the mean benchmark time after the patch is more than 17 standard errors larger than before the patch. For the xeno-sax benchmark I get a Z-score greater than 57; the probability of these values arising by chance (that is, the probability that a standard normal r.v. yields a sample larger than Z) is extremely small, so we could say with some confidence that the patch introduces a regression.
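
To put a number on "extremely small", here is a self-contained sketch that recomputes both Z-scores from the figures above and prints the corresponding one-sided p-values; it assumes the statistics package for the standard normal CDF:

import Statistics.Distribution        (complCumulative)
import Statistics.Distribution.Normal (standard)

-- sample size (criterion default, as above)
nSamples :: Double
nSamples = 1000

-- | Z score for a before/after pair of mean benchmark times,
-- scaled by the before-patch standard deviation.
zScore :: Double -> Double -> Double -> Double
zScore mu0 sig0 mu1 = sqrt nSamples * (mu1 - mu0) / sig0

-- | One-sided p-value: probability that a standard normal exceeds z.
pValue :: Double -> Double
pValue = complCumulative standard

main :: IO ()
main = do
  let zDom = zScore 539.8e-6 21.60e-6 551.6e-6  -- ~ 17.3
      zSax = zScore 215.8e-6 7.276e-6 228.0e-6  -- ~ 53
  mapM_ (\z -> print (z, pValue z)) [zDom, zSax]

(With the mean standard deviation plugged in, the xeno-sax Z-score comes out around 53 rather than 57; the exact value depends on which spread estimate is used, but the conclusion is the same.)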

Any thoughts?

ocramz commented 6 years ago

There are many factors that complicate this analysis, however: the measured running times are influenced by the internals of the OS scheduler and the garbage collector, so the actual significance of these statistical tests is still questionable.

AndreasPK commented 6 years ago

Stumbling over this, here are some thoughts in passing:

As for statistical tests: while they have value, they only tell you whether the produced executable is faster or slower, which usually correlates with better code, but not always.

It's quite possible for worse code to perform better (and the other way around) under specific circumstances. Things like functions growing beyond the inline threshold or code alignment changes can lead to surprising changes in performance.

I like to vary GHC options and see how they affect performance: things like -dunique-increment=[-1/1] and -O[1/2]. If the performance changes are consistent across these, that's a good indicator.

The linked PR seems to require the parser to do more work (checking for whitespace), so it seems obvious to me that it will come at a small performance cost. So it's more a question of whether the feature is worth the cost, and how much cost it's worth.

ocramz commented 6 years ago

Hi @AndreasPK, thank you for your observations; I'll make a note of varying those GHC flags as well. It would be really great if criterion itself could recompile and benchmark distinct configurations of the binary, though that would make the analysis significantly more involved. Perhaps I should just merge the patch into master, since 1. a few users requested it and 2. it seemingly increases the running time by only a small amount. @chrisdone what do you think?

chrisdone commented 6 years ago

I leave it up to your judgment, @ocramz :+1: :man_shrugging:

pkamenarsky commented 3 years ago

I've hit that limitation as well. If the unknown performance hit is blocking this, may I suggest a cabal/CPP flag? I would gladly submit a PR.
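
For concreteness, here is a minimal sketch of what a CPP-gated check could look like; the flag, macro and helper names below are hypothetical illustrations, not xeno's actual code:

{-# LANGUAGE CPP #-}
-- Hypothetical: a Cabal flag (say, skip-whitespace) would add
-- cpp-options: -DSKIP_WHITESPACE when enabled.
module WhitespaceSketch (skipSpaces) where

import           Data.ByteString (ByteString)
import qualified Data.ByteString as S
import           Data.Word (Word8)

-- | Index of the first non-whitespace byte at or after position i.
skipSpaces :: ByteString -> Int -> Int
#ifdef SKIP_WHITESPACE
skipSpaces str i
  | i < S.length str && isSpaceByte (S.index str i) = skipSpaces str (i + 1)
  | otherwise = i
  where
    isSpaceByte :: Word8 -> Bool
    isSpaceByte b = b == 32 || b == 10 || b == 9 || b == 13  -- ' ', '\n', '\t', '\r'
#else
-- With the flag off, the check compiles away entirely,
-- so the hot path pays no extra cost.
skipSpaces _ i = i
#endif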

ocramz commented 3 years ago

Thank you Philip, I'd be happy to help.


pkamenarsky commented 3 years ago

Done, see #44 - let me know what you think.

pkamenarsky commented 3 years ago

Do you need any help merging this into master? I can open a PR, since I've already done the work here: https://github.com/pkamenarsky/xeno.

ocramz commented 3 years ago

@pkamenarsky I've merged #44 and invited you and @mgajda as collaborators on the repo ^^

mgajda commented 3 years ago

@pkamenarsky Please compare with the latest xeno in https://gitlab.com/migamake/xeno. We will likely merge performance-enhancing changes from there this month.