There are many factors that complicate this analysis, however: the measured running times are influenced by the internals of the OS scheduler and the garbage collector, so the actual significance of these statistical tests is still questionable.
Stumbling over this, here are some thoughts in passing:
Statistical tests have value, but they only tell you whether the produced executable is faster or slower, which usually correlates with better code, though not always.
It's quite possible for worse code to perform better (and the other way around) under specific circumstances. Things like functions growing beyond the inline threshold or code alignment changes can lead to surprising changes in performance.
I like to vary GHC options and see how they affect performance: things like `-dunique-increment=[-1/1]` and `-O[1/2]`. If the performance changes are the same across these, that's a good indicator.
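One hypothetical way to run such a comparison with cabal (each invocation rebuilds the benchmarks with the given flags; the flag spellings are from the GHC user's guide, the invocation is just one option):

```
$ cabal bench --ghc-options="-O2 -dunique-increment=-1"
$ cabal bench --ghc-options="-O2 -dunique-increment=1"
$ cabal bench --ghc-options="-O1"
```

If the measured difference points the same way under all of these, it's less likely to be an artifact of inlining decisions or code layout.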
For the linked PR, it seems like it requires the parser to do more work (checking for whitespace), so it seems obvious to me that it will come at a small performance cost. The question is then more about whether the feature is worth the cost, and how much cost it's worth.
Hi @AndreasPK, thank you for your observations; I'll make a note of varying those GHC flags as well. It would be really great if `criterion` itself could recompile and benchmark distinct configurations of the binary, though that would make the analysis significantly more involved.
Perhaps I should just merge the patch into master, since 1. a few users requested it and 2. it seemingly increases the running time by only a small amount for a few users. @chrisdone, what do you think?
I leave it up to your judgment, @ocramz :+1: :man_shrugging:
I've hit that limitation as well. If the unknown performance hit is blocking this, may I suggest a cabal/CPP flag? I would gladly submit a PR.
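A minimal sketch of what that could look like (the flag and macro names here are hypothetical):

```
-- in xeno.cabal
flag extra-whitespace-checks
  description: enable the extra whitespace handling, at a small parsing cost
  default:     False
  manual:      True

library
  if flag(extra-whitespace-checks)
    cpp-options: -DEXTRA_WHITESPACE_CHECKS
```

In the parser module (with `{-# LANGUAGE CPP #-}` enabled), the extra checks would then sit behind `#ifdef EXTRA_WHITESPACE_CHECKS`, so users who want the current fast path keep the default and everyone else opts in with `cabal build -fextra-whitespace-checks`.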
Thank you, Philip, I'd be happy to help.
Done, see #44 - let me know what you think.
Do you need any help merging this into `master`? I can open a PR, since I've already done the work here: https://github.com/pkamenarsky/xeno.
@pkamenarsky I've merged #44 and invited you and @mgajda as collaborators to the repo ^^
@pkamenarsky Please compare with the latest `xeno` in https://gitlab.com/migamake/xeno.
We will likely merge performance-enhancing changes from there this month.
I had this lingering question: how to decide whether a PR introduces a significant performance regression? Here are my notes, using #19 as a case study (@unhammer might be interested, too).
On my work laptop, a 2015 MBP 15", 2.2 GHz i7 with 16 GB of RAM, I get these figures for the `xeno` tests with the largest dataset:

- `master`: (benchmark figures omitted)
- PR #19: (benchmark figures omitted)
The sample size is the `criterion` default (since these benchmarks are run with `defaultMain`): n = 1000.

To assess whether the timing difference is significant, I used the Z-test, which assumes the samples are approximately Gaussian.
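For reference, the two-sample statistic is z = (m2 - m1) / sqrt(SE1^2 + SE2^2), where SE = s / sqrt(n); here is a minimal Haskell sketch (the function is mine, not part of `criterion`; you feed it the summary statistics from the two benchmark reports):

```haskell
-- Two-sample Z statistic for comparing two benchmark means.
-- m = sample mean, s = sample standard deviation, n = sample size.
zScore :: (Double, Double, Int)  -- ^ (m, s, n) before the patch
       -> (Double, Double, Int)  -- ^ (m, s, n) after the patch
       -> Double
zScore (m1, s1, n1) (m2, s2, n2) = (m2 - m1) / sqrt (se1*se1 + se2*se2)
  where
    se1 = s1 / sqrt (fromIntegral n1)  -- standard error of the 'before' mean
    se2 = s2 / sqrt (fromIntegral n2)  -- standard error of the 'after' mean
```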
The results: for `xeno-dom`, z_d = 17.27, i.e. the mean benchmark time after the patch is more than 17 standard errors larger than before the patch; for the `xeno-sax` benchmarks I get a Z-score > 57. The probability of these values happening by accident (that is, the probability of a standard normal r.v. yielding a sample larger than Z) is extremely small, so we could say with some confidence that the patch introduces a regression.

Any thoughts?