phadej opened 2 years ago
Hm. I'm not quite sure what to make of this, since the choice of `performMinorGC` was made deliberately in #187 after some careful investigation. I'm a bit reluctant to simply revert that change without a consensus that this is a clear win. (cc'ing @patrickdoc, who originally investigated this in #187.)
The original issue was that “fast” functions took an extremely long time to benchmark. The ticket should have some examples.
If I remember correctly, this blog post served as the main source of inspiration. https://blog.janestreet.com/core_bench-micro-benchmarking-for-ocaml/
The gist of the argument is something along the lines of the GC will always affect the numbers, and the goal is to collect enough stats to mitigate the effect.
In #187 both the pre-benchmark `performGC` and the after-the-task `performGC` were changed to `performMinorGC`. I say that only the pre-measurement GC should be `performMajorGC`. In fact, 99% of the time I run benchmarks I don't care about GC statistics (I don't set `+RTS -T`), so the after-measurement GCs are wasted work to begin with.
The claim in #187 that GC is performed four times per sample is a red flag. Once should be enough for time measurement, and that one can be a major GC.
EDIT: There were two GCs in `runBenchmarkable`. These are not per sample, they are per benchmark. The major GC before each benchmark is still justified even if only a minor GC is run before each sample. This can easily be one reason (out of many) for #166.
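To make the distinction concrete, here is a hedged sketch (not criterion's actual code; `runBenchmarkSketch` is an illustrative name) of the GC placement argued for above: one major GC per *benchmark* to clear garbage left behind by earlier benchmarks, and at most a minor GC per *sample*:

```haskell
import System.Mem (performGC, performMinorGC)

-- Hypothetical sketch of the GC placement discussed above:
-- a major collection once per benchmark, a minor one per sample.
runBenchmarkSketch :: Int -> IO () -> IO ()
runBenchmarkSketch samples sample = do
  performGC                        -- per-benchmark: major collection
  mapM_ (\_ -> do
           performMinorGC          -- per-sample: minor only
           sample)
        [1 .. samples]

main :: IO ()
main = do
  runBenchmarkSketch 3 (pure ())
  putStrLn "done"
```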
I'm also confused why the reported times in #187 differ between before and after, if the claim is that GC affects only the performance of the GC suite, not the performance of the benchmarks themselves!
before

```
benchmarking small-fast
bootstrapping with 39 of 312 samples (12%)
time                 6.877 ns   (6.724 ns .. 7.062 ns)
                     0.996 R²   (0.994 R² .. 0.997 R²)
mean                 6.898 ns   (6.758 ns .. 7.034 ns)
std dev              448.3 ps   (391.7 ps .. 540.9 ps)
```
and after

```
benchmarking small-fast
bootstrapping with 42 of 316 samples (13%)
time                 5.902 ns   (5.888 ns .. 5.924 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 6.015 ns   (5.944 ns .. 6.119 ns)
std dev              281.0 ps   (210.5 ps .. 338.7 ps)
```
What am I missing?
I also want to remind that setting the nursery size is very significant. For example, running the `maps` benchmark with `-m pattern random` (I show only one set of results; five benchmarks are run, though).
With a large nursery, virtually every major GC would come only from the runner explicitly asking for one; also, the old generation (hopefully) consists only of criterion's own bookkeeping data, not any benchmark-generated garbage: i.e. not much. (Not worth compacting, for example.)
Using major GC makes criterion spend more time in GC, but if the nursery is appropriately sized, the time increase is acceptable. OTOH, if you don't increase the nursery size, benchmark results are significantly affected by GCs. (Here we should consider using `+RTS -T` and using collections as an additional independent variable, as Jane Street's blog post suggests.)
I'm still puzzled why using major GC before measurements affects the results, but that happens only for the first benchmark, so it might have something to do with that. (In fact, I think I experienced something similar with `tasty-bench`: the first benchmark's results are less stable.)
```
benchmarking Int/IntMap/random
time                 16.01 ms   (15.88 ms .. 16.17 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 15.83 ms   (15.76 ms .. 15.95 ms)
std dev              224.5 μs   (177.1 μs .. 290.6 μs)

benchmarking Int/Map/random
time                 24.44 ms   (24.13 ms .. 24.87 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 24.94 ms   (24.73 ms .. 25.23 ms)
std dev              553.6 μs   (390.0 μs .. 767.2 μs)
...
                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0     42218 colls,     0 par    9.425s   9.431s     0.0002s    0.0022s
  Gen  1      2210 colls,     0 par    2.769s   2.771s     0.0013s    0.0044s

25,96s user 0,12s system 99% cpu 26,092 total
```
```
benchmarking Int/IntMap/random
time                 16.03 ms   (15.83 ms .. 16.22 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 16.36 ms   (16.25 ms .. 16.51 ms)
std dev              328.6 μs   (250.9 μs .. 541.9 μs)

benchmarking Int/Map/random
time                 25.09 ms   (25.01 ms .. 25.25 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 25.03 ms   (24.97 ms .. 25.12 ms)
std dev              159.5 μs   (118.5 μs .. 215.5 μs)
...
                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0     41485 colls,     0 par    9.430s   9.431s     0.0002s    0.0016s
  Gen  1      2299 colls,     0 par    2.951s   2.952s     0.0013s    0.0036s
```
```
benchmarking Int/IntMap/random
time                 8.505 ms   (8.253 ms .. 8.869 ms)
                     0.989 R²   (0.976 R² .. 0.996 R²)
mean                 9.813 ms   (9.535 ms .. 10.22 ms)
std dev              976.6 μs   (654.1 μs .. 1.641 ms)

benchmarking Int/Map/random
time                 17.21 ms   (17.08 ms .. 17.38 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 16.86 ms   (16.77 ms .. 16.96 ms)
std dev              228.9 μs   (192.4 μs .. 274.7 μs)
...
                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1311 colls,     0 par    1.388s   1.387s     0.0011s    0.0061s
  Gen  1         8 colls,     0 par    0.006s   0.006s     0.0007s    0.0015s

25,77s user 0,04s system 99% cpu 25,813 total
```
```
benchmarking Int/IntMap/random
time                 9.246 ms   (8.792 ms .. 9.867 ms)
                     0.987 R²   (0.979 R² .. 0.997 R²)
mean                 10.05 ms   (9.832 ms .. 10.48 ms)
std dev              795.8 μs   (413.7 μs .. 1.474 ms)
...
benchmarking Int/Map/random
time                 17.38 ms   (17.23 ms .. 17.59 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 16.92 ms   (16.78 ms .. 17.05 ms)
std dev              336.7 μs   (285.7 μs .. 412.9 μs)

                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1090 colls,     0 par    1.466s   1.466s     0.0013s    0.0061s
  Gen  1       192 colls,     0 par    0.117s   0.117s     0.0006s    0.0015s
```
Looks like, if the first benchmark is of the "right" size, its execution time is underestimated, as GC hasn't started to kick in yet:
EDIT: Doesn't seem to matter whether I do minor or major GC before the measurement.
With `random` first:
```
benchmarking Int/IntMap/random
time                 8.705 ms   (8.640 ms .. 8.771 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 8.895 ms   (8.807 ms .. 9.141 ms)
std dev              399.1 μs   (108.0 μs .. 855.1 μs)
variance introduced by outliers: 21% (moderately inflated)

benchmarking Int/IntMap/sorted
time                 2.395 ms   (2.363 ms .. 2.441 ms)
                     0.998 R²   (0.998 R² .. 0.999 R²)
mean                 2.527 ms   (2.503 ms .. 2.548 ms)
std dev              79.49 μs   (66.39 μs .. 97.10 μs)
variance introduced by outliers: 16% (moderately inflated)
```
(The shape of the `random` time densities seems to be quite reproducible on my machine.)
With `random` second:
```
benchmarking Int/IntMap/sorted
time                 2.424 ms   (2.403 ms .. 2.456 ms)
                     0.999 R²   (0.999 R² .. 0.999 R²)
mean                 2.520 ms   (2.500 ms .. 2.544 ms)
std dev              72.98 μs   (53.98 μs .. 114.8 μs)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Int/IntMap/random
time                 9.031 ms   (8.990 ms .. 9.075 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.107 ms   (9.069 ms .. 9.150 ms)
std dev              111.7 μs   (85.06 μs .. 145.9 μs)
```
> The major GC before each benchmark is still justified even if only a minor GC is run before each sample.
Yes, this definitely sounds correct. Although based on your samples it seems like even putting a major GC before every loop doesn't completely eliminate the issue.
> I'm also confused why the reported times are different in #187 in before and after.
I would guess that at the time I interpreted < 1ns as within the "noise" range of measurement.
> Using major GC makes criterion spend more time in GC, but if nursery is appropriately sized, the time increase is acceptable. OTOH, if you don't increase nursery size, benchmark results are significantly affected by GCs.
To some extent, allocation + GC time are a component of a given function's performance. The benchmark captures that by letting GCs occur naturally and then smoothing out the curve to include them.
The benchmark is looking to collect a certain threshold of acceptable measurements by progressively increasing the number of executions per measurement: https://github.com/haskell/criterion/blob/52ef4a7699237a8952475146c51fc5d8ebff5876/Criterion/Measurement.hs#L220
If your function takes 5ms to run, then the benchmark will only include measurements that include > 6 executions (>30ms total). It will continue increasing executions until the sum of all the valid measurements is > 300ms (10 * 30ms).
What this means is that anything above a certain ratio of allocations/time, you'll always trigger GCs. Increasing the nursery size increases the minimum ratio that would trigger GCs, but the "right" value depends on your function.
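The loop described above can be modelled as follows. This is an illustrative sketch, not criterion's actual code (its real series and thresholds live in `Criterion.Measurement`): grow the iteration count roughly geometrically, keep only measurements above a per-measurement threshold, and stop once the kept measurements exceed a total budget.

```haskell
import Data.List (nub)

-- Roughly geometric growth of iterations per measurement (5% steps).
growthSeries :: [Int]
growthSeries = nub (map round (iterate (* 1.05) (1 :: Double)))

-- Keep measurements longer than `threshold` seconds until their sum
-- exceeds `budget` seconds. `run n` times one measurement of n iterations.
collect :: (Int -> Double) -> Double -> Double -> [(Int, Double)]
collect run threshold budget = go growthSeries 0 []
  where
    go [] _ acc = reverse acc
    go (n:ns) total acc
      | total >= budget = reverse acc
      | t >= threshold  = go ns (total + t) ((n, t) : acc)
      | otherwise       = go ns total acc
      where t = run n

main :: IO ()
main = do
  -- toy "clock": pretend each iteration takes 5 ms, as in the example above
  let fake n = fromIntegral n * 0.005
  print (length (collect fake 0.03 0.3))  -- prints 7
```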
> (Here we should consider using `-T` and use collections as an additional independent variable, as Jane Street's blog post suggests).
Looks like I left a related comment when looking into the stats available from the GC result: https://github.com/haskell/criterion/pull/187#discussion_r170494245 . I think it is possible to extract mutator time + gc time to potentially eliminate GC from the measurement, but I'm not sure that gives the best picture of performance.
I believe we've traded precision (consistent, mutator-only measurements) for accuracy (slightly less precise, but adding in a measured impact from GC).
The original concern was that the lack of major GCs was skewing the numbers, but your recent results suggest it might be an issue of earlier benchmarks affecting later ones? Is that the current state of the investigation?
I disagree.
> To some extent, allocation + GC time are a component of a given function's performance.
Allocation is indeed a component of a function's performance, and I don't think it makes sense to try to isolate it. (Allocation rate and time are correlated.) However, whether GC occurs or not is a global property of the program and the RTS settings.
So I interpret this as: criterion doesn't try to minimize the effect of GC, and only runs `performMinorGC` to update GC statistics. Then I'll make a PR to do GC only if GC stats are available (i.e. `+RTS -T`), using `getRTSStatsEnabled`.
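The proposed change could look something like this sketch (the function name is hypothetical, not criterion's API): only pay for the post-measurement GC when GC statistics are actually being collected.

```haskell
import Control.Monad (when)
import GHC.Stats (getRTSStatsEnabled)
import System.Mem (performMinorGC)

-- Run the post-measurement GC only when +RTS -T is in effect,
-- since without it the GC stats it would refresh are unavailable.
gcWhenStatsEnabled :: IO ()
gcWhenStatsEnabled = do
  enabled <- getRTSStatsEnabled
  when enabled performMinorGC

main :: IO ()
main = gcWhenStatsEnabled >> putStrLn "ok"
```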
I'm remembering now that the GC stats are presented as a diff since the last one. So the pre-loop GC acts as a baseline, and the post-loop GC is used to update GC stats.
What's the benefit of skipping those minor GCs? Just the benchmark running faster, or are you expecting the results to be affected?
I expect the results to be different, as GCs will become less deterministic, but that is good, isn't it? You said:
> To some extent, allocation + GC time are a component of a given function's performance.
We shouldn't favor small inputs (which may run without needing a single minor GC); instead we should let them (non-deterministically) have some too.
Yeah, I suppose that is consistent with the rest of the thinking.
> In 1.4.0.0 the `performGC` was changed to `performMinorGC`.
While that is true, it also "destroys" the accuracy for somewhat slower functions: when `measure` (which allocates a relatively big `Measure` object on each iteration) runs enough iterations, it triggers a GC, which must then be a major one: that run is a massive outlier. IMHO, it's simpler just to accept that GCs have to be run; this is not the right place to optimize for performance.