phadej opened 2 years ago
Hm. I'm not quite sure what to make of this, since the choice of `performMinorGC` was made deliberately in #187 after some careful investigation. I'm a bit reluctant to simply revert that change without a consensus that this is a clear win. (cc'ing @patrickdoc, who originally investigated this in #187.)
The original issue was that “fast” functions took an extremely long time to benchmark. The ticket should have some examples.
If I remember correctly, this blog post served as the main source of inspiration. https://blog.janestreet.com/core_bench-micro-benchmarking-for-ocaml/
The gist of the argument is something along the lines of the GC will always affect the numbers, and the goal is to collect enough stats to mitigate the effect.
In #187 both the pre-benchmark `performGC` and the after-the-task `performGC` were changed to `performMinorGC`. I say that only the pre-measurement GC should be `performMajorGC`. In fact, 99% of the time I run benchmarks I don't care about GC statistics (I don't set `+RTS -T`), so the after-measurement GCs are wasted work to begin with.
The claim in #187 that GC is performed four times per sample is a red flag. Once should be enough for time measurement, and that one can be a major GC.
EDIT: There were two GCs in `runBenchmarkable`. These are not per sample, they are per benchmark. The major GC before each benchmark is still justified even if only a minor GC is run before each sample. This can easily be one reason (out of many) for #166.
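To make the distinction concrete, here is a hedged sketch (not criterion's actual code; `runBenchmarkSketch` is an illustrative name) of the GC placement argued for above: one major GC per *benchmark* to clear garbage left behind by earlier benchmarks, and at most a minor GC per *sample*:

```haskell
import System.Mem (performGC, performMinorGC)

-- Hypothetical sketch of the GC placement discussed above:
-- a major collection once per benchmark, a minor one per sample.
runBenchmarkSketch :: Int -> IO () -> IO ()
runBenchmarkSketch samples sample = do
  performGC                        -- per-benchmark: major collection
  mapM_ (\_ -> do
           performMinorGC          -- per-sample: minor only
           sample)
        [1 .. samples]

main :: IO ()
main = do
  runBenchmarkSketch 3 (pure ())
  putStrLn "done"
```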
I'm also confused why the reported times in #187 differ between before and after, if the claim is that GC affects only the performance of the GC suite, not the performance of the benchmarks themselves!
before

```
benchmarking small-fast
bootstrapping with 39 of 312 samples (12%)
time                 6.877 ns   (6.724 ns .. 7.062 ns)
                     0.996 R²   (0.994 R² .. 0.997 R²)
mean                 6.898 ns   (6.758 ns .. 7.034 ns)
std dev              448.3 ps   (391.7 ps .. 540.9 ps)
```
and after

```
benchmarking small-fast
bootstrapping with 42 of 316 samples (13%)
time                 5.902 ns   (5.888 ns .. 5.924 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 6.015 ns   (5.944 ns .. 6.119 ns)
std dev              281.0 ps   (210.5 ps .. 338.7 ps)
```
What am I missing?
I also want to remind that setting the nursery size is very significant. For example, running the `maps` benchmark with `-m pattern random` (I show only one set of results; five benchmarks are run, though).
With a large nursery, virtually every major GC would come only from the runner explicitly asking for one; also, the old generation (hopefully) consists only of criterion's own bookkeeping data, not any benchmark-generated garbage: i.e. not much. (Not worth compacting, for example.)
Using major GC makes criterion spend more time in GC, but if the nursery is appropriately sized, the time increase is acceptable. OTOH, if you don't increase the nursery size, benchmark results are significantly affected by GCs. (Here we should consider using `+RTS -T` and using collections as an additional independent variable, as Jane Street's blog post suggests.)
I'm still puzzled why using major GC before measurements affects the results, but that happens only for the first benchmark, so it might have something to do with that. (In fact, I think I experienced something similar with `tasty-bench`: the first benchmark's results are less stable.)
```
benchmarking Int/IntMap/random
time                 16.01 ms   (15.88 ms .. 16.17 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 15.83 ms   (15.76 ms .. 15.95 ms)
std dev              224.5 μs   (177.1 μs .. 290.6 μs)

benchmarking Int/Map/random
time                 24.44 ms   (24.13 ms .. 24.87 ms)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 24.94 ms   (24.73 ms .. 25.23 ms)
std dev              553.6 μs   (390.0 μs .. 767.2 μs)
...
                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0     42218 colls,     0 par    9.425s   9.431s     0.0002s    0.0022s
  Gen  1      2210 colls,     0 par    2.769s   2.771s     0.0013s    0.0044s

25,96s user 0,12s system 99% cpu 26,092 total
```
```
benchmarking Int/IntMap/random
time                 16.03 ms   (15.83 ms .. 16.22 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 16.36 ms   (16.25 ms .. 16.51 ms)
std dev              328.6 μs   (250.9 μs .. 541.9 μs)

benchmarking Int/Map/random
time                 25.09 ms   (25.01 ms .. 25.25 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 25.03 ms   (24.97 ms .. 25.12 ms)
std dev              159.5 μs   (118.5 μs .. 215.5 μs)
...
                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0     41485 colls,     0 par    9.430s   9.431s     0.0002s    0.0016s
  Gen  1      2299 colls,     0 par    2.951s   2.952s     0.0013s    0.0036s
```
```
benchmarking Int/IntMap/random
time                 8.505 ms   (8.253 ms .. 8.869 ms)
                     0.989 R²   (0.976 R² .. 0.996 R²)
mean                 9.813 ms   (9.535 ms .. 10.22 ms)
std dev              976.6 μs   (654.1 μs .. 1.641 ms)

benchmarking Int/Map/random
time                 17.21 ms   (17.08 ms .. 17.38 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 16.86 ms   (16.77 ms .. 16.96 ms)
std dev              228.9 μs   (192.4 μs .. 274.7 μs)
...
                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1311 colls,     0 par    1.388s   1.387s     0.0011s    0.0061s
  Gen  1         8 colls,     0 par    0.006s   0.006s     0.0007s    0.0015s

25,77s user 0,04s system 99% cpu 25,813 total
```
```
benchmarking Int/IntMap/random
time                 9.246 ms   (8.792 ms .. 9.867 ms)
                     0.987 R²   (0.979 R² .. 0.997 R²)
mean                 10.05 ms   (9.832 ms .. 10.48 ms)
std dev              795.8 μs   (413.7 μs .. 1.474 ms)
...
benchmarking Int/Map/random
time                 17.38 ms   (17.23 ms .. 17.59 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 16.92 ms   (16.78 ms .. 17.05 ms)
std dev              336.7 μs   (285.7 μs .. 412.9 μs)

                                  Tot time (elapsed)  Avg pause  Max pause
  Gen  0      1090 colls,     0 par    1.466s   1.466s     0.0013s    0.0061s
  Gen  1       192 colls,     0 par    0.117s   0.117s     0.0006s    0.0015s
```
Looks like, if the first benchmark is of the "right" size, its execution time is underestimated, as GC hasn't started to kick in yet:
EDIT: Doesn't seem to matter whether I do minor or major GC before the measurement.
With `random` first:
```
benchmarking Int/IntMap/random
time                 8.705 ms   (8.640 ms .. 8.771 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 8.895 ms   (8.807 ms .. 9.141 ms)
std dev              399.1 μs   (108.0 μs .. 855.1 μs)
variance introduced by outliers: 21% (moderately inflated)

benchmarking Int/IntMap/sorted
time                 2.395 ms   (2.363 ms .. 2.441 ms)
                     0.998 R²   (0.998 R² .. 0.999 R²)
mean                 2.527 ms   (2.503 ms .. 2.548 ms)
std dev              79.49 μs   (66.39 μs .. 97.10 μs)
variance introduced by outliers: 16% (moderately inflated)
```
(The shape of the `random` time densities seems to be quite reproducible on my machine.)
With `random` second:
```
benchmarking Int/IntMap/sorted
time                 2.424 ms   (2.403 ms .. 2.456 ms)
                     0.999 R²   (0.999 R² .. 0.999 R²)
mean                 2.520 ms   (2.500 ms .. 2.544 ms)
std dev              72.98 μs   (53.98 μs .. 114.8 μs)
variance introduced by outliers: 14% (moderately inflated)

benchmarking Int/IntMap/random
time                 9.031 ms   (8.990 ms .. 9.075 ms)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 9.107 ms   (9.069 ms .. 9.150 ms)
std dev              111.7 μs   (85.06 μs .. 145.9 μs)
```
> The major GC before each benchmark is still justified even if only a minor GC is run before each sample.
Yes, this definitely sounds correct. Although based on your samples it seems like even putting a major GC before every loop doesn't completely eliminate the issue.
> I'm also confused why the reported times are different in #187 in before and after.
I would guess that at the time I interpreted < 1ns as within the "noise" range of measurement.
> Using major GC makes criterion spend more time in GC, but if nursery is appropriately sized, the time increase is acceptable. OTOH, if you don't increase nursery size, benchmark results are significantly affected by GCs.
To some extent, allocation + GC time are a component of a given function's performance. The benchmark captures that by letting GCs occur naturally and then smoothing out the curve to include them.
The benchmark is looking to collect a certain threshold of acceptable measurements by progressively increasing the number of executions per measurement: https://github.com/haskell/criterion/blob/52ef4a7699237a8952475146c51fc5d8ebff5876/Criterion/Measurement.hs#L220
If your function takes 5ms to run, then the benchmark will only include measurements that include > 6 executions (>30ms total). It will continue increasing executions until the sum of all the valid measurements is > 300ms (10 * 30ms).
What this means is that anything above a certain ratio of allocations/time, you'll always trigger GCs. Increasing the nursery size increases the minimum ratio that would trigger GCs, but the "right" value depends on your function.
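The loop described above can be modelled as follows. This is an illustrative sketch, not criterion's actual code (its real series and thresholds live in `Criterion.Measurement`): grow the iteration count roughly geometrically, keep only measurements above a per-measurement threshold, and stop once the kept measurements exceed a total budget.

```haskell
import Data.List (nub)

-- Roughly geometric growth of iterations per measurement (5% steps).
growthSeries :: [Int]
growthSeries = nub (map round (iterate (* 1.05) (1 :: Double)))

-- Keep measurements longer than `threshold` seconds until their sum
-- exceeds `budget` seconds. `run n` times one measurement of n iterations.
collect :: (Int -> Double) -> Double -> Double -> [(Int, Double)]
collect run threshold budget = go growthSeries 0 []
  where
    go [] _ acc = reverse acc
    go (n:ns) total acc
      | total >= budget = reverse acc
      | t >= threshold  = go ns (total + t) ((n, t) : acc)
      | otherwise       = go ns total acc
      where t = run n

main :: IO ()
main = do
  -- toy "clock": pretend each iteration takes 5 ms, as in the example above
  let fake n = fromIntegral n * 0.005
  print (length (collect fake 0.03 0.3))  -- prints 7
```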
> (Here we should consider using `-T` and use collections as an additional independent variable, as Jane Street's blog post suggests).
Looks like I left a related comment when looking into the stats available from the GC result: https://github.com/haskell/criterion/pull/187#discussion_r170494245 . I think it is possible to extract mutator time + gc time to potentially eliminate GC from the measurement, but I'm not sure that gives the best picture of performance.
I believe we've traded precision (consistent, mutator-only measurements) for accuracy (slightly less precise, but adding in a measured impact from GC).
The original concern was that the lack of major GCs was skewing the numbers, but your recent results suggest it might be an issue of earlier benchmarks affecting later ones? Is that the current state of the investigation?
I disagree.
> To some extent, allocation + GC time are a component of a given function's performance.
Allocation is indeed a component of a function's performance, and I don't think it makes sense to try to isolate it. (Allocation rate and time are correlated.) However, whether GC occurs or not is a global property of the program and the RTS settings.
So I interpret this as: criterion doesn't try to minimize the effect of GC, and only runs `performMinorGC` to update GC statistics. Then I'll make a PR to do GC only if GC stats are available (i.e. `+RTS -T`), using `getRTSStatsEnabled`.
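The proposed change could look something like this sketch (the function name is hypothetical, not criterion's API): only pay for the post-measurement GC when GC statistics are actually being collected.

```haskell
import Control.Monad (when)
import GHC.Stats (getRTSStatsEnabled)
import System.Mem (performMinorGC)

-- Run the post-measurement GC only when +RTS -T is in effect,
-- since without it the GC stats it would refresh are unavailable.
gcWhenStatsEnabled :: IO ()
gcWhenStatsEnabled = do
  enabled <- getRTSStatsEnabled
  when enabled performMinorGC

main :: IO ()
main = gcWhenStatsEnabled >> putStrLn "ok"
```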
I'm remembering now that the GC stats are presented as a diff since the last one. So the pre-loop GC acts as a baseline, and the post-loop GC is used to update GC stats.
What's the benefit of skipping those minor GCs? Just the benchmark running faster, or are you expecting the results to be affected?
I expect the results to be different, as GCs will become less deterministic, but that is good, isn't it? You said:
> To some extent, allocation + GC time are a component of a given function's performance.
We shouldn't favor small inputs (which may run without needing a single minor GC); instead we should let them (non-deterministically) have some too.
Yeah, I suppose that is consistent with the rest of the thinking.
> In 1.4.0.0 the `performGC` was changed to `performMinorGC`.
While that is true, it also "destroys" the accuracy for somewhat slower functions: when `measure` (which allocates a relatively big `Measure` object on each iteration) runs enough iterations, it triggers a GC, which must then be a major one: that run is a massive outlier. IMHO, it's simpler just to accept that GCs have to be run; this is not the right place to optimize for performance.