ordo-one / package-benchmark

Swift benchmark runner with many performance metrics and great CI support
Apache License 2.0
326 stars 25 forks source link

Seemingly erroneous scaled, pretty-printed results. #209

Closed loonatick-src closed 11 months ago

loonatick-src commented 11 months ago

OS: macOS 14.1 23B2073 arm64 Package benchmark version: 1.16.0 Swift version: Apple Swift version 5.11-dev (LLVM e131e99f323910c, Swift 1c3bd3e722f8030) Target: arm64-apple-macosx14.0

I had been using Benchmark 1.4 in an internal project, but on upgrading to 1.16.0 the printed results of the benchmarks all became absurdly small even though the benchmarks were taking the same amount of time to execute. I have created a reproducer for this here. It has four simple benchmarks

If you clone the above repo and manually execute the benchmark target

$ swift run -c release DemoBenchmarks

I get e.g. the following output for one of the benchmarks

100% [------------------------------------------------------------] ETA: 00:00:00 | :sleep(1)
Debug results for sleep(1):

Time (wall clock):
       Value     Percentile TotalCount 1/(1-Percentile)

1002438655.000 0.000000000000          1           1.00    # 1 billion ns for sleep(1), makes sense
1002438655.000 0.100000000000          1           1.11
1005060095.000 0.200000000000         10           1.25
1005060095.000 1.000000000000         10
#[Mean    = 1004535808.000, StdDeviation   =   786432.000]
#[Max     = 1005060095.000, Total count    =           10]
#[Buckets =           20, SubBuckets     =         2048]

Time (total CPU):
       Value     Percentile TotalCount 1/(1-Percentile)

   10879.000 0.000000000000          1           1.00
   10879.000 0.100000000000          1           1.11
   11087.000 0.200000000000          2           1.25
   12215.000 0.300000000000          3           1.43
   12543.000 0.400000000000          4           1.67
   13167.000 0.500000000000          5           2.00
   13591.000 0.550000000000          6           2.22
   13591.000 0.600000000000          6           2.50
   15215.000 0.650000000000          7           2.86
   15215.000 0.700000000000          7           3.33
   17215.000 0.750000000000          8           4.00
   17215.000 0.775000000000          8           4.44
   17215.000 0.800000000000          8           5.00
   17679.000 0.825000000000          9           5.71
   17679.000 0.850000000000          9           6.67
   17679.000 0.875000000000          9           8.00
   17679.000 0.887500000000          9           8.89
   17679.000 0.900000000000          9          10.00
   20223.000 0.912500000000         10          11.43
   20223.000 1.000000000000         10
#[Mean    =    14377.200, StdDeviation   =     2948.207]
#[Max     =    20223.000, Total count    =           10]
#[Buckets =           20, SubBuckets     =         2048]

100% [------------------------------------------------------------] ETA: 00:00:00 | :Naive polynomial
Debug results for Naive polynomial:

Time (wall clock):
       Value     Percentile TotalCount 1/(1-Percentile)

159645695.000 0.000000000000          1           1.00
159776767.000 0.100000000000          8           1.11
159907839.000 0.200000000000         14           1.25
160169983.000 0.300000000000         21           1.43
160301055.000 0.400000000000         27           1.67
160432127.000 0.500000000000         33           2.00
160563199.000 0.550000000000         35           2.22
160694271.000 0.600000000000         39           2.50
160825343.000 0.650000000000         45           2.86
160825343.000 0.700000000000         45           3.33
160956415.000 0.750000000000         51           4.00
160956415.000 0.775000000000         51           4.44
160956415.000 0.800000000000         51           5.00
161087487.000 0.825000000000         54           5.71
161087487.000 0.850000000000         54           6.67
161218559.000 0.875000000000         57           8.00
161218559.000 0.887500000000         57           8.89
161218559.000 0.900000000000         57          10.00
161349631.000 0.912500000000         59          11.43
161349631.000 0.925000000000         59          13.33
161480703.000 0.937500000000         62          16.00
161480703.000 0.943750000000         62          17.78
161480703.000 0.950000000000         62          20.00
161480703.000 0.956250000000         62          22.86
161480703.000 0.962500000000         62          26.67
161480703.000 0.968750000000         62          32.00
161480703.000 0.971875000000         62          35.56
161480703.000 0.975000000000         62          40.00
161480703.000 0.978125000000         62          45.71
161480703.000 0.981250000000         62          53.33
163708927.000 0.984375000000         63          64.00
163708927.000 1.000000000000         63
#[Mean    = 160493502.984, StdDeviation   =   651800.259]
#[Max     = 163708927.000, Total count    =           63]
#[Buckets =           20, SubBuckets     =         2048]

Time (total CPU):
       Value     Percentile TotalCount 1/(1-Percentile)

159514623.000 0.000000000000          1           1.00    #### 160-ish million nanoseconds for naive polynomial
159776767.000 0.100000000000          8           1.11
159907839.000 0.200000000000         18           1.25
160038911.000 0.300000000000         23           1.43
160169983.000 0.400000000000         30           1.67
160301055.000 0.500000000000         38           2.00
160301055.000 0.550000000000         38           2.22
160301055.000 0.600000000000         38           2.50
160432127.000 0.650000000000         46           2.86
160432127.000 0.700000000000         46           3.33
160563199.000 0.750000000000         49           4.00
160563199.000 0.775000000000         49           4.44
160694271.000 0.800000000000         53           5.00
160694271.000 0.825000000000         53           5.71
160825343.000 0.850000000000         56           6.67
160825343.000 0.875000000000         56           8.00
160825343.000 0.887500000000         56           8.89
161087487.000 0.900000000000         59          10.00
161087487.000 0.912500000000         59          11.43
161087487.000 0.925000000000         59          13.33
161218559.000 0.937500000000         60          16.00
161218559.000 0.943750000000         60          17.78
161218559.000 0.950000000000         60          20.00
161349631.000 0.956250000000         62          22.86
161349631.000 0.962500000000         62          26.67
161349631.000 0.968750000000         62          32.00
161349631.000 0.971875000000         62          35.56
161349631.000 0.975000000000         62          40.00
161349631.000 0.978125000000         62          45.71
161349631.000 0.981250000000         62          53.33
161480703.000 0.984375000000         63          64.00
161480703.000 1.000000000000         63
#[Mean    = 160239681.016, StdDeviation   =   461178.738]
#[Max     = 161480703.000, Total count    =           63]
#[Buckets =           20, SubBuckets     =         2048]

In particular, one of the benchmarks is just sleep(1), whose wall clock time is being correctly reported as 1 second $10^9$ nanoseconds. The naive polyomial benchmark is taking some 160 million ($1.6 \cdot 10^8$) nanoseconds. Not shown above are numbers for noop/empty loop (42 ns) and blackHole (800,000 ns). But when running using the package extension $ swift package benchmark --target DemoBenchmarks

I get the following results

==============
DemoBenchmarks
==============

Naive polynomial
╒══════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                   │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Time (total CPU) (ns) *  │     160 │     160 │     160 │     161 │     162 │     162 │     162 │      63 │ 
├──────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (wall clock) (ns) * │     160 │     160 │     161 │     161 │     162 │     163 │     163 │      63 │
╘══════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

Noop
╒══════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                   │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Time (total CPU) (ns) *  │       0 │       0 │       0 │       0 │       0 │       0 │       0 │   10000 │
├──────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (wall clock) (ns) * │       0 │       0 │       0 │       0 │       0 │       0 │       0 │    5437 │
╘══════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

blackHole(x)
╒══════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                   │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Time (total CPU) (ns) *  │       1 │       1 │       1 │       1 │       1 │       1 │       1 │   10000 │
├──────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (wall clock) (ns) * │       1 │       1 │       1 │       1 │       1 │       1 │       1 │   10000 │
╘══════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

sleep(1)
╒══════════════════════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╤═════════╕
│ Metric                   │      p0 │     p25 │     p50 │     p75 │     p90 │     p99 │    p100 │ Samples │
╞══════════════════════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╪═════════╡
│ Time (total CPU) (μs) *  │      11 │      12 │      17 │      19 │      20 │      25 │      25 │      10 │
├──────────────────────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│ Time (wall clock) (ms) * │    1003 │    1005 │    1005 │    1005 │    1005 │    1005 │    1005 │      10 │
╘══════════════════════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╧═════════╛

I see that

I did my best to verify that things are not actually being optimized out by inspecting the disassembled binary as well.

loonatick-src commented 11 months ago

Copying the benchmarks here:

import Benchmark
import BMDemos
import Foundation

let benchmarks = {
    let coefficients = Array(stride(from: 0.1, through: 10.0, by: 0.1 * Double.pi))
    Benchmark.defaultConfiguration = .init(
        metrics: [.cpuTotal, .wallClock],
        timeUnits: .automatic,
        warmupIterations: 0,
        scalingFactor: .mega,
        maxDuration: .seconds(10)
    )

    Benchmark("Noop") { benchmark in
        for _ in benchmark.scaledIterations {
        }
    }

    Benchmark("blackHole(x)") { benchmark in
        var x = 1.13413241e42
        for _ in benchmark.scaledIterations {
            blackHole(x)
        }
    }

    Benchmark("sleep(1)",
              configuration: Benchmark.Configuration(scalingFactor: .one)
    ) { benchmark in
        sleep(1)
    }

    Benchmark("Naive polynomial") { benchmark in
        let x = 1.13241241e34
        for _ in benchmark.scaledIterations {
            blackHole(polynomialNaive(x, coefficients: coefficients))
        }
    }
}

And this is what the naive polynomial calculation looks like

public func polynomialNaive(_ x: Double, coefficients cs: [Double]) -> Double {
    var result = 0.0
    for (i,c) in cs.enumerated() {
        result += c * pow(x,Double(i))
    }
    return result
}
loonatick-src commented 11 months ago

Sorry never mind I blundered, the tabular output seems to be dividing by the scaling factor, and everything works out just fine in that case.

hassila commented 11 months ago

You are right, it is now scaled by the inner loop scaling amount by default as it was more intuitive - you can pass --scale if you don't want that scaling. (Except for throughput that is always scaled)

hassila commented 11 months ago

(just fyi, the * denotes that the metric is scaled by the scaling factor)