ordo-one / package-benchmark

Swift benchmark runner with many performance metrics and great CI support
Apache License 2.0

fix: threshold-tolerances sometimes are not respected #276

Open MahdiBM opened 2 weeks ago

MahdiBM commented 2 weeks ago

Description

Resolves #275 by loading the benchmarks when they should be loaded, so that the threshold tolerances declared in the benchmark code are actually applied.
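
To illustrate the idea (a hypothetical, self-contained sketch with made-up names, not the actual package-benchmark internals): if the checker compares results against the threshold file before the benchmark definitions have been loaded, the tolerances declared in the benchmark code are simply not available, and only an exact comparison remains.

```swift
// Hypothetical sketch of the ordering problem this PR addresses.
// All names here are illustrative; this is not package-benchmark's real code.

struct Tolerances {
    var relativeP90: Double   // percent
    var absoluteP90: Double   // nanoseconds
}

struct ThresholdChecker {
    // Tolerances declared in the benchmark code, keyed by benchmark name.
    // They only become available once the benchmark definitions are loaded.
    var declaredTolerances: [String: Tolerances] = [:]

    func deviationIsTolerated(name: String, measured: Double, threshold: Double) -> Bool {
        guard let t = declaredTolerances[name] else {
            // Benchmarks never loaded: no tolerances to apply, so only an
            // exact match passes -- the behavior this PR fixes.
            return measured == threshold
        }
        // How relative and absolute tolerances combine is simplified here.
        let diff = abs(measured - threshold)
        return diff <= t.absoluteP90 || diff <= threshold * t.relativeP90 / 100
    }
}
```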

How Has This Been Tested?

Benchmark code:

```swift
let benchmarks = {
    Benchmark.defaultConfiguration = .init(
        metrics: [.cpuTotal],
        timeUnits: .nanoseconds,
        warmupIterations: 1,
        scalingFactor: .one,
        maxDuration: .seconds(1000),
        maxIterations: 1
    )

    func defaultCounter() -> Int { 10 }

    func dummyCounter(_ count: Int) {
        for index in 0 ..< count {
            blackHole(index)
        }
    }

    Benchmark(
        "CountNumbersBenchmark",
        configuration: .init(
            scalingFactor: .kilo,
            thresholds: [
                .cpuTotal: .init(
                    /// Tolerate up to 1% of difference compared to the threshold.
                    relative: [.p90: 1],
                    /// Tolerate up to 80ms of difference compared to the threshold.
                    absolute: [.p90: 80_000_000]
                )
            ]
        )
    ) { _ in
        for _ in 0 ... 1_000_000 {
            dummyCounter(defaultCounter() * 1000)
        }
    }
}
```

Right now, the following is the result of the benchmark, with fixes applied:

✅ Pull request has no significant performance differences ✅

Benchmark check running at 2024-09-18 02:48:03 UTC

Baseline mmbm-no-twice-benchmark-run is EQUAL to the defined absolute baseline thresholds. (--check-absolute)

Baseline mmbm-no-twice-benchmark-run

Host 0d12d771afd8 with 64 aarch64 processors with 125 GB memory, running:
#24~22.04.1-Ubuntu SMP Sat Jul 20 10:48:15 UTC 2024

MyBenchmarks

CountNumbersBenchmark

| Metric | p0 | p25 | p50 | p75 | p90 | p99 | p100 | Samples |
|---|---|---|---|---|---|---|---|---|
| Time (total CPU) (ns) * | 7710000 | 7710000 | 7710000 | 7710000 | 7710000 | 7710000 | 7710000 | 1 |

As you can see, the result shows 7710000, which is really 7.71s of CPU time (now that I look, that ns label must be a bug, probably caused by the scaling-factor usage; see the benchmark code above). But the threshold file is:

```json
{
  "cpuTotal" : 7700000000
}
```

Without this fix, this check would fail since the numbers are not exactly equal. With this fix, the check completes successfully because it does load the threshold tolerances declared in the benchmark code:

```swift
thresholds: [
    .cpuTotal: .init(
        /// Tolerate up to 1% of difference compared to the threshold.
        relative: [.p90: 1],
        /// Tolerate up to 80ms of difference compared to the threshold.
        absolute: [.p90: 80_000_000]
    )
]
```
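
Plugging the numbers above into those tolerances (a quick sanity-check sketch of the arithmetic, not the library's actual implementation; how the two tolerances are combined is glossed over here, since both pass anyway):

```swift
// Sanity check of the numbers in this PR; not package-benchmark's actual code.
let thresholdNs = 7_700_000_000.0           // p90 value from the threshold file
let measuredNs  = 7_710_000_000.0           // measured p90 CPU time (7.71s)

let absoluteToleranceNs = 80_000_000.0      // absolute: [.p90: 80_000_000], i.e. 80ms
let relativeTolerancePercent = 1.0          // relative: [.p90: 1], i.e. 1%

let diff = abs(measuredNs - thresholdNs)                        // 10_000_000 ns = 10ms
print(diff <= absoluteToleranceNs)                              // true: 10ms <= 80ms
print(diff <= thresholdNs * relativeTolerancePercent / 100)     // true: 10ms <= 77ms
```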


codecov[bot] commented 2 weeks ago

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 65.32%. Comparing base (75e8622) to head (a19235e).

Additional details and impacted files:

```diff
@@           Coverage Diff           @@
##             main     #276   +/-   ##
=======================================
  Coverage   65.32%   65.32%
=======================================
  Files          33       33
  Lines        3610     3610
=======================================
  Hits         2358     2358
  Misses       1252     1252
```
hassila commented 2 weeks ago

Thanks for the PR, I will try to review soonish; I am out of office this week, so it will take some time.

MahdiBM commented 2 weeks ago

No worries, no rush, thanks 🙏.

MahdiBM commented 2 weeks ago

I wonder if this has now made the tool run the benchmarks again? I can't tell for sure; I'll have to do some investigation.

MahdiBM commented 2 weeks ago

Ok, apparently there are performance regressions in the Swift 6 Linux toolchains, ranging from significant to absolutely massive ... that might be what made me think it's rerunning the benchmarks. Still investigating.