ordo-one / package-benchmark

Swift benchmark runner with many performance metrics and great CI support
Apache License 2.0

`benchmark baseline check` doesn't respect threshold-tolerances when checking baseline against static thresholds #275

Closed - MahdiBM closed this issue 1 month ago

MahdiBM commented 2 months ago

`benchmark baseline check` doesn't respect threshold-tolerances when checking an already-acquired baseline on disk against static thresholds .json files. Apparently this is because the benchmarks are not loaded, so the tool falls back to the strict threshold-tolerances.

This makes our benchmarks fail even though our fluctuations are under 1% and 100ms.
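
Roughly the kind of CI invocation involved - a sketch with a placeholder baseline name and thresholds directory; the exact flag combination may differ per package-benchmark version:

          # Record a baseline, then check that on-disk baseline against the
          # static threshold .json files (baseline name and path are placeholders).
          swift package --disable-sandbox \
            benchmark baseline update my-baseline

          swift package --disable-sandbox \
            benchmark baseline check my-baseline \
            --check-absolute-path $PWD/Benchmarks/Thresholds/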

MahdiBM commented 1 month ago

@hassila I'm writing some real benchmarks now, and the benchmarks target needs to depend on a big monolithic target. The problem is that when I do `benchmark run` and then `benchmark check`, it builds that monolithic target twice, each build taking 11 minutes on a 64-CPU / 128 GB RAM ARM bare-metal machine.

While I see this as mostly a Swift/SwiftPM problem (builds take so long), I also notice that package-benchmark doesn't need to build the benchmarking target the second time, since we're checking static thresholds against already-calculated baselines.

So hopefully there can be a command or option so that "check" can skip building or running anything at all and only perform checks against what is saved on disk.

hassila commented 1 month ago

Yes, it is a known problem unfortunately:

https://github.com/swiftlang/swift-package-manager/issues/7210

Would be great if you could ping in that issue and explain the problem above - maybe we can add a "--skip-build" to work around it (but it really should be fixed by SwiftPM; we just use the official API and it always rebuilds...).

MahdiBM commented 1 month ago

Sure, but I'm skeptical about when the SwiftPM maintainers will get to any issues. It would be very, very nice to have this feature while we wait a possible few years for SwiftPM.

As always I'm open to submitting a PR if needed.

hassila commented 1 month ago

It is a bit tricky, as we use the build to also get us the paths to the resulting built executables - which are needed.

I tried a quick and dirty --skip-builds flag, but it won't fly.

I am open to suggestions/PRs (still worth pinging on that SwiftPM issue though; I do think the more noise there is about something, the more likely it is to be fixed... the squeaky wheel gets the grease).

hassila commented 1 month ago

This issue should be fixed by https://github.com/ordo-one/package-benchmark/pull/284 now - please have a go at testing against that if you can (it will be merged fairly soon).

MahdiBM commented 1 month ago

@hassila

          swift package --disable-sandbox \
            benchmark thresholds check \
            "${{ github.head_ref || github.ref_name }}" \
            --check-absolute-path $PWD/Benchmarks/Thresholds/ \
            --no-progress \
            --format markdown

did still build the benchmarks, unlike `benchmark read`.

hassila commented 1 month ago

`--check-absolute-path` is not used for that; you want to use `--path`.

It needs to build the benchmarks for the check operation to get the latest threshold tolerances from the code itself.

MahdiBM commented 1 month ago

For what it's worth this issue is resolved now.

Also, for some reason, using:

          swift package --disable-sandbox \
            benchmark baseline update \
            "${{ github.head_ref || github.ref_name }}"

          swift package --disable-sandbox \
            benchmark thresholds check \
            "${{ github.head_ref || github.ref_name }}" \
            --path $PWD/Benchmarks/Thresholds/ \
            --no-progress \
            --format markdown

no longer builds the whole benchmarks target twice. I can tell it still tries to build the benchmark target in the second command, because (1) I've confirmed the threshold-tolerances do work and (2) I can see some warnings about a dependency of the benchmark target (a Swift 6 regression); but it pretty much just reuses the previous build and doesn't waste time building the target again. No complaints, as this solves my whole problem, but it would be nice to know what's going on?!

hassila commented 1 month ago

Glad to hear it is working better now - the build is a bit of a SwiftPM mystery, I think; it should only rebuild what's needed between runs. Are there any differences in OS/runtime environment between your tests when it failed vs. now?

We should probably try to nail down the related SwiftPM bug - maybe this can help track down some more info.

MahdiBM commented 1 month ago

Not really, no environment differences. Maybe the fact that I'm using --disable-sandbox helped? But I think I enabled that earlier and it didn't work? I can try that later to see if it helps anything 🤔.

hassila commented 1 month ago

Perhaps - I haven't nailed down exactly when it rebuilds, just that it's way too often. Happy to hear the results if you try it.

hassila commented 1 month ago

But maybe close this issue then and we can discuss the builds separately?

MahdiBM commented 1 month ago

I mean we can even close this issue and keep talking 😛

MahdiBM commented 1 month ago

@hassila I couldn't figure out what makes SwiftPM decide whether or not to rebuild the benchmark target from the ground up.

Right now, the majority of the time it won't rebuild the whole target, but sometimes it does, even with no code changes (e.g. rerunning the same GitHub Action can result in different behavior... perhaps the .build cache was updated and caused the difference).

I did try removing --disable-sandbox: no difference. Though I feel like I only started getting this "most of the time it uses the cache, but sometimes it rebuilds the whole target" behavior after I removed --disable-sandbox?! Not sure.

hassila commented 1 month ago

Yeah... thanks for trying - hopefully we can eventually figure it out and fix the root cause...

MahdiBM commented 1 month ago

@hassila This "caching sometimes works and sometimes doesn't" situation is flaky enough that I decided not to rely on it and to just do a single check, instead of running the comparison, then reading the result, then doing the check. Just reporting back on what we've ended up doing so far.

MahdiBM commented 1 month ago

This is also why I've filed this issue: https://github.com/ordo-one/package-benchmark/issues/288

MahdiBM commented 1 month ago

@hassila interesting observation:

It's been a few days since I added -c release to all the swift package calls, and there haven't been any double-builds 🤔. I noticed that the benchmarking package builds some things in release mode anyway, so it can reliably benchmark performance, so I thought maybe having both debug and release builds together is too much for SwiftPM. I'll keep monitoring over the next few days and see whether the double-build happens again.
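
Concretely, that just means passing the configuration flag to the existing calls - a sketch of what I mean, assuming swift package accepts -c/--configuration the same way swift build does:

          # Same check as before, now with an explicit release configuration so
          # only a release build of the benchmark target is in play.
          swift package -c release --disable-sandbox \
            benchmark thresholds check \
            "${{ github.head_ref || github.ref_name }}" \
            --path $PWD/Benchmarks/Thresholds/ \
            --no-progress \
            --format markdown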

Also for the record, using -c release did change the result of one of our two benchmarks from 310ms to 290ms. Not sure why that should happen; it might be a sign that we should use -c release.

hassila commented 1 month ago

Hmm, that is weird - if that is correct, it would imply that building in debug somehow invalidates the release build. We're just calling into the SwiftPM API and asking for a release build (and all the normal caching should happen...).

MahdiBM commented 1 month ago

@hassila right ... this is definitely a bug somewhere ...

> it would imply that building in debug somehow invalidates the release build

It would have been nice if it were that simple. The problem is that it isn't consistent in invalidating the release build either, or at least I haven't noticed a pattern.

Just to confirm: -c release is still working as I described above, around 24h later. I'll keep monitoring the situation.