iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

Drop tracy from CI benchmarks? #16856

Closed. bjacob closed this issue 4 months ago.

bjacob commented 6 months ago

Our benchmarks on CI run twice, without and with Tracy. This is a substantial cost and latency hit (not quite 2x, since the Tracy run does 4 repetitions where the non-Tracy run does 10). The motivation was that we could download and inspect traces from regressions caught on CI, but does anyone do that in practice? We don't inspect CI benchmark regressions every day, and when we do, we usually just reproduce the regression locally, which is often necessary anyway to resolve it.

@pzread @hanhanW @benvanik @ScottTodd

ScottTodd commented 6 months ago

Could run an experiment with and without tracing to quantify the cost. For Linux the code to change appears to be https://github.com/openxla/iree/blob/767a6112abffebd896ceb2f0107c0d603c2a338a/build_tools/benchmarks/run_benchmarks_on_linux.py#L87-L110
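A rough sketch of what such an experiment switch could look like, assuming a hypothetical `--skip-trace-capture` flag (the flag and helper names below are illustrative, not the actual script):

```python
# Hypothetical sketch of gating the Tracy pass in a benchmark driver like
# run_benchmarks_on_linux.py; the flag and helper below are illustrative only.
import argparse


def run_suite(capture_tracy: bool) -> None:
    # Placeholder for the real benchmark invocation. When capturing, the real
    # script also launches a Tracy capture tool and uploads the resulting traces.
    mode = "tracy" if capture_tracy else "plain"
    print(f"running benchmark suite in {mode} mode")


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--skip-trace-capture",
        action="store_true",
        help="Only run the plain timing pass; skip the Tracy-instrumented pass.",
    )
    args = parser.parse_args()
    run_suite(capture_tracy=False)  # timing pass (10 repetitions)
    if not args.skip_trace_capture:
        run_suite(capture_tracy=True)  # traced pass (4 repetitions)


if __name__ == "__main__":
    main()
```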

ScottTodd commented 6 months ago

Another thing we could do would be to set --iree-hal-executable-debug-level=3 to embed source files, since that works now (thanks to https://github.com/openxla/iree/pull/16757). Traces would be more useful with that.
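For reference, a hedged sketch of what adding that flag to the benchmark compile step might look like; the wrapper function and flag list are assumptions, only `--iree-hal-executable-debug-level=3` comes from the discussion above:

```python
# Illustrative only: compile a benchmark module with source embedding enabled
# so that downloaded traces can show dispatch source. The wrapper is
# hypothetical, not the project's actual benchmark tooling.
import subprocess

COMPILE_FLAGS = [
    "--iree-hal-target-backends=llvm-cpu",
    "--iree-hal-executable-debug-level=3",  # embed source files for traces
]


def compile_benchmark(mlir_path: str, vmfb_path: str) -> None:
    """Compile one benchmark module to a .vmfb with debug info embedded."""
    subprocess.run(
        ["iree-compile", mlir_path, "-o", vmfb_path, *COMPILE_FLAGS],
        check=True,
    )
```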

But yeah - not sure if enough people are monitoring the current CI benchmark infra to justify the ongoing costs, and dropping trace collection would likely save quite a bit of time.

pzread commented 6 months ago

I'm okay with dropping it. But IMO it is useful in cases where you don't have access to the devices (e.g. Pixel phones, Nvidia cards).

Recently I was debugging the regressions in https://github.com/openxla/iree/pull/16731 and I was able to download the baseline and PR Tracy files to quickly spot the slowest dispatches and find the problems (in this case the regressions were large, so they were easy to spot).

benvanik commented 6 months ago

the cost should only be in execution time as the same vmfbs are used - I thought we weren't bounded by execution time?

having them nightly/on releases would still be useful, if we have a nightly runner - when you need them you need them, and finding out you need to compare against something older and have to generate the traces first can waste a day

benvanik commented 6 months ago

and yeah, we should probably always have debug info on the bots - it has zero runtime overhead but does introduce a text printing step to compilation - probably worth timing.

bjacob commented 6 months ago

Watch the logs while the benchmarks run and you'll see we are definitely waiting on execution :-) While the vmfbs are shared, there is also the overhead of transferring and storing the traces. Great data point, @pzread - good to hear you actually had a recent use case.

benvanik commented 6 months ago

The other thing we can do is make it opt-in for particular benchmarks. The major benefit to me is looking for improvements/regressions over time that show up in end-to-end performance, and I don't care about having it on every device or every configuration. Not having to check out an ancient branch, get things going, get it on the hardware we measure on, and perfectly replicate the setup is important. But one run of each model architecture on each machine is worth it. We've got bigger fish to fry in our process than removing some of the tiny amount of information we actually have.

bjacob commented 6 months ago

Ran the experiment at https://github.com/openxla/iree/pull/16857... found that wholesale dropping all the related code would be a -345 lines of code shrink, and found the following timings from comparing that PR with another PR I ran today against the same main commit (https://github.com/openxla/iree/pull/16855):

| CI job | With tracy capture (#16855) | Without tracy capture (#16857) | Speedup |
| --- | --- | --- | --- |
| run_benchmarks (c2-standard-60, linux, x86_64, 0, 2) | 22 min 2 s | 11 min 46 s | 1.87x |
| run_benchmarks (c2-standard-60, linux, x86_64, 1, 2) | 21 min 1 s | 13 min 36 s | 1.54x |
| run_benchmarks (pixel-6-pro, android, armv8.2-a, 0, 1) | 21 min 12 s | 11 min 5 s | 1.92x |
hanhanW commented 6 months ago

What @pzread said: it is useful when you don't have access to the devices. I'd use it to debug Pixel regressions when the need arises, which mostly happens when I'm on the LLVM integrate rotation. I think it is okay to drop it because we can reassign those regressions to ARM and Google folks.

benvanik commented 6 months ago

yeah I don't think we care about the pixel phones anymore - I am mostly concerned with losing the ability to look at what fusion decisions/whole architecture decisions we are changing over time. If we can do that on one bot for one configuration (with the option to always add more) and only for the major architectures that'd be enough for me. I don't like the idea of losing coverage entirely, though. We have a really bad story around visibility into memory consumption, latency, and scheduling and the traces are literally all we have.

bjacob commented 6 months ago

What if we only disabled Tracy captures on PR runs?

ScottTodd commented 6 months ago

> What if we only disabled Tracy captures on PR runs?

That seems reasonable to me. If someone wants to see them on presubmit they could even just edit the code that makes them postsubmit only (if it's a switch localized to 1-2 files).

ScottTodd commented 6 months ago

@pzread how do you feel about disabling Tracy on PR runs, or making it an extra opt-in?

benvanik commented 6 months ago

opt-in on PR via a label and then continuously run on main SGTM

pzread commented 6 months ago

Make it optional and off by default on PR runs SGTM
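A minimal sketch of what that gating could look like in the CI scripts, assuming a label-based opt-in; the label name and environment-variable plumbing are assumptions (GitHub Actions exposes the event name and payload via standard environment variables), not the project's actual workflow configuration:

```python
# Hypothetical gating logic: capture Tracy traces on postsubmit/main runs,
# and on pull requests only when an opt-in label is present.
import json
import os

OPT_IN_LABEL = "benchmarks:capture-traces"  # hypothetical label name


def should_capture_traces() -> bool:
    event = os.environ.get("GITHUB_EVENT_NAME", "")
    if event != "pull_request":
        # Postsubmit / nightly runs keep capturing traces.
        return True
    event_path = os.environ.get("GITHUB_EVENT_PATH", "")
    if not event_path or not os.path.exists(event_path):
        return False
    with open(event_path) as f:
        payload = json.load(f)
    labels = {label["name"]
              for label in payload.get("pull_request", {}).get("labels", [])}
    return OPT_IN_LABEL in labels


if __name__ == "__main__":
    print("capture traces:", should_capture_traces())
```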

ScottTodd commented 4 months ago

It's been a few months with no action taken here. Is anyone using the traces from the benchmark CI? At this point I'd vote to remove Tracy from that path entirely, both to save on infrastructure costs and to make it easier to update the version of Tracy used in the project (we currently need to upload some builds of Tracy to GCS, as in https://github.com/iree-org/iree/pull/15807).

benvanik commented 4 months ago

I do occasionally, but am fine with telling people that if they want me to investigate a performance issue they need to get me a trace so 🤷

benvanik commented 4 months ago

(all for having easier tracy bumps!)

bjacob commented 4 months ago

Maybe as a first step disable on PR runs as discussed above https://github.com/iree-org/iree/issues/16856#issuecomment-2012396348. Then after a month of that, ask whether to drop altogether.

ScottTodd commented 4 months ago

> Maybe as a first step disable on PR runs as discussed above #16856 (comment). Then after a month of that, ask whether to drop altogether.

Unless that benchmarking has a dedicated owner committed to maintaining it, I'd rather cut down the costs ASAP (both maintenance complexity and $$$).

ScottTodd commented 4 months ago

> Ran the experiment at #16857... found that wholesale dropping all the related code would be a -345 lines of code shrink, and found the following timings from comparing that PR with another PR I ran today against the same main commit (#16855):
>
> | CI job | With tracy capture (#16855) | Without tracy capture (#16857) | Speedup |
> | --- | --- | --- | --- |
> | run_benchmarks (c2-standard-60, linux, x86_64, 0, 2) | 22 min 2 s | 11 min 46 s | 1.87x |
> | run_benchmarks (c2-standard-60, linux, x86_64, 1, 2) | 21 min 1 s | 13 min 36 s | 1.54x |
> | run_benchmarks (pixel-6-pro, android, armv8.2-a, 0, 1) | 21 min 12 s | 11 min 5 s | 1.92x |

I'm also noticing that the Android times doubled at some point; pixel-6-pro benchmarks now take around 40 minutes.

ScottTodd commented 4 months ago

Oh, I was looking at android-cpu + android-gpu (the default config for postsubmit), while that table was generated on a pull request with just android-cpu. The timings make sense then; no time doubling. Phew.