
tracy-capture takes ~30 minutes on Buildkite benchmark dylib-sync / MobileBertSquad-int8 / little core / pixel4 #8816

Closed. bjacob closed this issue 2 years ago.

bjacob commented 2 years ago

I'm getting timeouts preventing me from submitting #8735.

The particular benchmark timing out is just a victim: it isn't long by itself; it just happens to start right before the 60-minute timeout. Something else is taking nearly 30 minutes just before it.

See log.

Relevant part:

[2022-04-07T21:47:47Z] cmd: /var/lib/buildkite-agent/builds/IREE-LAB-RPI4-Pixel4-2-ubuntu-1008524139-1/iree/iree-benchmark/tracy-capture -f -o /tmp/iree-benchmarks/5f42853da6e2190c7a9201b21cf637775003f6d7/captures/MobileBertSquad [int8] (TFLite) little-core,full-inference,default-flags with IREE-Dylib-Sync @ Pixel-4 (CPU-ARMv8.2-A).tracy
[2022-04-07T22:14:08Z] Connecting to 127.0.0.1:8086...

27 minutes elapse between the first line (issuing the command) and the second line (printed near the beginning of tracy-capture's main() function).

I prepared a diff to add some logging to understand exactly what is taking so long.

But it's not trivial to use, because this is in Tracy, not IREE per se, so one would have to update this to point to a build of tracy-capture with the above patch.

@antiagainst

bjacob commented 2 years ago

Aah, I know! The timestamps are misleading: they are not from the process being executed but from the Python subprocess framework. This tracy-capture process spews a massive amount of logging due to the progress info, similar to what #8806 fixes for adb push, so I'm going to apply a similar fix. If that doesn't fix the problem, it could be that the issue is inherent to this Tracy trace being so large (25 MB), unlike previous ones, so the fix might be to selectively opt out of Tracy captures for such benchmarks.
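
For illustration only (this is a hypothetical sketch, not the actual IREE benchmark harness), the snippet below shows why such timestamps can mislead: the timestamp is applied when the parent Python process reads a line from the pipe, not when the child printed it, and filtering out high-volume progress spam (in the spirit of the adb-push fix in #8806) keeps the log small. The function name and the "%" heuristic are assumptions.

```python
import datetime
import subprocess


def run_and_log(cmd):
    """Run `cmd`, timestamping each output line as the parent sees it.

    Hypothetical sketch: the timestamp records when this parent process read
    the line from the pipe, not when the child actually printed it, which is
    why a log like the one above can show a long gap before an early printout.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    for line in proc.stdout:
        line = line.rstrip()
        # Drop high-volume progress spam (placeholder heuristic), in the same
        # spirit as the adb-push logging fix mentioned above.
        if "%" in line:
            continue
        now = datetime.datetime.utcnow().strftime("%Y-%m-%dT%H:%M:%SZ")
        print(f"[{now}] {line}")
    return proc.wait()
```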

antiagainst commented 2 years ago

Is this specific to the CI? Did you see the same issue locally? Yes, the trace size does matter. For example, we used to have an issue where a 10x perf regression caused captures to contain information for 10x the time; it would just freeze at the end because all the captures (several GBs) were zipped and uploaded, eventually causing the timeout. Though 25 MB should be fine.
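
As a loose sketch of guarding against oversized captures before they hit the zip-and-upload step (all names and the threshold are assumptions, not the actual CI logic):

```python
import os

# Hypothetical size guard; the limit is an assumption for illustration.
MAX_CAPTURE_BYTES = 512 * 1024 * 1024  # 512 MB


def should_upload_capture(capture_path):
    """Return True if the .tracy capture is small enough to zip and upload."""
    size = os.path.getsize(capture_path)
    if size > MAX_CAPTURE_BYTES:
        print(f"Skipping upload of {capture_path}: {size} bytes exceeds limit")
        return False
    return True
```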

bjacob commented 2 years ago

Sorry, I figured it out; it was my own fault. This was with my own #8735 applied, which, I realize now, was a bad idea. What it did was pass the same benchmark_repetition value to the capture run as we do for normal benchmarking. That caused the capture run to increase from 1 to 10 repetitions, causing a corresponding 10x increase in the capture size. The rest was the confusing timestamps from the Python subprocess redirect.
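
For illustration only (the function name and flag handling below are hypothetical, not the actual IREE benchmark driver), this sketch shows the distinction being described: the requested repetition count is forwarded only to normal benchmark runs, while capture runs stay at a single repetition so the trace size doesn't grow with repetitions.

```python
def build_benchmark_cmd(module_cmd, repetitions, capturing):
    """Build the benchmark command line (hypothetical sketch).

    Normal benchmark runs get the requested repetition count; capture runs
    are forced to a single repetition, since each extra repetition grows the
    Tracy capture roughly linearly (10 repetitions ~= 10x capture size).
    """
    cmd = list(module_cmd)
    if capturing:
        cmd.append("--benchmark_repetitions=1")
    else:
        cmd.append(f"--benchmark_repetitions={repetitions}")
    return cmd
```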

antiagainst commented 2 years ago

Ah, I see. Yeah, we don't do repetitions during capture. Sorry, I should have caught that issue in code review.