iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Profile and optimize the IREE compiler #11994

Open ScottTodd opened 1 year ago

ScottTodd commented 1 year ago

We've reached the point where the time it takes to run the IREE compiler and the memory that the compiler uses are both significant pain points for certain input programs and compilation modes. (Not to be confused with "the time it takes to build the IREE compiler" or "the time it takes to run compiled IREE programs" - isn't language fun?)

This issue summarizes some of the tools we have available for analyzing compiler performance.

Areas of focus

In no particular order, we've seen performance impacts in these areas:

Profiling methods

Profiling with Tracy

We regularly use the Tracy Profiler to analyze IREE runtime performance (code/function sequencing, function execution time, aggregate statistics, memory usage over time, comparisons across runs, etc.). We can similarly use Tracy to analyze IREE compiler performance.

Some earlier work added Tracy instrumentation to MLIR passes via the TracingUtils.h file, which uses MLIR's PassInstrumentation to begin and end trace zones before and after passes. This can be enabled with the CMake options -DIREE_ENABLE_RUNTIME_TRACING=ON -DIREE_ENABLE_COMPILER_TRACING=ON.

[screenshot: Tracy timeline showing instrumented MLIR pass zones]

Tracy sampling (when running with elevated permissions and seeing individual functions/instructions, not just instrumented zones) used to work, but it looks like it has bitrotted.

Memory usage could be tracked using TracyAlloc/TracyFree (IREE_TRACE_ALLOC/IREE_TRACE_FREE with IREE_TRACING_FEATURE_ALLOCATION_TRACKING and IREE_TRACING_FEATURE_ALLOCATION_CALLSTACKS). These would need to be added to some part of LLVM/MLIR, possibly via overloading void* operator new(std::size_t count) and void operator delete(void* ptr) as recommended by Tracy's manual.

Running with --mlir-timing

MLIR has a general instrumentation mode using the --mlir-timing and --mlir-timing-display flags (docs here). This shows the total execution time of each MLIR pass in an easy-to-parse format, though it does not offer as much flexibility or resolution as Tracy profiling.

$ mlir-opt foo.mlir -mlir-disable-threading -pass-pipeline='builtin.module(func.func(cse,canonicalize),convert-func-to-llvm)' -mlir-timing -mlir-timing-display=list

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0203 seconds

   ---Wall Time---  --- Name ---
   0.0047 ( 55.9%)  Canonicalizer
   0.0019 ( 22.2%)  VerifierPass
   0.0016 ( 18.5%)  LLVMLoweringPass
   0.0003 (  3.4%)  CSE
   0.0002 (  1.9%)  (A) DominanceInfo
   0.0084 (100.0%)  Total

Running with other tools

Longitudinal tracking

We track compile time along with compiled program runtime and other metrics (including compiled program binary size) on perf.iree.dev. These numbers have high variance but are good for longitudinal tracking and general trends across workloads. For example, we can see that int8 (quantized) generally takes much longer to compile than fp32:

[chart: compile-time trends on perf.iree.dev, int8 vs fp32]

Other experiments / visualizations

I prototyped a tool for visualizing MLIR pipelines at https://scotttodd.github.io/iree-llvm-sandbox/web-tools/pipeline-visualizer/. It showed how many ops existed in each dialect over the course of compilation, which could help correlate passes or phases of compilation with performance. (That prototype has not been maintained.)

[screenshot: pipeline visualizer showing op counts per dialect over compilation]

powderluv commented 1 year ago

Here is a sample "big" MLIR file from Stable Diffusion you can use.

Here is a sample command:

/home/anush/github/SHARK/shark.venv/lib/python3.10/site-packages/iree/compiler/tools/../_mlir_libs/iree-compile --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-llvm-embedded-linker-path=/home/anush/github/SHARK/shark.venv/lib/python3.10/site-packages/iree/compiler/tools/../_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvm-target-cpu-features=host --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs -iree-vulkan-target-triple=rdna3-unknown-unknown --iree-flow-enable-conv-img2col-transform -o output.vmfb ./unet2base_8dec_fp16_torch.mlir 

This compilation sometimes fails on Windows systems with 16 GB of RAM.

powderluv commented 1 year ago

Here is a perf report from the MLIR file:

   6.01%  iree-compile  libIREECompiler.so.0  [.] mlir::Block::getParentOp                                             
   5.25%  iree-compile  libIREECompiler.so.0  [.] mlir::OpInterface<mlir::bufferization::BufferizableOpInterface, mlir:
   4.54%  iree-compile  libIREECompiler.so.0  [.] llvm::APFloat::Storage::operator=                                    
   4.07%  iree-compile  libIREECompiler.so.0  [.] mlir::Operation::isProperAncestor                                    
   3.77%  iree-compile  libIREECompiler.so.0  [.] (anonymous namespace)::FoldConstantBase<(anonymous namespace)::FoldCo
   2.62%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::operator=                                   
   2.54%  iree-compile  libIREECompiler.so.0  [.] mlir::bufferization::detail::BufferizableOpInterfaceInterfaceTraits::
   2.49%  iree-compile  libIREECompiler.so.0  [.] mlir::Lexer::lexString                                               
   2.30%  iree-compile  libIREECompiler.so.0  [.] mlir::bufferization::BufferizationOptions::isOpAllowed               
   1.80%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::initFromHalfAPInt                           
   1.72%  iree-compile  libIREECompiler.so.0  [.] mlir::Value::getDefiningOp                                           
   1.40%  iree-compile  libIREECompiler.so.0  [.] mlir::StringAttr::getValue                                           
   1.27%  iree-compile  libIREECompiler.so.0  [.] std::_Function_handler<bool (mlir::Operation*), mlir::bufferization::
   1.22%  iree-compile  libIREECompiler.so.0  [.] mlir::DenseElementsAttr::IntElementIterator::operator*               
   1.22%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::IEEEFloat                                   
   1.14%  iree-compile  libIREECompiler.so.0  [.] wouldCreateReadAfterWriteInterference                                
   1.04%  iree-compile  libIREECompiler.so.0  [.] mlir::Token::getHexStringValue[abi:cxx11]                            
   1.01%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::convertHalfAPFloatToAPInt                   
   0.95%  iree-compile  libIREECompiler.so.0  [.] mlir::DenseIntOrFPElementsAttr::getRaw    
powderluv commented 1 year ago

Tried to run with valgrind but ran into https://github.com/iree-org/iree/issues/11996

ScottTodd commented 1 year ago

Here's a Tracy trace with just instrumentation for the MLIR file and sample command linked above (on my Windows machine, using code from https://github.com/iree-org/iree/commit/5f6f9892b5fe0c4b228e4e659e6205056d580c42): iree_compile_stable_diffusion_vulkan_2023_01_30.zip

[screenshot: Tracy trace of the Stable Diffusion compilation]

We're discussing this here on Discord.

powderluv commented 1 year ago

Here are the two other models that are part of the Stable Diffusion pipeline: VAE and CLIP.

powderluv commented 1 year ago

Using a bytecode-encoded format for the UNet model gets us:

   8.11%  iree-compile  libIREECompiler.so.0  [.] llvm::APFloat::Storage::operator=                                    
   6.62%  iree-compile  libIREECompiler.so.0  [.] (anonymous namespace)::FoldConstantBase<(anonymous namespace)::FoldCo
   4.96%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::operator=                                   
   3.40%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::initFromHalfAPInt                           
   2.20%  iree-compile  libIREECompiler.so.0  [.] mlir::DenseElementsAttr::IntElementIterator::operator*               
   2.16%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::IEEEFloat                                   
   1.75%  iree-compile  libIREECompiler.so.0  [.] mlir::DenseIntOrFPElementsAttr::getRaw                               
   1.74%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::convertHalfAPFloatToAPInt                   
   1.42%  iree-compile  libIREECompiler.so.0  [.] (anonymous namespace)::FoldConstantBase<(anonymous namespace)::FoldCo
   1.29%  iree-compile  [kernel.kallsyms]     [k] clear_page_erms                                                      
   1.16%  iree-compile  [kernel.kallsyms]     [k] asm_exc_page_fault                                                   
   1.14%  iree-compile  libIREECompiler.so.0  [.] mlir::detail::StorageUniquerImpl::getOrCreate                        
   1.07%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::copySignificand                             
   0.99%  iree-compile  libIREECompiler.so.0  [.] mlir::NamedAttribute::getName                                        
   0.87%  iree-compile  libc.so.6             [.] 0x00000000001a0986                                                   
   0.86%  iree-compile  libc.so.6             [.] ____wcstof_l_internal                                                
   0.85%  iree-compile  libIREECompiler.so.0  [.] std::_Function_handler<(anonymous namespace)::FoldConstantBase<(anony
   0.78%  iree-compile  libIREECompiler.so.0  [.] mlir::detail::ElementsAttrRange<mlir::DenseElementsAttr::FloatElement
   0.73%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::~IEEEFloat                                  
   0.69%  iree-compile  libc.so.6             [.] ____wcstold_l_internal                                               
   0.64%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::DenseSetImpl<(anonymous namespace)::ParametricStorageUn
   0.61%  iree-compile  libIREECompiler.so.0  [.] mlir::detail::ShapedTypeInterfaceTraits::Model<mlir::RankedTensorType
   0.58%  iree-compile  libIREECompiler.so.0  [.] propagateLiveness                                                    
   0.58%  iree-compile  libIREECompiler.so.0  [.] mlir::applyPatternsAndFoldGreedily                                   
   0.56%  iree-compile  libIREECompiler.so.0  [.] mlir::Value::getDefiningOp                                           
   0.52%  iree-compile  libIREECompiler.so.0  [.] llvm::StringMapImpl::FindKey                                         
   0.51%  iree-compile  libIREECompiler.so.0  [.] mlir::DictionaryAttr::get                                            
   0.50%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::bitcastToAPInt  
powderluv commented 1 year ago

Here is a BLOOM model in bytecode format. It is one of 70 blocks that we have to compile together for the 176B model.

You can compile with:

sudo perf record ./build/tools/iree-compile --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda -iree-codegen-check-ir-before-llvm-conversion=false -o output.vmfb ../SHARK/bloom_block_0.mlir

Top items are:

  30.43%  iree-compile  libIREECompiler.so.0  [.] mlir::Lexer::lexString                                               
  12.72%  iree-compile  libIREECompiler.so.0  [.] mlir::Token::getHexStringValue[abi:cxx11]                            
  10.43%  iree-compile  libIREECompiler.so.0  [.] GetOrCreateOffsetCache<unsigned long>                                
   8.93%  iree-compile  [kernel.kallsyms]     [k] asm_exc_page_fault                                                   
   7.69%  iree-compile  [kernel.kallsyms]     [k] clear_page_erms                                                      
   4.48%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::IEEEFloat                                   
   1.41%  iree-compile  libIREECompiler.so.0  [.] llvm::hashing::detail::hash_combine_range_impl<char const>           
   1.33%  iree-compile  [kernel.kallsyms]     [k] __handle_mm_fault                                                    
   1.10%  iree-compile  [kernel.kallsyms]     [k] down_read_trylock                                                    
   0.77%  iree-compile  [kernel.kallsyms]     [k] up_read                                                              
   0.60%  iree-compile  [kernel.kallsyms]     [k] native_flush_tlb_one_user                                            
   0.60%  iree-compile  libc.so.6             [.] 0x00000000001a11ca                                                   
   0.59%  iree-compile  [kernel.kallsyms]     [k] count_shadow_nodes                                                   
   0.56%  iree-compile  [kernel.kallsyms]     [k] rmqueue_bulk                     
jpienaar commented 1 year ago

I don't see lexString or getHexStringValue in the perf profile if I just use mlir-opt on this file. If this happened during bytecode loading, I'd expect both sides to look the same.

ScottTodd commented 1 year ago

I presented at the IREE community meeting on profiling the compiler using Tracy, slides here (sorry, no recording due to a technical issue >_>)


Here are some of the other issues I've reported / linked to from this issue:

We've also been talking about memory usage, and it seems like we'll get some benefits from general optimization (better deduplication, IR simplification), but some of the larger savings may have to come from upstream improvements like https://discourse.llvm.org/t/rfc-introducing-mlir-operation-properties/67846. There are also some ideas floating around about setting the maximum number of threads or the maximum memory usage, then trimming cached data (like MLIR attributes) when those limits are approached.

allieculp commented 1 year ago

@ScottTodd Bumping this one back up - looks like the attached issues are still open. Should we leave this as a P1 for continued discussion?

ScottTodd commented 1 year ago

Yeah, I'd keep this open for continued discussion. There are other issues filed for specific bottlenecks and overall areas to focus on, while this issue is for discussing overall compiler profiling/optimization strategy, workloads of interest, and current status across all the sub-issues.