iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

Profile and optimize the IREE compiler #11994

Open ScottTodd opened 1 year ago

ScottTodd commented 1 year ago

We've reached the point where the time it takes to run the IREE compiler and the memory that the compiler uses are both significant pain points for certain input programs and compilation modes. (Not to be confused with "the time it takes to build the IREE compiler" or "the time it takes to run compiled IREE programs" - isn't language fun?)

This issue summarizes some of the tools we have available for analyzing compiler performance.

Areas of focus

In no particular order, we've seen performance impacts in these areas:

Profiling methods

Profiling with Tracy

We regularly use the Tracy Profiler to analyze IREE runtime performance (code/function sequencing, function execution time, aggregate statistics, memory usage over time, comparisons across runs, etc.). We can similarly use Tracy to analyze IREE compiler performance.

Some earlier work added Tracy instrumentation to MLIR passes via the TracingUtils.h file, which uses MLIR's PassInstrumentation to begin and end trace zones before and after passes. This can be enabled with the CMake options -DIREE_ENABLE_RUNTIME_TRACING=ON -DIREE_ENABLE_COMPILER_TRACING=ON.

[screenshot: Tracy timeline showing instrumented MLIR pass zones]

Tracy sampling (when running with elevated permissions and seeing individual functions/instructions, not just instrumented zones) used to work, but it looks like it has bitrotted.

Memory usage could be tracked using TracyAlloc/TracyFree (IREE_TRACE_ALLOC/IREE_TRACE_FREE with IREE_TRACING_FEATURE_ALLOCATION_TRACKING and IREE_TRACING_FEATURE_ALLOCATION_CALLSTACKS). These would need to be added to some part of LLVM/MLIR, possibly via overloading void* operator new(std::size_t count) and void operator delete(void* ptr) as recommended by Tracy's manual.

Running with --mlir-timing

MLIR has a general instrumentation mode using the --mlir-timing and --mlir-timing-display flags (docs here). This shows the total execution time of each MLIR pass in an easy-to-parse format, though it does not offer as much flexibility or resolution as Tracy profiling.

$ mlir-opt foo.mlir -mlir-disable-threading -pass-pipeline='builtin.module(func.func(cse,canonicalize),convert-func-to-llvm)' -mlir-timing -mlir-timing-display=list

===-------------------------------------------------------------------------===
                      ... Pass execution timing report ...
===-------------------------------------------------------------------------===
  Total Execution Time: 0.0203 seconds

   ---Wall Time---  --- Name ---
   0.0047 ( 55.9%)  Canonicalizer
   0.0019 ( 22.2%)  VerifierPass
   0.0016 ( 18.5%)  LLVMLoweringPass
   0.0003 (  3.4%)  CSE
   0.0002 (  1.9%)  (A) DominanceInfo
   0.0084 (100.0%)  Total

Running with other tools

Longitudinal tracking

We track compile time along with compiled program runtime and other metrics (including compiled program binary size) on perf.iree.dev. These numbers have high variance but are good for longitudinal tracking and general trends across workloads. For example, we can see that int8 (quantized) generally takes much longer to compile than fp32:

[chart: compile-time trends on perf.iree.dev, int8 vs fp32]

Other experiments / visualizations

I prototyped a tool for visualizing MLIR pipelines at https://scotttodd.github.io/iree-llvm-sandbox/web-tools/pipeline-visualizer/. It showed how many ops existed in each dialect over the course of compilation, which could help correlate passes or phases of compilation with performance. (That prototype has not been maintained.)

[screenshot: pipeline visualizer showing op counts per dialect over compilation]

powderluv commented 1 year ago

Here is a sample "big" MLIR file from Stable Diffusion you can use.

Here is a sample command:

/home/anush/github/SHARK/shark.venv/lib/python3.10/site-packages/iree/compiler/tools/../_mlir_libs/iree-compile --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-llvm-embedded-linker-path=/home/anush/github/SHARK/shark.venv/lib/python3.10/site-packages/iree/compiler/tools/../_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvm-target-cpu-features=host --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs -iree-vulkan-target-triple=rdna3-unknown-unknown --iree-flow-enable-conv-img2col-transform -o output.vmfb ./unet2base_8dec_fp16_torch.mlir 

This compilation sometimes fails on Windows systems with 16 GB of RAM.

powderluv commented 1 year ago

Here is a perf report from the MLIR file:

   6.01%  iree-compile  libIREECompiler.so.0  [.] mlir::Block::getParentOp                                             
   5.25%  iree-compile  libIREECompiler.so.0  [.] mlir::OpInterface<mlir::bufferization::BufferizableOpInterface, mlir:
   4.54%  iree-compile  libIREECompiler.so.0  [.] llvm::APFloat::Storage::operator=                                    
   4.07%  iree-compile  libIREECompiler.so.0  [.] mlir::Operation::isProperAncestor                                    
   3.77%  iree-compile  libIREECompiler.so.0  [.] (anonymous namespace)::FoldConstantBase<(anonymous namespace)::FoldCo
   2.62%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::operator=                                   
   2.54%  iree-compile  libIREECompiler.so.0  [.] mlir::bufferization::detail::BufferizableOpInterfaceInterfaceTraits::
   2.49%  iree-compile  libIREECompiler.so.0  [.] mlir::Lexer::lexString                                               
   2.30%  iree-compile  libIREECompiler.so.0  [.] mlir::bufferization::BufferizationOptions::isOpAllowed               
   1.80%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::initFromHalfAPInt                           
   1.72%  iree-compile  libIREECompiler.so.0  [.] mlir::Value::getDefiningOp                                           
   1.40%  iree-compile  libIREECompiler.so.0  [.] mlir::StringAttr::getValue                                           
   1.27%  iree-compile  libIREECompiler.so.0  [.] std::_Function_handler<bool (mlir::Operation*), mlir::bufferization::
   1.22%  iree-compile  libIREECompiler.so.0  [.] mlir::DenseElementsAttr::IntElementIterator::operator*               
   1.22%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::IEEEFloat                                   
   1.14%  iree-compile  libIREECompiler.so.0  [.] wouldCreateReadAfterWriteInterference                                
   1.04%  iree-compile  libIREECompiler.so.0  [.] mlir::Token::getHexStringValue[abi:cxx11]                            
   1.01%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::convertHalfAPFloatToAPInt                   
   0.95%  iree-compile  libIREECompiler.so.0  [.] mlir::DenseIntOrFPElementsAttr::getRaw    
powderluv commented 1 year ago

Tried to run with valgrind but ran into https://github.com/iree-org/iree/issues/11996

ScottTodd commented 1 year ago

Here's a Tracy trace with just instrumentation for the MLIR file and sample command linked above (on my Windows machine, using code from https://github.com/iree-org/iree/commit/5f6f9892b5fe0c4b228e4e659e6205056d580c42): iree_compile_stable_diffusion_vulkan_2023_01_30.zip

[screenshot: Tracy trace of the Stable Diffusion compilation]

We're discussing this here on Discord.

powderluv commented 1 year ago

Here are the two other models that are part of the Stable Diffusion pipeline: VAE and CLIP.

powderluv commented 1 year ago

Using a bytecode-encoded format for the UNet model gets us:

   8.11%  iree-compile  libIREECompiler.so.0  [.] llvm::APFloat::Storage::operator=                                    
   6.62%  iree-compile  libIREECompiler.so.0  [.] (anonymous namespace)::FoldConstantBase<(anonymous namespace)::FoldCo
   4.96%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::operator=                                   
   3.40%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::initFromHalfAPInt                           
   2.20%  iree-compile  libIREECompiler.so.0  [.] mlir::DenseElementsAttr::IntElementIterator::operator*               
   2.16%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::IEEEFloat                                   
   1.75%  iree-compile  libIREECompiler.so.0  [.] mlir::DenseIntOrFPElementsAttr::getRaw                               
   1.74%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::convertHalfAPFloatToAPInt                   
   1.42%  iree-compile  libIREECompiler.so.0  [.] (anonymous namespace)::FoldConstantBase<(anonymous namespace)::FoldCo
   1.29%  iree-compile  [kernel.kallsyms]     [k] clear_page_erms                                                      
   1.16%  iree-compile  [kernel.kallsyms]     [k] asm_exc_page_fault                                                   
   1.14%  iree-compile  libIREECompiler.so.0  [.] mlir::detail::StorageUniquerImpl::getOrCreate                        
   1.07%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::copySignificand                             
   0.99%  iree-compile  libIREECompiler.so.0  [.] mlir::NamedAttribute::getName                                        
   0.87%  iree-compile  libc.so.6             [.] 0x00000000001a0986                                                   
   0.86%  iree-compile  libc.so.6             [.] ____wcstof_l_internal                                                
   0.85%  iree-compile  libIREECompiler.so.0  [.] std::_Function_handler<(anonymous namespace)::FoldConstantBase<(anony
   0.78%  iree-compile  libIREECompiler.so.0  [.] mlir::detail::ElementsAttrRange<mlir::DenseElementsAttr::FloatElement
   0.73%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::~IEEEFloat                                  
   0.69%  iree-compile  libc.so.6             [.] ____wcstold_l_internal                                               
   0.64%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::DenseSetImpl<(anonymous namespace)::ParametricStorageUn
   0.61%  iree-compile  libIREECompiler.so.0  [.] mlir::detail::ShapedTypeInterfaceTraits::Model<mlir::RankedTensorType
   0.58%  iree-compile  libIREECompiler.so.0  [.] propagateLiveness                                                    
   0.58%  iree-compile  libIREECompiler.so.0  [.] mlir::applyPatternsAndFoldGreedily                                   
   0.56%  iree-compile  libIREECompiler.so.0  [.] mlir::Value::getDefiningOp                                           
   0.52%  iree-compile  libIREECompiler.so.0  [.] llvm::StringMapImpl::FindKey                                         
   0.51%  iree-compile  libIREECompiler.so.0  [.] mlir::DictionaryAttr::get                                            
   0.50%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::bitcastToAPInt  
powderluv commented 1 year ago

Here is a BLOOM model in bytecode format. It is one of 70 blocks that we have to compile together for the 176B model.

You can compile with:

sudo perf record ./build/tools/iree-compile --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda -iree-codegen-check-ir-before-llvm-conversion=false -o output.vmfb ../SHARK/bloom_block_0.mlir

Top items are:

  30.43%  iree-compile  libIREECompiler.so.0  [.] mlir::Lexer::lexString                                               
  12.72%  iree-compile  libIREECompiler.so.0  [.] mlir::Token::getHexStringValue[abi:cxx11]                            
  10.43%  iree-compile  libIREECompiler.so.0  [.] GetOrCreateOffsetCache<unsigned long>                                
   8.93%  iree-compile  [kernel.kallsyms]     [k] asm_exc_page_fault                                                   
   7.69%  iree-compile  [kernel.kallsyms]     [k] clear_page_erms                                                      
   4.48%  iree-compile  libIREECompiler.so.0  [.] llvm::detail::IEEEFloat::IEEEFloat                                   
   1.41%  iree-compile  libIREECompiler.so.0  [.] llvm::hashing::detail::hash_combine_range_impl<char const>           
   1.33%  iree-compile  [kernel.kallsyms]     [k] __handle_mm_fault                                                    
   1.10%  iree-compile  [kernel.kallsyms]     [k] down_read_trylock                                                    
   0.77%  iree-compile  [kernel.kallsyms]     [k] up_read                                                              
   0.60%  iree-compile  [kernel.kallsyms]     [k] native_flush_tlb_one_user                                            
   0.60%  iree-compile  libc.so.6             [.] 0x00000000001a11ca                                                   
   0.59%  iree-compile  [kernel.kallsyms]     [k] count_shadow_nodes                                                   
   0.56%  iree-compile  [kernel.kallsyms]     [k] rmqueue_bulk                     
jpienaar commented 1 year ago

I don't see lexString or getHexStringValue in the perf profile if I just use mlir-opt on this file. If this happened during bytecode loading, I'd expect both sides to look the same.

ScottTodd commented 1 year ago

I presented at the IREE community meeting on profiling the compiler using Tracy, slides here (sorry, no recording due to a technical issue >_>)


Here are some of the other issues I've reported / linked to from this issue:

We've also been talking about memory usage, and it seems like we'll get some benefits from general optimization (better deduplication, IR simplification), but some of the larger savings may have to come from upstream improvements like https://discourse.llvm.org/t/rfc-introducing-mlir-operation-properties/67846. There are also some ideas floating around about setting the maximum number of threads or the maximum memory usage, then trimming cached data (like MLIR attributes) when those limits are approached.

allieculp commented 1 year ago

@ScottTodd Bumping this one back up - looks like the attached issues are still open. Should we leave this as a P1 for continued discussion?

ScottTodd commented 1 year ago

Yeah, I'd keep this open for continued discussion. There are other issues filed for specific bottlenecks and overall areas to focus on, while this issue is for discussing overall compiler profiling/optimization strategy, workloads of interest, and current status across all the sub-issues.