ScottTodd opened this issue 1 year ago
Here is a sample "big" MLIR file from Stable Diffusion you can use.
Here is a sample command:
/home/anush/github/SHARK/shark.venv/lib/python3.10/site-packages/iree/compiler/tools/../_mlir_libs/iree-compile --iree-input-type=none --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=vulkan --iree-llvm-embedded-linker-path=/home/anush/github/SHARK/shark.venv/lib/python3.10/site-packages/iree/compiler/tools/../_mlir_libs/iree-lld --mlir-print-debuginfo --mlir-print-op-on-diagnostic=false --iree-llvm-target-cpu-features=host --iree-stream-resource-index-bits=64 --iree-vm-target-index-bits=64 --iree-util-zero-fill-elided-attrs -iree-vulkan-target-triple=rdna3-unknown-unknown --iree-flow-enable-conv-img2col-transform -o output.vmfb ./unet2base_8dec_fp16_torch.mlir
This compilation sometimes fails on Windows systems with 16 GB of RAM.
Here is a perf report from the MLIR file:
6.01% iree-compile libIREECompiler.so.0 [.] mlir::Block::getParentOp
5.25% iree-compile libIREECompiler.so.0 [.] mlir::OpInterface<mlir::bufferization::BufferizableOpInterface, mlir:
4.54% iree-compile libIREECompiler.so.0 [.] llvm::APFloat::Storage::operator=
4.07% iree-compile libIREECompiler.so.0 [.] mlir::Operation::isProperAncestor
3.77% iree-compile libIREECompiler.so.0 [.] (anonymous namespace)::FoldConstantBase<(anonymous namespace)::FoldCo
2.62% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::operator=
2.54% iree-compile libIREECompiler.so.0 [.] mlir::bufferization::detail::BufferizableOpInterfaceInterfaceTraits::
2.49% iree-compile libIREECompiler.so.0 [.] mlir::Lexer::lexString
2.30% iree-compile libIREECompiler.so.0 [.] mlir::bufferization::BufferizationOptions::isOpAllowed
1.80% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::initFromHalfAPInt
1.72% iree-compile libIREECompiler.so.0 [.] mlir::Value::getDefiningOp
1.40% iree-compile libIREECompiler.so.0 [.] mlir::StringAttr::getValue
1.27% iree-compile libIREECompiler.so.0 [.] std::_Function_handler<bool (mlir::Operation*), mlir::bufferization::
1.22% iree-compile libIREECompiler.so.0 [.] mlir::DenseElementsAttr::IntElementIterator::operator*
1.22% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::IEEEFloat
1.14% iree-compile libIREECompiler.so.0 [.] wouldCreateReadAfterWriteInterference
1.04% iree-compile libIREECompiler.so.0 [.] mlir::Token::getHexStringValue[abi:cxx11]
1.01% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::convertHalfAPFloatToAPInt
0.95% iree-compile libIREECompiler.so.0 [.] mlir::DenseIntOrFPElementsAttr::getRaw
I tried running under Valgrind but ran into https://github.com/iree-org/iree/issues/11996
Here's a Tracy trace with just instrumentation for the MLIR file and sample command linked above (on my Windows machine, using code from https://github.com/iree-org/iree/commit/5f6f9892b5fe0c4b228e4e659e6205056d580c42): iree_compile_stable_diffusion_vulkan_2023_01_30.zip
We're discussing this here on Discord
Using a bytecode-encoded format for UNet gets us:
8.11% iree-compile libIREECompiler.so.0 [.] llvm::APFloat::Storage::operator=
6.62% iree-compile libIREECompiler.so.0 [.] (anonymous namespace)::FoldConstantBase<(anonymous namespace)::FoldCo
4.96% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::operator=
3.40% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::initFromHalfAPInt
2.20% iree-compile libIREECompiler.so.0 [.] mlir::DenseElementsAttr::IntElementIterator::operator*
2.16% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::IEEEFloat
1.75% iree-compile libIREECompiler.so.0 [.] mlir::DenseIntOrFPElementsAttr::getRaw
1.74% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::convertHalfAPFloatToAPInt
1.42% iree-compile libIREECompiler.so.0 [.] (anonymous namespace)::FoldConstantBase<(anonymous namespace)::FoldCo
1.29% iree-compile [kernel.kallsyms] [k] clear_page_erms
1.16% iree-compile [kernel.kallsyms] [k] asm_exc_page_fault
1.14% iree-compile libIREECompiler.so.0 [.] mlir::detail::StorageUniquerImpl::getOrCreate
1.07% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::copySignificand
0.99% iree-compile libIREECompiler.so.0 [.] mlir::NamedAttribute::getName
0.87% iree-compile libc.so.6 [.] 0x00000000001a0986
0.86% iree-compile libc.so.6 [.] ____wcstof_l_internal
0.85% iree-compile libIREECompiler.so.0 [.] std::_Function_handler<(anonymous namespace)::FoldConstantBase<(anony
0.78% iree-compile libIREECompiler.so.0 [.] mlir::detail::ElementsAttrRange<mlir::DenseElementsAttr::FloatElement
0.73% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::~IEEEFloat
0.69% iree-compile libc.so.6 [.] ____wcstold_l_internal
0.64% iree-compile libIREECompiler.so.0 [.] llvm::detail::DenseSetImpl<(anonymous namespace)::ParametricStorageUn
0.61% iree-compile libIREECompiler.so.0 [.] mlir::detail::ShapedTypeInterfaceTraits::Model<mlir::RankedTensorType
0.58% iree-compile libIREECompiler.so.0 [.] propagateLiveness
0.58% iree-compile libIREECompiler.so.0 [.] mlir::applyPatternsAndFoldGreedily
0.56% iree-compile libIREECompiler.so.0 [.] mlir::Value::getDefiningOp
0.52% iree-compile libIREECompiler.so.0 [.] llvm::StringMapImpl::FindKey
0.51% iree-compile libIREECompiler.so.0 [.] mlir::DictionaryAttr::get
0.50% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::bitcastToAPInt
Here is a BLOOM model in bytecode format. It is one of 70 blocks that we have to compile together for the 176B model.
You can compile with:
sudo perf record ./build/tools/iree-compile --iree-vm-bytecode-module-output-format=flatbuffer-binary --iree-hal-target-backends=cuda -iree-codegen-check-ir-before-llvm-conversion=false -o output.vmfb ../SHARK/bloom_block_0.mlir
Top items are:
30.43% iree-compile libIREECompiler.so.0 [.] mlir::Lexer::lexString
12.72% iree-compile libIREECompiler.so.0 [.] mlir::Token::getHexStringValue[abi:cxx11]
10.43% iree-compile libIREECompiler.so.0 [.] GetOrCreateOffsetCache<unsigned long>
8.93% iree-compile [kernel.kallsyms] [k] asm_exc_page_fault
7.69% iree-compile [kernel.kallsyms] [k] clear_page_erms
4.48% iree-compile libIREECompiler.so.0 [.] llvm::detail::IEEEFloat::IEEEFloat
1.41% iree-compile libIREECompiler.so.0 [.] llvm::hashing::detail::hash_combine_range_impl<char const>
1.33% iree-compile [kernel.kallsyms] [k] __handle_mm_fault
1.10% iree-compile [kernel.kallsyms] [k] down_read_trylock
0.77% iree-compile [kernel.kallsyms] [k] up_read
0.60% iree-compile [kernel.kallsyms] [k] native_flush_tlb_one_user
0.60% iree-compile libc.so.6 [.] 0x00000000001a11ca
0.59% iree-compile [kernel.kallsyms] [k] count_shadow_nodes
0.56% iree-compile [kernel.kallsyms] [k] rmqueue_bulk
I don't see lexString or getHexStringValue in the perf profile if I just use mlir-opt on this file. If this cost were incurred while reading the bytecode, I'd expect both sides to look the same.
I presented at the IREE community meeting on profiling the compiler using Tracy, slides here (sorry, no recording due to a technical issue >_>)
Here are some of the other issues I've reported / linked to from this issue:
We've also been talking about memory usage, and it seems like we'll get some benefits from general optimization (better deduplication, IR simplification), but some of the larger savings may have to come from upstream improvements like https://discourse.llvm.org/t/rfc-introducing-mlir-operation-properties/67846. There are also some ideas floating around about setting a maximum number of threads or a maximum memory usage, then trimming cached data (like MLIR attributes) when those limits are approached.
@ScottTodd Bumping this one back up - looks like the attached issues are still open. Should we leave this as a P1 for continued discussion?
Yeah, I'd keep this open for continued discussion. There are other issues filed for specific bottlenecks and overall areas to focus on, while this issue is for discussing overall compiler profiling/optimization strategy, workloads of interest, and current status across all the sub-issues.
We've reached the point where the time it takes to run the IREE compiler and the memory that the compiler uses are both significant pain points for certain input programs and compilation modes. (Not to be confused with "the time it takes to build the IREE compiler" or "the time it takes to run compiled IREE programs" - isn't language fun?)
This issue summarizes some of the tools we have available for analyzing compiler performance.
Areas of focus
In no particular order, we've seen performance impacts in these areas:
`#util.composite` and other ops to mitigate alloc explosions; tosa/linalg/etc. could do similar things

Profiling methods
Profiling with Tracy
We regularly use the Tracy Profiler to analyze IREE runtime performance (code/function sequencing, function execution time, aggregate statistics, memory usage over time, comparisons across runs, etc.). We can similarly use Tracy to analyze IREE compiler performance.
Some earlier work added Tracy instrumentation to MLIR passes via the TracingUtils.h file, which uses MLIR's `PassInstrumentation` to begin and end trace zones before and after passes. This can be enabled with the CMake options `-DIREE_ENABLE_RUNTIME_TRACING=ON -DIREE_ENABLE_COMPILER_TRACING=ON`.

Tracy sampling (when running with elevated permissions and seeing individual functions/instructions, not just instrumented zones) used to work, but it looks like it has bitrotted. One symptom: `iree-compile` finishes before Tracy can pull across its full backlog of data, failing with:

tracy/server/TracyWorker.cpp:6693: void tracy::Worker::ProcessContextSwitch(const tracy::QueueContextSwitch &): Assertion `data.empty() || (uint64_t)data.back().End() <= (uint64_t)time' failed.
Memory usage could be tracked using `TracyAlloc`/`TracyFree` (`IREE_TRACE_ALLOC`/`IREE_TRACE_FREE` with `IREE_TRACING_FEATURE_ALLOCATION_TRACKING` and `IREE_TRACING_FEATURE_ALLOCATION_CALLSTACKS`). These would need to be added to some part of LLVM/MLIR, possibly via overloading `void* operator new(std::size_t count)` and `void operator delete(void* ptr)` as recommended by Tracy's manual.
Running with --mlir-timing
MLIR has a general instrumentation mode using the `--mlir-timing` and `--mlir-timing-display` flags (docs here). This shows total execution time for each MLIR pass in an easy-to-parse format, though it does not offer as much flexibility or resolution as Tracy profiling.

Running with other tools
`perf`: https://perf.wiki.kernel.org/index.php/Main_Page

Longitudinal tracking
We track compile time along with compiled program runtime and other metrics (including compiled program binary size) on perf.iree.dev. These numbers have high variance but are good for longitudinal tracking and general trends across workloads. For example, we can see that compiling for int8 (quantized) generally takes much longer than for fp32:
Other experiments / visualizations
I prototyped a tool for visualizing MLIR pipelines at https://scotttodd.github.io/iree-llvm-sandbox/web-tools/pipeline-visualizer/. It showed how many ops existed in each dialect over the course of compilation, which could help correlate which passes or phases of compilation contribute to performance. (That prototype has not been maintained.)