hanhanW opened 10 months ago
Could we get the generated object files or profiles? This is one dispatch only, right?
I may be able to run a profiling / triage pass on this, but it won't be for a few days. If anyone else has cycles / wants to learn about profiling and optimizing the compiler, here's what I would do:
Reference https://iree.dev/developers/debugging/compile-time-regressions/. That is a playbook for working on compile time regressions.
First, identify the compile command that got slower.
The slow compilations show up in the build log as

```
Generating /work/build-tsan/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.vmfb from e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir
```

and related test compilations as part of building `iree-test-deps`. Searching `iree-build/build.ninja` for that string yields the full compile command to repro:

```
COMMAND = cd /usr/local/google/home/scotttodd/code/iree-build/tests/e2e/matmul && /usr/local/google/home/scotttodd/code/iree-build/tools/iree-compile --output-format=vm-bytecode --mlir-print-op-on-diagnostic=false --iree-hal-target-backends=llvm-cpu --iree-opt-data-tiling=false --iree-llvmcpu-enable-ukernels=none /usr/local/google/home/scotttodd/code/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir -o /usr/local/google/home/scotttodd/code/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.vmfb --iree-hal-executable-object-search-path=\"/usr/local/google/home/scotttodd/code/iree-build\" --iree-llvmcpu-embedded-linker-path=\"/usr/local/google/home/scotttodd/code/iree-build/llvm-project/bin/lld\" --iree-llvmcpu-wasm-linker-path=\"/usr/local/google/home/scotttodd/code/iree-build/llvm-project/bin/lld\"
DESC = Generating /usr/local/google/home/scotttodd/code/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.vmfb from e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir
```
(formatted, with the missing line continuation after `--iree-llvmcpu-enable-ukernels=none` restored)

```shell
cd iree-build/tests/e2e/matmul
iree-build/tools/iree-compile \
  --output-format=vm-bytecode \
  --mlir-print-op-on-diagnostic=false \
  --iree-hal-target-backends=llvm-cpu \
  --iree-opt-data-tiling=false \
  --iree-llvmcpu-enable-ukernels=none \
  iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir \
  -o iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.vmfb
```
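The `build.ninja` search step can be sketched generically. The stub file below is illustrative (the real `build.ninja` lives in the build tree and is much larger); the point is that grepping for the generated `.vmfb` name with a couple of context lines recovers the `COMMAND` and `DESC` entries:

```shell
# Create a stub build.ninja entry (illustrative content, not the real file).
cat > /tmp/stub_build.ninja <<'EOF'
build tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.vmfb: CUSTOM_COMMAND
  COMMAND = cd tests/e2e/matmul && tools/iree-compile --output-format=vm-bytecode ...
  DESC = Generating e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.vmfb from e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir
EOF

# Search for the generated file name; -A 2 also prints the COMMAND and DESC lines.
grep -A 2 'local-task.vmfb: ' /tmp/stub_build.ninja
```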
where that .mlir file is generated via:

```shell
cd iree-build/tests/e2e/matmul
iree/.venv/bin/python3 iree/tests/e2e/matmul/generate_e2e_matmul_tests.py \
  --output_code=iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir \
  --output_trace=iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.yaml \
  --module_path=iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.vmfb \
  --lhs_rhs_type=f16 \
  --acc_type=f16 \
  --shapes=large
```
(could upload that file here)
The repro command can be further narrowed down to specific slow executables with the `--iree-hal-dump-executable-sources-to` + `--compile-mode=hal-executable` + `timeout` trick described at https://iree.dev/developers/debugging/compile-time-regressions/#running-git-bisect. With a (minimal) repro command and a culprit commit, you can run your choice of profiling: `-mlir-timing`, perf, Tracy, etc.
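The `timeout` part of that trick can be illustrated without any IREE tools. GNU `timeout` exits with status 124 when the command overruns its budget, and `git bisect run` treats any exit status in 1..127 (except 125) as "bad", so a timed-out compile automatically marks a revision bad. The `sleep 5` below is a hypothetical stand-in for the real `iree-compile` invocation:

```shell
# Stand-in for a compile step that regressed; substitute the real
# iree-compile command (and a realistic budget) when actually bisecting.
timeout 1 sleep 5
status=$?

# GNU timeout reports exit status 124 when the time limit was hit.
if [ "$status" -eq 124 ]; then
  echo "timed out: mark this revision bad"
else
  echo "finished in time: mark this revision good"
fi
```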
Intuition / predictions: I expect that a trace will look like https://github.com/openxla/iree/issues/13189#issuecomment-1516682377, but it would be good to know what code patterns are causing the compiler to generate so much code.
I've started debugging this, following the steps listed above.
Generate the test .mlir using `generate_e2e_matmul_tests.py`:

```shell
python D:/dev/projects/iree/tests/e2e/matmul/generate_e2e_matmul_tests.py \
  --output_code=D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir \
  --output_trace=D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.yaml \
  --module_path=D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.vmfb \
  --lhs_rhs_type=f16 \
  --acc_type=f16 \
  --shapes=large
```
Compile to executable sources, then compile each of those with a timeout:

```shell
TIMEOUT_SECONDS_FOR_COMPILING_EACH_SOURCE=10
SOURCES_DIR=D:/dev/projects/iree-build/tests/e2e/matmul/sources

D:/dev/projects/iree-build/tools/iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-opt-data-tiling=false \
  --iree-llvmcpu-enable-ukernels=none \
  --iree-hal-dump-executable-sources-to=${SOURCES_DIR} \
  --compile-to=executable-sources \
  D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir \
  -o /dev/null

# Run the compiler again on each executable individually.
echo "** Running iree-compile for each executable source **"
SOURCES=($(ls -1 ${SOURCES_DIR}))
for SOURCE in "${SOURCES[@]}"; do
  echo " * Compiling: ${SOURCE} *"
  /usr/bin/timeout --verbose ${TIMEOUT_SECONDS_FOR_COMPILING_EACH_SOURCE} \
    D:/dev/projects/iree-build/tools/iree-compile \
    ${SOURCES_DIR}/${SOURCE} \
    --iree-hal-target-backends=llvm-cpu \
    --iree-opt-data-tiling=false \
    --iree-llvmcpu-enable-ukernels=none \
    --compile-mode=hal-executable \
    -o /dev/null
done
```
```
λ bash D:\dev\scratch\iree-tmp\compile_time_regression_issue_15800.sh
** Running iree-compile for each executable source **
 * Compiling: module_matmul_1000x1000xf16_times_1000x1xf16_into_1000x1xf16_dispatch_0.mlir *
 * Compiling: module_matmul_654x321xf16_times_321x234xf16_into_654x234xf16_dispatch_0.mlir *
 * Compiling: module_matmul_accumulate_1000x1000xf16_times_1000x1xf16_into_1000x1xf16_dispatch_0.mlir *
 * Compiling: module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir *
/usr/bin/timeout: sending signal TERM to command ‘D:/dev/projects/iree-build/tools/iree-compile’
 * Compiling: module_matmul_accumulate_1x1000xf16_times_1000x1000xf16_into_1x1000xf16_dispatch_0.mlir *
 * Compiling: module_matmul_accumulate_DYNxDYNxf16_times_DYNxDYNxf16_into_DYNxDYNxf16_dispatch_0.mlir *
/usr/bin/timeout: sending signal TERM to command ‘D:/dev/projects/iree-build/tools/iree-compile’
 * Compiling: module_matmul_DYNxDYNxf16_times_DYNxDYNxf16_into_DYNxDYNxf16_dispatch_0.mlir *
/usr/bin/timeout: sending signal TERM to command ‘D:/dev/projects/iree-build/tools/iree-compile’
```
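When the log gets long, the timed-out dispatches can be pulled out mechanically: each `sending signal TERM` line immediately follows the dispatch that hung, so a context grep recovers the offending source names. The log content below is a stub mirroring the format above:

```shell
# Stub log in the same format as the script output (illustrative content).
cat > /tmp/compile.log <<'EOF'
 * Compiling: module_matmul_1000x1000xf16_times_1000x1xf16_into_1000x1xf16_dispatch_0.mlir *
 * Compiling: module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir *
/usr/bin/timeout: sending signal TERM to command 'iree-compile'
EOF

# -B 1 prints the line immediately preceding each timeout message,
# i.e. the dispatch that exceeded the budget.
grep -B 1 'sending signal TERM' /tmp/compile.log | grep 'Compiling'
```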
This gist has the generated `e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir` as well as the sources that timed out: https://gist.github.com/ScottTodd/e7286e020554caa5f57f444c7c0a34d4
Next steps
I'm running at https://github.com/openxla/iree/commit/15ca58e19ec76fab94c4aba8f75091c532282d51 (culprit code for the compile time regression is still included in my build) on Windows, RelWithDebInfo, no sanitizers enabled.
Diffing IR with and without `--iree-opt-data-tiling=false --iree-llvmcpu-enable-ukernels=none`: the slow compile path (left, data tiling and ukernels turned off via those flags) is one executable source containing a `linalg.matmul`, while the fast compile path (right, with data tiling and ukernels) is 5 executable sources: 3 pack, 1 mmt4d, 1 unpack:
The slow matmul codegen has this IR at `--mlir-print-ir-before=iree-hal-serialize-target-executables`: https://gist.github.com/ScottTodd/76cd2d9f76cf5461c2900320b6f34269
Trace looks like this:
Sampling shows LLVM codegen doing lots of work:
Thanks a lot, Scott! Reading your report brought something to mind... Could you check if https://github.com/openxla/iree/pull/15819 significantly reduces the compile time for this case?
Here is some `--mlir-print-ir-before=iree-hal-serialize-target-executables` IR from just before the regression: https://gist.github.com/ScottTodd/7e236e34dc5ad001cef51d1883be21b4 (compare with https://gist.github.com/ScottTodd/76cd2d9f76cf5461c2900320b6f34269, which is after the regression)
So ~600 MLIR ops in the executable that is passed to LLVM --> ~1400 ops. Trace time ~5 seconds --> ~14 seconds (approx linear with the number of ops?).
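A quick arithmetic check of that rough-linearity guess, using the approximate numbers above:

```shell
# ~600 -> ~1400 ops and ~5s -> ~14s; compare the two growth ratios.
awk 'BEGIN { printf "op ratio:   %.2f\n", 1400 / 600 }'
awk 'BEGIN { printf "time ratio: %.2f\n", 14 / 5 }'
```

The two ratios (about 2.3x ops vs. 2.8x time) are in the same ballpark, consistent with compile time growing roughly linearly in op count for this case.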
I'll run with that PR next, sure.
Hey @dcaballe - yes, for the specific case I've been profiling (`module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir`) that PR does help significantly.
IR with `--mlir-print-ir-before=iree-hal-serialize-target-executables` (compare with above): https://gist.github.com/ScottTodd/9ca792a3047138826aa79c2a377a3846
~500 ops, ~2.2s in Tracy
Would anyone like any .mlir files, traces, etc. captured while I'm set up to collect them? Can I pass this issue back to @hanhanW / @dcaballe ?
Also still thinking about ways to send a warning/error signal when compilation (and possibly runtime as well) is going to be slow (https://github.com/openxla/iree/issues/13207)...
It would be good if you could attach the `module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir` file to the issue. I can take a look at it when I have some cycles.
That's in one of the gists above; here it is inlined:

```mlir
// module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir
hal.executable public @matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0 {
  hal.executable.variant public @embedded_elf_x86_64 target(<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : index, target_triple = "x86_64-unknown-unknown-eabi-elf", ukernels = "none"}>) {
    hal.executable.export public @matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0_matmul_123x789x456_f16 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>) {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0_matmul_123x789x456_f16() {
        %c0 = arith.constant 0 : index
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<123x456xf16>>
        %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<456x789xf16>>
        %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<readwrite:tensor<123x789xf16>>
        %3 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [123, 456], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<123x456xf16>> -> tensor<123x456xf16>
        %4 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [456, 789], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<456x789xf16>> -> tensor<456x789xf16>
        %5 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [123, 789], strides = [1, 1] : !flow.dispatch.tensor<readwrite:tensor<123x789xf16>> -> tensor<123x789xf16>
        %6 = linalg.matmul ins(%3, %4 : tensor<123x456xf16>, tensor<456x789xf16>) outs(%5 : tensor<123x789xf16>) -> tensor<123x789xf16>
        flow.dispatch.tensor.store %6, %2, offsets = [0, 0], sizes = [123, 789], strides = [1, 1] : tensor<123x789xf16> -> !flow.dispatch.tensor<readwrite:tensor<123x789xf16>>
        return
      }
    }
  }
}
```
okay, thank you!
@ScottTodd and @antiagainst observed that compilation time regressed in the test suite below after landing https://github.com/openxla/iree/commit/68d0df6d0e54e751c6223044320133bf72e465a6. The TSAN job took ~12 minutes 4 weeks ago; it now takes ~1 hour.
For more details, see https://discord.com/channels/689900678990135345/689957613152239638/1181732251659145296