iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/

The compilation time is hugely regressed after #15752 (#15800)

Open · hanhanW opened this issue 10 months ago

hanhanW commented 10 months ago

@ScottTodd and @antiagainst observed that the compilation time regressed in the test suite below after landing https://github.com/openxla/iree/commit/68d0df6d0e54e751c6223044320133bf72e465a6. The TSAN job took ~12 minutes 4 weeks ago and now takes ~1 hour.

For more details, see https://discord.com/channels/689900678990135345/689957613152239638/1181732251659145296

dcaballe commented 10 months ago

Could we get the generated object files or profiles? This is one dispatch only, right?

ScottTodd commented 10 months ago

I may be able to run a profiling / triage pass on this, but it won't be for a few days. If anyone else has cycles / wants to learn about profiling and optimizing the compiler, here's what I would do:


- Reference https://iree.dev/developers/debugging/compile-time-regressions/. That page is a playbook for working on compile-time regressions.
- First, narrow the repro command down to the specific slow executables with the --iree-hal-dump-executable-sources-to / --compile-mode=hal-executable / timeout trick described at https://iree.dev/developers/debugging/compile-time-regressions/#running-git-bisect.
- With a (minimal) repro command and a culprit commit, run your choice of profiling: -mlir-timing, perf, Tracy, etc. (see the sketch below).
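
As a concrete starting point, here is a minimal sketch of that profiling step. It assumes iree-compile forwards MLIR's -mlir-timing flag (the flag named above), and slow_dispatch.mlir is a hypothetical name for one of the narrowed-down executable sources:

```bash
# Minimal sketch: time the per-pass cost of compiling one slow dispatch.
# slow_dispatch.mlir is a placeholder for a dumped executable source.
iree-compile slow_dispatch.mlir \
  --compile-mode=hal-executable \
  --iree-hal-target-backends=llvm-cpu \
  -mlir-timing \
  -o /dev/null
# The timing report goes to stderr; the slowest passes are the first place
# to look before reaching for perf or Tracy.
```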


Intuition / predictions: I expect that a trace will look like https://github.com/openxla/iree/issues/13189#issuecomment-1516682377, but it would be good to know what code patterns are causing the compiler to generate so much code.

ScottTodd commented 10 months ago

I've started debugging this, following the steps listed above.

Generate the test .mlir using generate_e2e_matmul_tests.py:

```bash
python D:/dev/projects/iree/tests/e2e/matmul/generate_e2e_matmul_tests.py \
  --output_code=D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir \
  --output_trace=D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.yaml \
  --module_path=D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.vmfb \
  --lhs_rhs_type=f16 --acc_type=f16 --shapes=large
```

Compile to executable sources, then compile each of those with a timeout:

```bash
TIMEOUT_SECONDS_FOR_COMPILING_EACH_SOURCE=10
SOURCES_DIR=D:/dev/projects/iree-build/tests/e2e/matmul/sources

D:/dev/projects/iree-build/tools/iree-compile \
  --iree-hal-target-backends=llvm-cpu \
  --iree-opt-data-tiling=false \
  --iree-llvmcpu-enable-ukernels=none \
  --iree-hal-dump-executable-sources-to=${SOURCES_DIR} \
  --compile-to=executable-sources \
  D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir \
  -o /dev/null

# Run the compiler again on each executable individually.
echo "** Running iree-compile for each executable source **"
SOURCES=($(ls -1 ${SOURCES_DIR}))
for SOURCE in "${SOURCES[@]}"; do
  echo "  * Compiling: ${SOURCE} *"
  /usr/bin/timeout --verbose ${TIMEOUT_SECONDS_FOR_COMPILING_EACH_SOURCE} \
  D:/dev/projects/iree-build/tools/iree-compile \
    ${SOURCES_DIR}/${SOURCE} \
    --iree-hal-target-backends=llvm-cpu \
    --iree-opt-data-tiling=false \
    --iree-llvmcpu-enable-ukernels=none \
    --compile-mode=hal-executable \
    -o /dev/null
done
```

```
λ bash D:\dev\scratch\iree-tmp\compile_time_regression_issue_15800.sh
** Running iree-compile for each executable source **
  * Compiling: module_matmul_1000x1000xf16_times_1000x1xf16_into_1000x1xf16_dispatch_0.mlir *
  * Compiling: module_matmul_654x321xf16_times_321x234xf16_into_654x234xf16_dispatch_0.mlir *
  * Compiling: module_matmul_accumulate_1000x1000xf16_times_1000x1xf16_into_1000x1xf16_dispatch_0.mlir *
  * Compiling: module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir *
/usr/bin/timeout: sending signal TERM to command ‘D:/dev/projects/iree-build/tools/iree-compile’
  * Compiling: module_matmul_accumulate_1x1000xf16_times_1000x1000xf16_into_1x1000xf16_dispatch_0.mlir *
  * Compiling: module_matmul_accumulate_DYNxDYNxf16_times_DYNxDYNxf16_into_DYNxDYNxf16_dispatch_0.mlir *
/usr/bin/timeout: sending signal TERM to command ‘D:/dev/projects/iree-build/tools/iree-compile’
  * Compiling: module_matmul_DYNxDYNxf16_times_DYNxDYNxf16_into_DYNxDYNxf16_dispatch_0.mlir *
/usr/bin/timeout: sending signal TERM to command ‘D:/dev/projects/iree-build/tools/iree-compile’
```

This gist has the generated e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir as well as the sources that timed out: https://gist.github.com/ScottTodd/e7286e020554caa5f57f444c7c0a34d4

Next steps

ScottTodd commented 10 months ago

I'm running at https://github.com/openxla/iree/commit/15ca58e19ec76fab94c4aba8f75091c532282d51 (culprit code for the compile time regression is still included in my build) on Windows, RelWithDebInfo, no sanitizers enabled.

Diffing the IR with and without --iree-opt-data-tiling=false --iree-llvmcpu-enable-ukernels=none: the slow compile path (left, data tiling and ukernels disabled via those flags) is one executable source containing a linalg.matmul, while the fast compile path (right, with data tiling and ukernels) is 5 executable sources: 3 pack, 1 mmt4d, 1 unpack:

(screenshot: side-by-side diff of the generated executable sources)
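
For reference, a sketch of how the two sets of executable sources being compared here can be produced, reusing the dump command from the repro script above (the sources_nodt / sources_dt directory names are arbitrary, and the fast-path flag defaults may differ from what was actually used):

```bash
INPUT=D:/dev/projects/iree-build/tests/e2e/matmul/e2e_matmul_nondt_f16_f16_large_llvm-cpu_local-task.mlir
IREE_COMPILE=D:/dev/projects/iree-build/tools/iree-compile

# Slow path: data tiling and ukernels disabled.
${IREE_COMPILE} ${INPUT} \
  --iree-hal-target-backends=llvm-cpu \
  --iree-opt-data-tiling=false \
  --iree-llvmcpu-enable-ukernels=none \
  --iree-hal-dump-executable-sources-to=sources_nodt \
  --compile-to=executable-sources -o /dev/null

# Fast path: drop the two disabling flags so data tiling and ukernels stay on.
${IREE_COMPILE} ${INPUT} \
  --iree-hal-target-backends=llvm-cpu \
  --iree-hal-dump-executable-sources-to=sources_dt \
  --compile-to=executable-sources -o /dev/null

diff -r sources_nodt sources_dt
```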

The slow matmul codegen has this IR at --mlir-print-ir-before=iree-hal-serialize-target-executables: https://gist.github.com/ScottTodd/76cd2d9f76cf5461c2900320b6f34269
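
Since --mlir-print-ir-before writes to stderr, that dump can be captured for one dispatch roughly like this (a sketch reusing the flags and SOURCES_DIR from the repro script above; ir_before_serialize.mlir is an arbitrary output name):

```bash
D:/dev/projects/iree-build/tools/iree-compile \
  ${SOURCES_DIR}/module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir \
  --compile-mode=hal-executable \
  --iree-hal-target-backends=llvm-cpu \
  --iree-opt-data-tiling=false \
  --iree-llvmcpu-enable-ukernels=none \
  --mlir-print-ir-before=iree-hal-serialize-target-executables \
  -o /dev/null 2> ir_before_serialize.mlir
```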

Trace looks like this: (Tracy trace screenshot)

Sampling shows LLVM codegen doing lots of work: (sampling profile screenshot)

dcaballe commented 10 months ago

Thanks a lot, Scott! Reading your report brought something to mind... Could you check if https://github.com/openxla/iree/pull/15819 significantly reduces the compile time for this case?

ScottTodd commented 10 months ago

Here is some --mlir-print-ir-before=iree-hal-serialize-target-executables IR from just before the regression: https://gist.github.com/ScottTodd/7e236e34dc5ad001cef51d1883be21b4 (compare with https://gist.github.com/ScottTodd/76cd2d9f76cf5461c2900320b6f34269, which is after the regression)

ScottTodd commented 10 months ago

So the executable handed off to LLVM went from ~600 MLIR ops to ~1400 ops, and the trace time from ~5 seconds to ~14 seconds (roughly linear in the number of ops?).
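
For anyone reproducing these numbers, one rough way to approximate the op counts in the two IR dumps (not how they were measured here, just a sketch; before_regression.mlir and after_regression.mlir are placeholders for the two gists linked above) is to count lines that define an SSA value, which undercounts ops with no results:

```bash
grep -c '^ *%' before_regression.mlir
grep -c '^ *%' after_regression.mlir
```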

I'll run with that PR next, sure.

ScottTodd commented 10 months ago

Hey @dcaballe - yes, for the specific case I've been profiling (module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir) that PR does help significantly.

IR with --mlir-print-ir-before=iree-hal-serialize-target-executables (compare with above): https://gist.github.com/ScottTodd/9ca792a3047138826aa79c2a377a3846

~500 ops, ~2.2s in Tracy: (Tracy screenshot)

ScottTodd commented 10 months ago

Would anyone like any .mlir files, traces, etc. captured while I'm set up to collect them? Can I pass this issue back to @hanhanW / @dcaballe ?

Also still thinking about ways to send a warning/error signal when compilation (and possibly runtime as well) is going to be slow (https://github.com/openxla/iree/issues/13207)...

hanhanW commented 10 months ago

It would be good if you could attach the module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir file to this issue. I can take a look when I have some cycles.

ScottTodd commented 10 months ago

That's in one of the gists above; here it is inlined:

```mlir
// module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir
hal.executable public @matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0 {
  hal.executable.variant public @embedded_elf_x86_64 target(<"llvm-cpu", "embedded-elf-x86_64", {cpu = "generic", cpu_features = "", data_layout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-i128:128-f80:128-n8:16:32:64-S128", native_vector_size = 16 : index, target_triple = "x86_64-unknown-unknown-eabi-elf", ukernels = "none"}>) {
    hal.executable.export public @matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0_matmul_123x789x456_f16 ordinal(0) layout(#hal.pipeline.layout<push_constants = 0, sets = [<0, bindings = [<0, storage_buffer, ReadOnly>, <1, storage_buffer, ReadOnly>, <2, storage_buffer>]>]>) {
    ^bb0(%arg0: !hal.device):
      %x, %y, %z = flow.dispatch.workgroup_count_from_slice 
      hal.return %x, %y, %z : index, index, index
    }
    builtin.module {
      func.func @matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0_matmul_123x789x456_f16() {
        %c0 = arith.constant 0 : index
        %0 = hal.interface.binding.subspan set(0) binding(0) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<123x456xf16>>
        %1 = hal.interface.binding.subspan set(0) binding(1) type(storage_buffer) alignment(64) offset(%c0) flags(ReadOnly) : !flow.dispatch.tensor<readonly:tensor<456x789xf16>>
        %2 = hal.interface.binding.subspan set(0) binding(2) type(storage_buffer) alignment(64) offset(%c0) : !flow.dispatch.tensor<readwrite:tensor<123x789xf16>>
        %3 = flow.dispatch.tensor.load %0, offsets = [0, 0], sizes = [123, 456], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<123x456xf16>> -> tensor<123x456xf16>
        %4 = flow.dispatch.tensor.load %1, offsets = [0, 0], sizes = [456, 789], strides = [1, 1] : !flow.dispatch.tensor<readonly:tensor<456x789xf16>> -> tensor<456x789xf16>
        %5 = flow.dispatch.tensor.load %2, offsets = [0, 0], sizes = [123, 789], strides = [1, 1] : !flow.dispatch.tensor<readwrite:tensor<123x789xf16>> -> tensor<123x789xf16>
        %6 = linalg.matmul ins(%3, %4 : tensor<123x456xf16>, tensor<456x789xf16>) outs(%5 : tensor<123x789xf16>) -> tensor<123x789xf16>
        flow.dispatch.tensor.store %6, %2, offsets = [0, 0], sizes = [123, 789], strides = [1, 1] : tensor<123x789xf16> -> !flow.dispatch.tensor<readwrite:tensor<123x789xf16>>
        return
      }
    }
  }
}
```
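
For completeness, this dispatch alone can be recompiled with the same flags as the repro script above (a sketch, assuming iree-compile is on PATH and the snippet is saved locally under the name in the comment):

```bash
iree-compile \
  module_matmul_accumulate_123x456xf16_times_456x789xf16_into_123x789xf16_dispatch_0.mlir \
  --compile-mode=hal-executable \
  --iree-hal-target-backends=llvm-cpu \
  --iree-opt-data-tiling=false \
  --iree-llvmcpu-enable-ukernels=none \
  -o /dev/null
```
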
hanhanW commented 10 months ago

okay, thank you!