iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[DT] Turn encodings into nops for all backends that do not yet support data-tiling #17719

Open hanhanW opened 1 week ago

hanhanW commented 1 week ago

To integrate data-tiling with multi-device and heterogeneous computing, we need to disable the early materialization pass in the GlobalOptimization phase. We are also going to move set_encoding to the stage after dispatch formation, and the early materialization pass won't work in many of those cases. To complete data-tiling support for all the other backends, we should add MaterializeEncodingIntoNopPass to their pipelines. This is what MaterializeHomogeneousEncodingsPass does today, and we should be able to defer it to codegen for the other pipelines.

https://github.com/iree-org/iree/blob/ac418d1f45d562bf9e9675bf69606c7d718e2432/compiler/src/iree/compiler/GlobalOptimization/MaterializeHomogeneousEncodings.cpp#L38-L45

E.g., on CPU side, it's added to buildLLVMCPUCodegenConfigurationPassPipelineImpl

https://github.com/iree-org/iree/blob/ac418d1f45d562bf9e9675bf69606c7d718e2432/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#L752-L765

We can do the same for other backends. E.g., on LLVMGPU side, it'd be:

https://github.com/iree-org/iree/blob/ac418d1f45d562bf9e9675bf69606c7d718e2432/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp#L1041-L1051

Note: this also needs to be done for the VMVX and SPIR-V backends. As the title says, it needs to be done for all the backends.
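For context, here is a minimal sketch of the effect the nop materialization is expected to have (the shapes, element types, and the elided encoding fields are hypothetical, not taken from an actual test):

    // Before the pass: an encoded tensor type flows through the program.
    %enc = iree_encoding.set_encoding %src : tensor<?x?xf32>
        -> tensor<?x?xf32, #iree_encoding.encoding<role = LHS, ...>>

    // After MaterializeEncodingIntoNop: the encoding is erased from the
    // type, set_encoding folds to the identity, and downstream users see
    // the plain tensor<?x?xf32> again, so backends without data-tiling
    // support compile the program unchanged.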

This is an incremental step to enable gpu data-tiling.

hanhanW commented 1 week ago

I'm not able to create a repro, because it looks like we can handle the case at the codegen level. @lialan can you help add createMaterializeEncodingIntoNopPass to all the other backends?

The goal of this issue is to make everything work when we turn off the early materialization pass:

https://github.com/iree-org/iree/blob/fe571e4d5efde141ec437cb4699307df67a38b9c/compiler/src/iree/compiler/GlobalOptimization/Passes.cpp#L38-L46

There is a separate issue besides the nop pass. The problem I hit is in linalg_quantized_matmul_vs_linalg_matmul.mlir: it looks like the upstream linalg shape inference drops the encodings, which seems incorrect to me. @lialan can you help fix it and investigate further?

To repro: iree-compile --output-format=vm-bytecode --iree-hal-target-backends=llvm-cpu tests/e2e/regression/linalg_quantized_matmul_vs_linalg_matmul.mlir -o /tmp/a.vmfb --iree-global-opt-enable-early-materialization=false

(cc @bjacob )

hanhanW commented 1 week ago

This is the IR before and after canonicalization: https://gist.github.com/hanhanW/959cf2809098c3485ee1ebd6394e5836 Looking at the check_one_quantized_matmul_as_matmul_dynamic function, shape inference creates tensor.cast ops that drop the encoding, because it does not take encodings into account.

Before:

    %6 = iree_encoding.set_encoding %0 : tensor<?x?xi8> -> tensor<?x?xi8, #iree_encoding.encoding<role =  LHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
    %7 = iree_encoding.set_encoding %1 : tensor<?x?xi8> -> tensor<?x?xi8, #iree_encoding.encoding<role =  RHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
    %8 = tensor.empty(%c3, %c5) : tensor<?x?xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
    %9 = linalg.fill ins(%c0_i32 : i32) outs(%8 : tensor<?x?xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) -> tensor<?x?xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
    %10 = linalg.matmul ins(%6, %7 : tensor<?x?xi8, #iree_encoding.encoding<role =  LHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>, tensor<?x?xi8, #iree_encoding.encoding<role =  RHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) outs(%9 : tensor<?x?xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) -> tensor<?x?xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>

After:

    %5 = iree_encoding.set_encoding %0 : tensor<?x?xi8> -> tensor<?x?xi8, #iree_encoding.encoding<role =  LHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
    %6 = iree_encoding.set_encoding %1 : tensor<?x?xi8> -> tensor<?x?xi8, #iree_encoding.encoding<role =  RHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
    %7 = tensor.empty() : tensor<3x5xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
    %8 = linalg.fill ins(%c0_i32 : i32) outs(%7 : tensor<3x5xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) -> tensor<3x5xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
    %cast_2 = tensor.cast %5 : tensor<?x?xi8, #iree_encoding.encoding<role =  LHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>> to tensor<3x?xi8>
    %cast_3 = tensor.cast %6 : tensor<?x?xi8, #iree_encoding.encoding<role =  RHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>> to tensor<?x5xi8>
    %9 = linalg.matmul ins(%cast_2, %cast_3 : tensor<3x?xi8>, tensor<?x5xi8>) outs(%8 : tensor<3x5xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) -> tensor<3x5xi32, #iree_encoding.encoding<role =  RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
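In other words, when shape inference refines a dynamic dimension to a static one, the tensor.cast it creates silently strips the encoding from the result type. A cast that respected encodings would presumably keep the attribute on the refined type; as a sketch (with #enc abbreviating the full LHS encoding attribute above):

    // What canonicalization currently produces: the ? dim is refined to 3,
    // but the encoding attribute is dropped from the result type.
    %cast = tensor.cast %5 : tensor<?x?xi8, #enc> to tensor<3x?xi8>

    // What one would expect instead: the refined type keeps the encoding.
    %cast = tensor.cast %5 : tensor<?x?xi8, #enc> to tensor<3x?xi8, #enc>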