hanhanW opened 1 week ago
I'm not able to create a repro because it looks like we can handle the case at the codegen level. @lialan, can you help add `createMaterializeEncodingIntoNopPass` to all the other backends?
The goal of this issue is to make everything work when we turn off the early materialization pass.
There is a separate issue besides the nop pass. The issue I hit is in `linalg_quantized_matmul_vs_linalg_matmul.mlir`: the upstream linalg shape inference drops the encodings, which looks incorrect to me. @lialan, can you help investigate further and fix it?
To repro:

```shell
iree-compile --output-format=vm-bytecode --iree-hal-target-backends=llvm-cpu tests/e2e/regression/linalg_quantized_matmul_vs_linalg_matmul.mlir -o /tmp/a.vmfb --iree-global-opt-enable-early-materialization=false
```
(cc @bjacob )
This is the IR before and after canonicalization: https://gist.github.com/hanhanW/959cf2809098c3485ee1ebd6394e5836
Looking at the `check_one_quantized_matmul_as_matmul_dynamic` function, shape inference creates a `tensor.cast` because it does not take encodings into account.
Before:

```mlir
%6 = iree_encoding.set_encoding %0 : tensor<?x?xi8> -> tensor<?x?xi8, #iree_encoding.encoding<role = LHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
%7 = iree_encoding.set_encoding %1 : tensor<?x?xi8> -> tensor<?x?xi8, #iree_encoding.encoding<role = RHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
%8 = tensor.empty(%c3, %c5) : tensor<?x?xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
%9 = linalg.fill ins(%c0_i32 : i32) outs(%8 : tensor<?x?xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) -> tensor<?x?xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
%10 = linalg.matmul ins(%6, %7 : tensor<?x?xi8, #iree_encoding.encoding<role = LHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>, tensor<?x?xi8, #iree_encoding.encoding<role = RHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) outs(%9 : tensor<?x?xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) -> tensor<?x?xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
```
After:

```mlir
%5 = iree_encoding.set_encoding %0 : tensor<?x?xi8> -> tensor<?x?xi8, #iree_encoding.encoding<role = LHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
%6 = iree_encoding.set_encoding %1 : tensor<?x?xi8> -> tensor<?x?xi8, #iree_encoding.encoding<role = RHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
%7 = tensor.empty() : tensor<3x5xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
%8 = linalg.fill ins(%c0_i32 : i32) outs(%7 : tensor<3x5xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) -> tensor<3x5xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
%cast_2 = tensor.cast %5 : tensor<?x?xi8, #iree_encoding.encoding<role = LHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>> to tensor<3x?xi8>
%cast_3 = tensor.cast %6 : tensor<?x?xi8, #iree_encoding.encoding<role = RHS, element_types = [i8, i8, i32], original_type = tensor<?x?xi8>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>> to tensor<?x5xi8>
%9 = linalg.matmul ins(%cast_2, %cast_3 : tensor<3x?xi8>, tensor<?x5xi8>) outs(%8 : tensor<3x5xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>) -> tensor<3x5xi32, #iree_encoding.encoding<role = RESULT, element_types = [i8, i8, i32], original_type = tensor<?x?xi32>, user_indexing_maps = [#map, #map1, #map2], round_dims_to = array<i64: 16, 16, 16>>>
```
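Note that the `tensor.cast` ops above drop the `#iree_encoding.encoding` attribute entirely. An encoding-aware cast folding would only refine the static shape while keeping the encoding on the result type. A hypothetical sketch of what that would look like (not output of any current pass; the encoding body is elided with `...` for readability):

```mlir
// Hypothetical encoding-preserving cast: the dynamic dim %c3 is refined
// to the static size 3, but the encoding attribute stays on the result.
%cast_2 = tensor.cast %5
    : tensor<?x?xi8, #iree_encoding.encoding<role = LHS, ...>>
    to tensor<3x?xi8, #iree_encoding.encoding<role = LHS, ...>>
```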
To integrate data-tiling with multi-device and heterogeneous computing, we need to disable the early materialization pass in the GlobalOptimization phase. We are also going to move `set_encoding` to a stage after dispatch formation, where the early materialization pass won't work in many cases. To complete data-tiling support for all the other backends, we add `MaterializeEncodingIntoNopPass` to their pipelines. This is what `MaterializeHomogeneousEncodingsPass` does today, and we should be able to defer it to codegen for the other pipelines.
https://github.com/iree-org/iree/blob/ac418d1f45d562bf9e9675bf69606c7d718e2432/compiler/src/iree/compiler/GlobalOptimization/MaterializeHomogeneousEncodings.cpp#L38-L45
E.g., on the CPU side, it's added to `buildLLVMCPUCodegenConfigurationPassPipelineImpl`:
https://github.com/iree-org/iree/blob/ac418d1f45d562bf9e9675bf69606c7d718e2432/compiler/src/iree/compiler/Codegen/LLVMCPU/Passes.cpp#L752-L765
We can do the same for the other backends. E.g., on the LLVMGPU side, it would go here:
https://github.com/iree-org/iree/blob/ac418d1f45d562bf9e9675bf69606c7d718e2432/compiler/src/iree/compiler/Codegen/LLVMGPU/Passes.cpp#L1041-L1051
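Roughly, the change would mirror the CPU pipeline. A minimal sketch, assuming the GPU configuration pipeline follows the same shape as the linked CPU code (the exact function name and insertion point should be taken from the linked revision, not from this snippet):

```cpp
// Sketch only: add the nop encoding materialization at the start of the
// LLVMGPU configuration pipeline, mirroring the LLVMCPU pipeline. This
// erases any leftover encodings so IR produced with
// --iree-global-opt-enable-early-materialization=false still compiles.
void buildLLVMGPUCodegenConfigurationPassPipeline(
    OpPassManager &modulePassManager) {
  modulePassManager.addNestedPass<func::FuncOp>(
      createMaterializeEncodingIntoNopPass());
  // ... existing configuration passes ...
}
```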
Note: this also needs to be done for the VMVX and SPIR-V backends. As mentioned in the title, this needs to be done for all the backends.
This is an incremental step to enable gpu data-tiling.