intel / graph-compiler

MLIR-based toolkit targeting intel heterogeneous hardware
Apache License 2.0

OV single matmul case fails #382

Open WangJialei-A opened 1 day ago

WangJialei-A commented 1 day ago

Generate case:

```sh
python3 ./tools/mlir_bench/ov_model_gen.py -l='linear[512,512,512]' -t f16 -n test.xml
```

Reproduce:

```sh
OV_MLIR_MODE=GC_GPU ./bin/intel64/Release/benchmark_app -m ./test.xml -d GPU -use_device_mem -ip f16 -infer_precision f16 -niter 100 -hint none -nstreams 1 -nthreads 1
```

Error log:

```
loc(fused<{name = "aten::linear/Add", type = "Add"}>["aten::linear/Add"]): error: operand #0 does not dominate this use
Error:  [build/_deps/gc-src/lib/gc/ExecutionEngine/GPURuntime/ocl/GpuOclRuntime.cpp:852] GPU pipeline failed!
Expected<T> must be checked before access or destruction.
Unchecked Expected<T> contained error:
GPU pipeline failed!
/home/ghrunner/_work/_temp/41f17c38-0f55-4524-907e-899163017e00.sh: line 2: 70842 Aborted                 (core dumped)
```
dchigarev commented 13 hours ago

The root cause of this failure is similar to #360 (a memref.alloc() inside the GPU kernel); however, the fix figured out in #360 won't work for this case :(

The MLIR generated by OV for the model from the reproducer (#382) looks like this:

MLIR with linalg.broadcast:

```mlir
module @fragment_name attributes {"#dlti.sys_spec" = #dlti.target_system_spec<"CPU" : #dlti.target_device_spec<#dlti.dl_entry<"tile_size", 32 : i32>>>} {
  func.func @entry(%arg0: memref<512x512xf16>, %arg1: memref<512x512xf16>, %arg2: memref<1x512xf16>, %arg3: memref<512x512xf16>) {
    %0 = bufferization.to_tensor %arg0 restrict : memref<512x512xf16>
    %1 = bufferization.to_tensor %arg1 restrict : memref<512x512xf16>
    %2 = bufferization.to_tensor %arg2 restrict : memref<1x512xf16>
    %3 = tensor.empty() : tensor<512x512xf16>
    %cst = arith.constant 0.000000e+00 : f16
    %4 = linalg.fill ins(%cst : f16) outs(%3 : tensor<512x512xf16>) -> tensor<512x512xf16>
    %5 = linalg.matmul_transpose_b ins(%0, %1 : tensor<512x512xf16>, tensor<512x512xf16>) outs(%4 : tensor<512x512xf16>) -> tensor<512x512xf16>
    %collapsed = tensor.collapse_shape %2 [[0, 1]] : tensor<1x512xf16> into tensor<512xf16>
    %6 = tensor.empty() : tensor<512x512xf16>
    %broadcasted = linalg.broadcast ins(%collapsed : tensor<512xf16>) outs(%6 : tensor<512x512xf16>) dimensions = [0]
    %7 = tensor.empty() : tensor<512x512xf16>
    %8 = linalg.add ins(%5, %broadcasted : tensor<512x512xf16>, tensor<512x512xf16>) outs(%7 : tensor<512x512xf16>) -> tensor<512x512xf16>
    bufferization.materialize_in_destination %8 in restrict writable %arg3 : (tensor<512x512xf16>, memref<512x512xf16>) -> ()
    return
  }
}
```

The main difference from the modules we've been processing before is that this one takes the bias as a 1D tensor (%arg2 : tensor<1x512xf16>, collapsed to tensor<512xf16>) that has to be broadcast to the proper shape (512x512). linalg.broadcast is the operation that triggers a memref.alloc() to be inserted into the GPU kernel:

MLIR module after the 'one-shot-bufferize' pass:

```mlir
func.func @entry(%arg0: memref<512x512xf16>, %arg1: memref<512x512xf16>, %arg2: memref<1x512xf16>, %arg3: memref<512x512xf16>, %arg4: memref) {
  %cst = arith.constant 0.000000e+00 : f16
  %alloc = memref.alloc() {alignment = 64 : i64} : memref<512x512xf16>
  %collapse_shape = memref.collapse_shape %arg2 [[0, 1]] : memref<1x512xf16> into memref<512xf16>
  %c0 = arith.constant 0 : index
  %c0_0 = arith.constant 0 : index
  %c512 = arith.constant 512 : index
  %c512_1 = arith.constant 512 : index
  %c32 = arith.constant 32 : index
  %c32_2 = arith.constant 32 : index
  scf.parallel (%arg5, %arg6) = (%c0, %c0_0) to (%c512, %c512_1) step (%c32, %c32_2) {
    %subview = memref.subview %arg0[%arg5, 0] [32, 512] [1, 1] : memref<512x512xf16> to memref<32x512xf16, strided<[512, 1], offset: ?>>
    %subview_3 = memref.subview %arg1[%arg6, 0] [32, 512] [1, 1] : memref<512x512xf16> to memref<32x512xf16, strided<[512, 1], offset: ?>>
    %subview_4 = memref.subview %alloc[%arg5, %arg6] [32, 32] [1, 1] : memref<512x512xf16> to memref<32x32xf16, strided<[512, 1], offset: ?>>
    linalg.fill ins(%cst : f16) outs(%subview_4 : memref<32x32xf16, strided<[512, 1], offset: ?>>)
    linalg.matmul_transpose_b ins(%subview, %subview_3 : memref<32x512xf16, strided<[512, 1], offset: ?>>, memref<32x512xf16, strided<[512, 1], offset: ?>>) outs(%subview_4 : memref<32x32xf16, strided<[512, 1], offset: ?>>)
    %subview_5 = memref.subview %collapse_shape[%arg6] [32] [1] : memref<512xf16> to memref<32xf16, strided<[1], offset: ?>>
    %alloc_6 = memref.alloc() {alignment = 64 : i64} : memref<32x32xf16> // allocating buffer for the result of 'linalg.broadcast'
    linalg.broadcast ins(%subview_5 : memref<32xf16, strided<[1], offset: ?>>) outs(%alloc_6 : memref<32x32xf16>) dimensions = [0]
    %subview_7 = memref.subview %arg3[%arg5, %arg6] [32, 32] [1, 1] : memref<512x512xf16> to memref<32x32xf16, strided<[512, 1], offset: ?>>
    linalg.add ins(%subview_4, %alloc_6 : memref<32x32xf16, strided<[512, 1], offset: ?>>, memref<32x32xf16>) outs(%subview_7 : memref<32x32xf16, strided<[512, 1], offset: ?>>)
    memref.dealloc %alloc_6 : memref<32x32xf16>
    scf.reduce
  }
  memref.dealloc %alloc : memref<512x512xf16>
  return
}
```

We can potentially lower linalg.broadcast to vector.broadcast in the linalg-to-xegpu pass like this and get rid of the unnecessary allocation:

Lowering 'linalg.broadcast' to xegpu:

```mlir
%subview_5 = memref.subview %collapse_shape[%arg6] [32] [1] : memref<512xf16> to memref<32xf16, strided<[1], offset: ?>>
// %alloc_6 = memref.alloc() {alignment = 64 : i64} : memref<32x32xf16> // allocating buffer for the result of 'linalg.broadcast'
// linalg.broadcast ins(%subview_5 : memref<32xf16, strided<[1], offset: ?>>) outs(%alloc_6 : memref<32x32xf16>) dimensions = [0]
%expand_shape = memref.expand_shape %subview_5 [[0, 1]] output_shape [32, 1] : memref<32xf16, strided<[1], offset: ?>> into memref<32x1xf16, strided<[1, 1], offset: ?>>
%51 = xegpu.create_nd_tdesc %expand_shape[%c0, %c0] : memref<32x1xf16, strided<[1, 1], offset: ?>> -> !xegpu.tensor_desc<32x1xf16, #xegpu.block_tdesc_attr>
%52 = xegpu.update_nd_offset %51, [%c0, %c0] : !xegpu.tensor_desc<32x1xf16, #xegpu.block_tdesc_attr>
%53 = xegpu.load_nd %52 : !xegpu.tensor_desc<32x1xf16, #xegpu.block_tdesc_attr> -> vector<32x1xf16>
%54 = vector.broadcast %53 : vector<32x1xf16> to vector<32x32xf16>
```

However, the chain of memref.subview + memref.expand_shape cannot be folded into xegpu.create_nd_tdesc + xegpu.update_nd_offset by the xegpu-fold-alias-ops pass, which eventually causes the pipeline to fail in the imex-convert-gpu-to-spirv pass (error: failed to legalize operation 'memref.subview').
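
For context, here is a rough sketch of why the extra op matters (illustrative IR based on my reading of the pass, not actual pipeline output): xegpu-fold-alias-ops can absorb a plain memref.subview into the offsets of xegpu.create_nd_tdesc, but it has no pattern for a memref.expand_shape sitting between the subview and the descriptor, so both aliasing ops survive into the GPU module and reach the SPIR-V conversion:

```mlir
// Foldable case: the subview feeds create_nd_tdesc directly, so its offsets
// can be absorbed into the descriptor (illustrative shapes and values).
%sv = memref.subview %src[%i, %j] [32, 32] [1, 1] : memref<512x512xf16> to memref<32x32xf16, strided<[512, 1], offset: ?>>
%desc = xegpu.create_nd_tdesc %sv[%c0, %c0] : memref<32x32xf16, strided<[512, 1], offset: ?>> -> !xegpu.tensor_desc<32x32xf16>
// ... which becomes roughly:
%desc_folded = xegpu.create_nd_tdesc %src[%i, %j] : memref<512x512xf16> -> !xegpu.tensor_desc<32x32xf16>

// Non-foldable case (the bias above): the expand_shape sits between the subview
// and create_nd_tdesc, so both memref ops are left for imex-convert-gpu-to-spirv,
// which has no lowering for them.
%sv1 = memref.subview %collapse_shape[%arg6] [32] [1] : memref<512xf16> to memref<32xf16, strided<[1], offset: ?>>
%e = memref.expand_shape %sv1 [[0, 1]] output_shape [32, 1] : memref<32xf16, strided<[1], offset: ?>> into memref<32x1xf16, strided<[1, 1], offset: ?>>
%desc_bias = xegpu.create_nd_tdesc %e[%c0, %c0] : memref<32x1xf16, strided<[1, 1], offset: ?>> -> !xegpu.tensor_desc<32x1xf16>
```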

In order to fix that problem, we have to get rid of the memref.collapse_shape (outside the loop) and the memref.expand_shape (inside the loop). This should probably be a separate pass, since this logic doesn't seem related to the linalg-to-xegpu lowering (or maybe there's already an upstream pass that does the same?):

Getting rid of collapse_shape + expand_shape:

```mlir
// ORIGINAL:
func.func @entry(%arg0: memref<512x512xf16>, %arg1: memref<512x512xf16>, %arg2: memref<1x512xf16>, %arg3: memref<512x512xf16>, %arg4: memref) {
  ...
  // have to get rid of this collapse_shape
  %collapse_shape = memref.collapse_shape %arg2 [[0, 1]] : memref<1x512xf16> into memref<512xf16>
  scf.parallel (%arg5, %arg6) = (%c0, %c0) to (%c512, %c512) step (%c32, %c32) {
    ...
    %subview_2 = memref.subview %collapse_shape[%arg6] [32] [1] : memref<512xf16> to memref<32xf16, strided<[1], offset: ?>>
    // have to get rid of this expand_shape
    %expand_shape = memref.expand_shape %subview_2 [[0, 1]] output_shape [32, 1] : memref<32xf16, strided<[1], offset: ?>> into memref<32x1xf16, strided<[1, 1], offset: ?>>
    %51 = xegpu.create_nd_tdesc %expand_shape[%c0, %c0] : memref<32x1xf16, strided<[1, 1], offset: ?>> -> !xegpu.tensor_desc<32x1xf16, #xegpu.block_tdesc_attr>
    %52 = xegpu.update_nd_offset %51, [%c0, %c0] : !xegpu.tensor_desc<32x1xf16, #xegpu.block_tdesc_attr>
    %53 = xegpu.load_nd %52 : !xegpu.tensor_desc<32x1xf16, #xegpu.block_tdesc_attr> -> vector<32x1xf16>
    %54 = vector.broadcast %53 : vector<32x1xf16> to vector<32x32xf16>
    ...
  }
  return
}

// EXPECTED:
func.func @entry(%arg0: memref<512x512xf16>, %arg1: memref<512x512xf16>, %arg2: memref<1x512xf16>, %arg3: memref<512x512xf16>, %arg4: memref) {
  ...
  // removed collapse_shape
  // %collapse_shape = memref.collapse_shape %arg2 [[0, 1]] : memref<1x512xf16> into memref<512xf16>
  scf.parallel (%arg5, %arg6) = (%c0, %c0) to (%c512, %c512) step (%c32, %c32) {
    ...
    // broadcast %arg2 directly, no need to call 'expand_shape'
    %subview_2 = memref.subview %arg2[%c0, %arg6] [1, 32] [1, 1] : memref<1x512xf16> to memref<1x32xf16, strided<[512, 1], offset: ?>>
    %51 = xegpu.create_nd_tdesc %subview_2[%c0, %c0] : memref<1x32xf16, strided<[512, 1], offset: ?>> -> !xegpu.tensor_desc<1x32xf16, #xegpu.block_tdesc_attr>
    %52 = xegpu.update_nd_offset %51, [%c0, %c0] : !xegpu.tensor_desc<1x32xf16, #xegpu.block_tdesc_attr>
    %53 = xegpu.load_nd %52 : !xegpu.tensor_desc<1x32xf16, #xegpu.block_tdesc_attr> -> vector<1x32xf16>
    %54 = vector.broadcast %53 : vector<1x32xf16> to vector<32x32xf16>
    ...
  }
  return
}
```
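
If that rewrite is in place, I'd expect the remaining plain memref.subview of %arg2 to be foldable by xegpu-fold-alias-ops into the descriptor itself, so no memref aliasing ops would be left for imex-convert-gpu-to-spirv to reject. A hypothetical end state, under the same assumptions as the sketch above:

```mlir
// Hypothetical IR after xegpu-fold-alias-ops runs on the EXPECTED form:
// the subview's offsets are absorbed into a descriptor built on %arg2 directly.
%51 = xegpu.create_nd_tdesc %arg2[%c0, %arg6] : memref<1x512xf16> -> !xegpu.tensor_desc<1x32xf16>
%53 = xegpu.load_nd %51 : !xegpu.tensor_desc<1x32xf16> -> vector<1x32xf16>
%54 = vector.broadcast %53 : vector<1x32xf16> to vector<32x32xf16>
```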

cc @kurapov-peter