intel / graph-compiler

MLIR-based toolkit targeting Intel heterogeneous hardware
Apache License 2.0

`memref.alloc()` on dynamic shape tensor cannot be successfully lowered #377

Open yifeizh2 opened 2 weeks ago

yifeizh2 commented 2 weeks ago

Encountered with the following matmul config:

```mlir
module attributes {dlti.target_system_spec = #dlti.target_system_spec<"CPU" : #dlti.target_device_spec<#dlti.dl_entry<"L1_cache_size_in_bytes", 49152 : ui32>, #dlti.dl_entry<"L2_cache_size_in_bytes", 2097152 : ui64>, #dlti.dl_entry<"L3_cache_size_in_bytes", 110100480 : ui64>, #dlti.dl_entry<"num_threads", 56 : i32>, #dlti.dl_entry<"max_vector_width", 512 : i64>>>} {
  func.func @entry(%arg0: tensor<128x512xbf16>, %arg1: tensor<512x1024xbf16>) -> tensor<128x1024xbf16> attributes {llvm.emit_c_interface} {
    %cst = arith.constant 0.000000e+00 : bf16
    %0 = tensor.empty() : tensor<128x1024xbf16>
    %1 = linalg.fill ins(%cst : bf16) outs(%0 : tensor<128x1024xbf16>) -> tensor<128x1024xbf16>
    %2 = linalg.matmul {KBlock = 32 : i32, KThreads = 1 : i32, MBlock = 32 : i32, MThreads = 4 : i32, NBlock = 128 : i32, NThreads = 14 : i32, cast = #linalg.type_fn<cast_signed>, innermostKBlock = 32 : i32, innermostMBlock = 32 : i32, innermostNBlock = 32 : i32} ins(%arg0, %arg1 : tensor<128x512xbf16>, tensor<512x1024xbf16>) outs(%1 : tensor<128x1024xbf16>) -> tensor<128x1024xbf16>
    return %2 : tensor<128x1024xbf16>
  }
}
```
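One plausible account of where the dynamic shape comes from (a hypothetical sketch, not the project's actual tiling code): in the config above, `NThreads = 14` does not evenly divide N = 1024, so splitting that dimension across threads yields unequal per-thread chunks, and the tile extent can no longer be a static constant.

```python
# Hypothetical sketch: per-thread chunk sizes when a dimension is split
# across a thread count that does not divide it evenly. Unequal chunks
# mean the tile extent is dynamic ("?") after bufferization.
def tile_sizes(dim, num_threads):
    """Chunk sizes distributing `dim` elements over `num_threads` threads."""
    base = dim // num_threads
    rem = dim % num_threads
    return [base + (1 if t < rem else 0) for t in range(num_threads)]

n_tiles = tile_sizes(1024, 14)  # NThreads = 14 from the config above
# The chunks are not all equal, so a single static tile size cannot cover
# every thread; with NThreads = 16 instead, 1024 splits evenly.
all_static = len(set(n_tiles)) == 1
```

With an even split (e.g. `NThreads = 16`), every chunk is 64 and the tile shape stays static.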

After one-shot bufferization, we encounter the following:

```mlir
%alloc_3 = memref.alloc(%6) {alignment = 64 : i64} : memref<32x?xf32>
```

which is further lowered to an un-eliminable `builtin.unrealized_conversion_cast`.

yifeizh2 commented 2 weeks ago

Synced offline with @zhczhong: the issue is caused by brgemm encountering a dynamic stride. We already have logic in `prepareConfigCandidates` to filter out such invalid configs; the same logic is now moved into `validateConfig` so that tuner-generated configs are validated as well.
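The shape of such a check can be sketched as follows (a hypothetical illustration with made-up names and criteria, not the repository's `validateConfig` implementation): reject any candidate config whose thread split leaves a remainder on a dimension, since the resulting tiles have dynamic extents and imply the dynamic strides brgemm cannot handle.

```python
# Hypothetical validateConfig-style check (illustrative only): a config is
# valid only if each dimension divides evenly across its thread count, so
# every tile has a static extent and therefore static strides.
def validate_config(m, n, k, cfg):
    checks = [
        (m, cfg["MThreads"]),
        (n, cfg["NThreads"]),
        (k, cfg["KThreads"]),
    ]
    return all(dim % threads == 0 for dim, threads in checks)

cfg = {"MThreads": 4, "NThreads": 14, "KThreads": 1}
ok = validate_config(128, 1024, 512, cfg)  # rejected: 1024 % 14 != 0
```

Running the same check both when preparing candidates and when validating tuner-supplied configs keeps the two paths from disagreeing about which configs are legal.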