nod-ai / iree-amd-aie

IREE plugin repository for the AMD AIE accelerator
Apache License 2.0

[Matmul+Truncf] Enable Matmul+Truncf for shorter shape on Pack-Peel + Objectfifo #822

Closed: Abhishek-Varma closed this 1 month ago

Abhishek-Varma commented 1 month ago

-- This commit includes arith.truncf, vector.transfer_read and vector.transfer_write in the amdaie.core op.
-- This is required to make "Matmul + truncf" work with vectorization enabled for the arith.truncf op.

Signed-off-by: Abhishek Varma abhvarma@amd.com
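
For context, here is a minimal sketch of the vectorized truncation pattern these three ops form after vectorization (illustrative only, not IR from this PR; the enclosing amdaie.core region is omitted):

func.func @vectorized_truncf(%src: memref<32x32xf32>, %dst: memref<32x32xbf16>) {
  // Read the f32 accumulator tile as a vector (%pad is the padding value
  // required by vector.transfer_read).
  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : f32
  %v = vector.transfer_read %src[%c0, %c0], %pad : memref<32x32xf32>, vector<32x32xf32>
  // Truncate f32 -> bf16 elementwise on the whole vector.
  %t = arith.truncf %v : vector<32x32xf32> to vector<32x32xbf16>
  // Write the bf16 result back.
  vector.transfer_write %t, %dst[%c0, %c0] : vector<32x32xbf16>, memref<32x32xbf16>
  return
}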

Abhishek-Varma commented 1 month ago

Could you add an e2e test?

This is the first in a series of PRs that need to go in before I can add an e2e test for "Matmul + truncf" with vectorization enabled for the elementwise op.

Abhishek-Varma commented 1 month ago

As discussed offline with @jtuyls - I have pushed most of the required changes into this same PR.

Will be raising a separate PR for flattening arith.truncf, and once that goes in I'll add an e2e test in this PR.
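
(Roughly, the flattening rewrites the multi-dimensional vector truncf into a 1-D form along these lines - a sketch only, the actual pattern may differ:)

// Hypothetical flattening of a 2-D vector truncf into its 1-D form.
%flat = vector.shape_cast %v : vector<32x32xf32> to vector<1024xf32>
%t1d = arith.truncf %flat : vector<1024xf32> to vector<1024xbf16>
%t = vector.shape_cast %t1d : vector<1024xbf16> to vector<32x32xbf16>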

Will update the PR title and description accordingly - so marking this as a draft.

Abhishek-Varma commented 1 month ago

Here's the short-shape Matmul + truncf e2e IR log that this PR currently enables (the bigger shapes need to be addressed incrementally; their e2e IR, in case someone wants to take a look, is here), and the numerics were verified locally.

But for the e2e test via cpu_comparisons/run.py I get the following failure for the CPU run itself, let alone the AIE run:

iree/runtime/src/iree/tooling/numpy_io.c:419:
   UNIMPLEMENTED; unsupported data encoding; outputting results; processing function outputs;
   `sync func @matmul_truncf(%input0: tensor<32x32xbf16>, %input1: tensor<32x32xbf16>) -> (%output0: tensor<32x32xbf16>)`

The above is not a verification issue at the MLIR level; otherwise it would have bailed out earlier.

I checked the input MLIR file that's generated locally and see no issue with it:

// input 32x32xbf16
// input 32x32xbf16

func.func @matmul_truncf(%arg0: tensor<32x32xbf16>, %arg1: tensor<32x32xbf16>) -> tensor<32x32xbf16>
{
  %cst = arith.constant 0.0 : f32
  %0 = tensor.empty() : tensor<32x32xf32>
  %1 = linalg.fill ins(%cst : f32) outs(%0 : tensor<32x32xf32>) -> tensor<32x32xf32>
  %2 = linalg.matmul ins(%arg0, %arg1 : tensor<32x32xbf16>, tensor<32x32xbf16>)
    outs(%1: tensor<32x32xf32>) -> tensor<32x32xf32>
  %3 = arith.truncf %2 : tensor<32x32xf32> to tensor<32x32xbf16>
  return %3: tensor<32x32xbf16>
}

Am I missing some other template to include?

newling commented 1 month ago

Am I missing some other template to include?

No, this is all correct. I think the issue is that iree-run-module doesn't support writing bfloat16 values. I'll investigate further.
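
In the meantime, one possible workaround (an untested sketch on my side) would be to have the test function extend the result back to f32 before returning, so the tooling never has to serialize bf16 output:

// Hypothetical tail for the test function: widen bf16 back to f32 for output.
%4 = arith.extf %3 : tensor<32x32xbf16> to tensor<32x32xf32>
return %4 : tensor<32x32xf32>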