iree-org / iree

A retargetable MLIR-based machine learning compiler and runtime toolkit.
http://iree.dev/
Apache License 2.0

[Codegen][GPU] Rework GPUReduceSharedMemoryBankConflicts pass #18393

Status: Open · qedawkins opened this issue 2 months ago

qedawkins commented 2 months ago

Problem Description

The current implementation of this pass searches for shared memory allocations, pads them along the innermost dimension, and then propagates the resulting memref type change to all consumers. This approach has a few issues:

  1. Updating/propagating the memref type is not always possible (e.g. if the memref.alloc() is used by a memref.collapse_shape op). This can lead to crashes when propagation fails.
  2. This approach is overfit to matmul pipelines and does not analyze actual access patterns.
  3. Higher-dimensional allocations with small inner dimensions (e.g. memref<16x16x4xf32>) will be indiscriminately padded along the inner dimension, introducing too much padding (see the sketch below).

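For illustration, the current pass performs roughly the following rewrite (a sketch; the 4-element pad and the workgroup address space are assumptions for the example, not the pass's exact configuration):

// Original allocation: the inner dimension is only 4 elements wide.
%alloc = memref.alloc() : memref<16x16x4xf32, #gpu.address_space<workgroup>>

// After the pass (roughly): the inner dimension is padded and all consumers
// are rewritten against a view of the original shape.
%padded = memref.alloc() : memref<16x16x8xf32, #gpu.address_space<workgroup>>
%view = memref.subview %padded[0, 0, 0] [16, 16, 4] [1, 1, 1]
    : memref<16x16x8xf32, #gpu.address_space<workgroup>>
    to memref<16x16x4xf32, strided<[128, 8, 1]>, #gpu.address_space<workgroup>>
// The allocation now uses 2x the shared memory, even if the access pattern
// never needed the padding.
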
To address this, we should try to do such padding earlier in the compilation pipeline, on tensors, when both 1) we can observe the access pattern for the allocation, and 2) no propagation of types is needed.

Solutions

The first potential solution to try is simply padding the bufferization.alloc_tensor when we create it and adding tensor.extract_slice to get back to the original shape, e.g.

%dest = bufferization.alloc_tensor() : tensor<64x68xf32>
%slice = tensor.extract_slice %dest[0, 0] [64, 64] [1, 1] : tensor<64x68xf32> to tensor<64x64xf32>
scf.forall ... shared_outs(%init = %slice)
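
For context, after bufferization this would be expected to turn into something along these lines (a sketch assuming a workgroup address space, not the exact IR the compiler produces):

%alloc = memref.alloc() : memref<64x68xf32, #gpu.address_space<workgroup>>
%view = memref.subview %alloc[0, 0] [64, 64] [1, 1]
    : memref<64x68xf32, #gpu.address_space<workgroup>>
    to memref<64x64xf32, strided<[68, 1]>, #gpu.address_space<workgroup>>
// Consumers only see the 64x64 view, but rows are now 68 elements apart,
// which is the padding that reduces bank conflicts.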

This approach is susceptible to variance in bufferization and to any potential future canonicalizations that would fold such a slice + allocation away into a smaller allocation. Another option is to introduce a new allocation op similar to bufferization.alloc_tensor that includes the slicing semantics, i.e.

%dest = iree_codegen.alloc_tensor() [64, 68] : tensor<64x64xf32>
scf.forall ... shared_outs(%init = %dest)
MaheshRavishankar commented 2 months ago

Problem Description

The current implementation of this pass searches for shared memory allocations, pads them along the innermost dimension, and then propagates the resulting memref type change to all consumers. This approach has a few issues:

  1. Updating/propagating the memref type is not always possible (e.g. if the memref.alloc() is used by a memref.collapse_shape op). This can lead to crashes when propagation fails.

I don't think so. It should always be possible to do this. I'd like some more details as to why not.

  2. This approach is overfit to matmul pipelines and does not analyze actual access patterns.
  3. Higher-dimensional allocations with small inner dimensions (e.g. memref<16x16x4xf32>) will be indiscriminately padded along the inner dimension, introducing too much padding.

To address this, we should try to do such padding earlier in the compilation pipeline, on tensors, when both 1) we can observe the access pattern for the allocation, and 2) no propagation of types is needed.

Solutions

The first potential solution to try is simply padding the bufferization.alloc_tensor when we create it and adding tensor.extract_slice to get back to the original shape, e.g.

%dest = bufferization.alloc_tensor() : tensor<64x68xf32>
%slice = tensor.extract_slice %dest[0, 0] [64, 64] [1, 1] : tensor<64x68xf32> to tensor<64x64xf32>
scf.forall ... shared_outs(%init = %slice)

This approach is susceptible to variance in bufferization and to any potential future canonicalizations that would fold such a slice + allocation away into a smaller allocation.

Such a folding should be illegal. This is precisely why this op exists... In any case, at the very least it shouldn't be a canonicalization.

Another option is to introduce a new allocation op similar to bufferization.alloc_tensor that includes the slicing semantics, i.e.

%dest = iree_codegen.alloc_tensor() [64, 68] : tensor<64x64xf32>
scf.forall ... shared_outs(%init = %dest)

I want to avoid using the crutch of "just add a new op".

qedawkins commented 2 months ago

Problem Description

The current implementation of this pass searches for shared memory allocations, pads them along the innermost dimension, and then propagates the resulting memref type change to all consumers. This approach has a few issues:

  1. Updating/propagating the memref type is not always possible (e.g. if the memref.alloc() is used by a memref.collapse_shape op). This can lead to crashes when propagation fails.

I don't think so. It should always be possible to do this. I'd like some more details as to why not.

The main culprit is memref.collapse_shape. For example:

// Contiguous allocation (strides [64, 1]), so collapsing to 1D is fine.
%0 = memref.alloc() : memref<64x64xf32>
%1 = memref.collapse_shape %0 [[0, 1]] : memref<64x64xf32> into memref<4096xf32>

but trying to pad the inner dim of %0 leaves a non-contiguous subview that memref.collapse_shape cannot collapse:

%0 = memref.alloc() : memref<64x68xf32>
%1 = memref.subview %0[0, 0] [64, 64] [1, 1]
    : memref<64x68xf32> to memref<64x64xf32, strided<[68, 1], offset: 0>>
// Cannot collapse %1: the rows of the 64x64 view are now 68 elements apart,
// so the collapsed result would not be contiguous.
%2 = memref.collapse_shape %1 [[0, 1]]