Open qedawkins opened 2 months ago
Problem Description
The current implementation of this pass searches for shared memory allocations, pads them along the innermost dimension, and then propagates the change to the memref type to all consumers. This approach has a few issues:
- Updating/propagating the memref type is not always possible (e.g. if the `memref.alloc()` is used by a `memref.collapse_shape` op). This can lead to crashes if propagation fails.
  > I don't think so. It should always be possible to do this. I'd like some more details as to why not.
- This approach is overfit to matmul pipelines and does not analyze actual access patterns.
- Higher-dimensional allocations with small inner dimensions (e.g. `memref<16x16x4xf32>`) will be indiscriminately padded along the inner dimension, introducing too much padding (see the sketch below).

To address this, we should try to do such padding earlier in the compilation pipeline, on tensors, when both 1) we can observe the access pattern for the allocation and 2) no propagation of types is needed.
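To make the problem concrete, here is a rough sketch of the kind of rewrite the current memref-level pass performs on such an allocation. The pad amount (4 elements), the workgroup memory space attribute, and the subview-based reconciliation are illustrative assumptions, not taken from the actual implementation.

```mlir
// Before: a shared memory allocation with a small innermost dimension.
%alloc = memref.alloc() : memref<16x16x4xf32, #gpu.address_space<workgroup>>

// After: the innermost dimension gets padded (by 4 elements here, purely for
// illustration), doubling the allocation, and the new type has to be
// reconciled with every consumer, e.g. via a subview with padded strides.
%padded = memref.alloc() : memref<16x16x8xf32, #gpu.address_space<workgroup>>
%view = memref.subview %padded[0, 0, 0] [16, 16, 4] [1, 1, 1]
    : memref<16x16x8xf32, #gpu.address_space<workgroup>>
    to memref<16x16x4xf32, strided<[128, 8, 1]>, #gpu.address_space<workgroup>>
```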
Solutions
The first potential solution to try is simply padding the `bufferization.alloc_tensor` when we create it and adding a `tensor.extract_slice` to get back to the original shape, e.g.

```mlir
%dest = bufferization.alloc_tensor() : tensor<64x68xf32>
%slice = tensor.extract_slice %dest[0, 0] [64, 64] [1, 1] : tensor<64x68xf32> to tensor<64x64xf32>
scf.forall ... shared_outs(%init = %slice)
```
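For reference, the hope is that this pattern bufferizes into a padded allocation plus a subview, roughly as in the sketch below (hand-written for illustration, not actual bufferization output; memory spaces omitted):

```mlir
// Padded backing allocation produced from the padded alloc_tensor.
%alloc = memref.alloc() : memref<64x68xf32>
// The extract_slice becomes a subview; consumers keep the 64x64 shape and the
// padding shows up only in the strides.
%view = memref.subview %alloc[0, 0] [64, 64] [1, 1]
    : memref<64x68xf32> to memref<64x64xf32, strided<[68, 1]>>
```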
This approach is susceptible to variance in bufferization and to any potential future canonicalizations that would fold such slice + allocation pairs into a smaller allocation.
> Such a folding should be illegal. This is precisely why this op exists... In any case, at least it shouldn't be a canonicalization.
Another option is to introduce a new allocation op similar to `bufferization.alloc_tensor` that includes the slicing semantics, i.e.

```mlir
%dest = iree_codegen.alloc_tensor() [64, 68] : tensor<64x64xf32>
scf.forall ... shared_outs(%init = %dest)
```
> I want to avoid using the crutch of "just add a new op".
> - Updating/propagating the memref type is not always possible (e.g. if the `memref.alloc()` is used by a `memref.collapse_shape` op). This can lead to crashes if propagation fails.
>
> I don't think so. It should always be possible to do this. I'd like some more details as to why not.
The main culprit is `memref.collapse_shape`. For example:

```mlir
%0 = memref.alloc() : memref<64x64xf32, strided<[64, 1], offset: 0>>
%1 = memref.collapse_shape %0 [[0, 1]]
    : memref<64x64xf32, strided<[64, 1], offset: 0>> into memref<4096xf32, strided<[1], offset: 0>>
```
but trying to pad the inner dim of `%0` would lead to a non-contiguous `memref.collapse_shape` op: the 64x64 view of the padded buffer has strides `[68, 1]`, so its 4096 elements are no longer contiguous in memory and the collapse is invalid:
```mlir
%0 = memref.alloc() : memref<64x68xf32, strided<[68, 1], offset: 0>>
%1 = memref.subview %0[0, 0] [64, 64] [1, 1]
    : memref<64x68xf32, strided<[68, 1], offset: 0>>
    to memref<64x64xf32, strided<[68, 1], offset: 0>>
%2 = memref.collapse_shape %1 [[0, 1]] // Cannot collapse because of the non-contiguous subview
```