Open yzhang93 opened 1 month ago
In contrast, the bf16-f32 model (without arith.truncf %11 : f32 to bf16) shown below does not hit this error.
!lhs = tensor<1024x512xbf16>
!rhs = tensor<512x1024xbf16>
!ele = tensor<1024x1024xf32>
!res = tensor<1024x1024xf32>
func.func @matmul_elementwise_bf16(%lhs : !lhs, %rhs : !rhs, %ele : !ele) -> !res {
%cst = arith.constant 0.0 : f32
%0 = tensor.empty() : !ele
%fill = linalg.fill ins(%cst : f32) outs(%0 : !ele) -> !ele
%2 = linalg.matmul ins(%lhs, %rhs : !lhs, !rhs) outs(%fill : !ele) -> !ele
%res = linalg.generic {indexing_maps = [affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>, affine_map<(d0, d1) -> (d0, d1)>], iterator_types = ["parallel", "parallel"]} ins(%2, %ele : !ele, !ele) outs(%0 : !ele) {
^bb0(%in: f32, %in_0: f32, %out: f32):
%11 = arith.addf %in, %in_0 : f32
linalg.yield %11 : f32
} -> !res
return %res : !res
}
@MaheshRavishankar @stephenneuendorffer @newling @erwei-xilinx Any insight about the issue?
I don't know if Peano handles bf16 natively.
I believe there's work going on to implement shuffle_vector. Currently, the assumption is that vector ops always go through intrinsics. FYI, for Peano issues, you're better off capturing the .ll code and creating an issue in the Peano repo.
Peano does support bf16 types, and there is indeed work to support more and more cases of generic shuffle_vector. However, I think the problem here is rather that %1730:_(<1024 x s16>) is a huge vector, and we do not have the capability yet to properly legalize those. As Stephen said, it would be very useful if you could get us a small .ll reproducer, then we can investigate what's really happening here :)
Support for G_SHUFFLE_VECTOR in Peano will be under review soon, so it should land before long. The failing instruction asks for 16-bit elements, so bf16 support is not the issue in any case. There are two problems with the code as is:
G_SHUFFLE_VECTOR is incredibly slow, since it needs to extract each value of the vector in turn and then reconstruct the vector element by element. This instruction only changes the last 64 bytes, so it will do 32,640 bytes of useless memory operations. We can reduce this a lot by matching the patterns that you depend on and replacing them with better instructions, but that does require us to know which G_SHUFFLE_VECTOR masks are required.
Thanks @stephenneuendorffer @gbossu @ValentijnvdBeek for looking into the issue! Here are the .ll files generated from the above example. Please let me know if you need me to provide other sources.
input_ll.zip
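For readers following the G_SHUFFLE_VECTOR discussion above, here is a minimal Python sketch of why an element-by-element lowering is wasteful when the mask only changes a small tail. It is an illustration only, not Peano's implementation: the function names (naive_shuffle, changed_lanes, patched_shuffle) are hypothetical, the mask is assumed to have the same length as each operand, and the lane counts are illustrative rather than an attempt to reproduce the byte figure quoted in the thread.

```python
from typing import List, Sequence


def naive_shuffle(a: Sequence[int], b: Sequence[int], mask: Sequence[int]) -> List[int]:
    """Element-by-element shuffle: lane i of the result is concat(a, b)[mask[i]].

    Mask values 0..n-1 select from a, n..2n-1 select from b
    (the same convention as LLVM's shufflevector).
    """
    src = list(a) + list(b)
    return [src[m] for m in mask]  # touches every lane, even unchanged ones


def changed_lanes(mask: Sequence[int]) -> List[int]:
    """Lanes where the mask is not the identity on the first operand."""
    return [i for i, m in enumerate(mask) if m != i]


def patched_shuffle(a: Sequence[int], b: Sequence[int], mask: Sequence[int]) -> List[int]:
    """Hypothetical pattern-matched lowering: copy a wholesale, rewrite only changed lanes."""
    src = list(a) + list(b)
    out = list(a)  # one bulk copy stands in for "leave the vector in place"
    for i in changed_lanes(mask):
        out[i] = src[mask[i]]
    return out


# Assumed example: 1024 lanes of 16 bits, where only the last 32 lanes
# (64 bytes) are taken from the second operand.
LANES = 1024
a = list(range(LANES))
b = list(range(LANES, 2 * LANES))
mask = list(range(LANES))
for i in range(LANES - 32, LANES):
    mask[i] = LANES + i  # take lane i from b instead of a

assert naive_shuffle(a, b, mask) == patched_shuffle(a, b, mask)
print(len(mask), len(changed_lanes(mask)))  # prints: 1024 32
```

In this sketch the naive version visits all 1024 lanes while only 32 actually change, which is the kind of mask pattern a dedicated lowering could match.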
Input IR
Error: