Open joesavage opened 4 years ago
However, given that this is quite a complex pattern to peephole optimise, it feels like maybe it should belong somewhere else, perhaps in simplification. Of course, I don't think that's possible today without supporting multi-dimensional broadcasts or casts between vectors of different lengths, but maybe it's a better long term solution.
This is very close to being merged (#4873). I think once this is merged, the best way to fix this is for the simplifier to rewrite `interleave(broadcast(x1), broadcast(x2), ...)` to `broadcast(interleave(x1, x2, ...))`.
We might then still need to make sure that generates good code, but this seems more straightforward, no pattern matching required. It also seems familiar from vrmpy codegen on Hexagon, which does work (concats of scalar loads do get optimized the way we need them to here, using the same mechanism of "interleaving" the scalars).
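As a rough illustration of why the rewrite suggested above is value-preserving, here is a small sketch in plain Python that models vectors as lists. It is not Halide code: `broadcast`, `interleave`, and `rewrite_interleave_of_broadcasts` are hypothetical helpers written for this example, and the sketch assumes that a broadcast of a vector repeats the whole vector `lanes` times.

```python
def broadcast(value, lanes):
    """Repeat a scalar, or a whole vector (list), `lanes` times.

    Assumption: a vector-valued broadcast repeats the entire vector.
    """
    if isinstance(value, list):
        return value * lanes
    return [value] * lanes


def interleave(*vectors):
    """Take lane 0 of every input, then lane 1 of every input, and so on."""
    assert len({len(v) for v in vectors}) == 1, "inputs must have equal lanes"
    return [v[i] for i in range(len(vectors[0])) for v in vectors]


def rewrite_interleave_of_broadcasts(scalars, lanes):
    """Hypothetical rewrite: interleave the scalars first, then broadcast."""
    # Each scalar is treated as a one-lane vector for the inner interleave.
    return broadcast(interleave(*[[s] for s in scalars]), lanes)


scalars = ["x1", "x2", "x3"]
lanes = 4

# Original form: interleave(broadcast(x1), broadcast(x2), broadcast(x3))
before = interleave(*[broadcast(s, lanes) for s in scalars])

# Rewritten form: broadcast(interleave(x1, x2, x3))
after = rewrite_interleave_of_broadcasts(scalars, lanes)

print(before == after)
```

Both forms yield the same lane order in this model, so any remaining work would lie in making sure the rewritten form lowers to good code.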
I've been doing some prototyping recently around Halide GEMM kernels for AArch64, and in doing so, seem to have uncovered some code generation issues around interleaves of broadcasts.
To illustrate this, I've put together the following test case:
If we compile this with the `atomic_vectorization` branch from #4628, we end up with something like the following Halide IR in the inner loop:
Zooming in on the first input to the multiply, with elements loaded from `A_in`, we see that the interleave here actually represents a repeating pattern. We might intuitively think about this as `x4(ramp(t21, 1, 4))`, where the resultant vector contains the pattern `A_in[t21], A_in[t21 + 1], A_in[t21 + 2]` repeated four times. Since Halide doesn't yet support generating multi-dimensional broadcasts like this, however, we're stuck with this slightly odd interleave-of-broadcasts representation.
Unfortunately, as a result of this, the compiler as of today generates the following LLVM IR:
This vastly over-complicates what is really quite a simple operation. As a result, I'm seeing instructions like `ldrb`, `zip1`, `zip2`, and individual lane loads and moves in the final assembly, when we really just want a vector load followed by some indexing. In this case, probably something like the following LLVM IR:
In my prototyping, I'm currently working around this with some hacked-together code in the back-end that detects shuffles of this type and emits the right IR. However, given that this is quite a complex pattern to peephole optimise, it feels like maybe it should belong somewhere else, perhaps in simplification. Of course, I don't think that's possible today without supporting multi-dimensional broadcasts or casts between vectors of different lengths, but maybe it's a better long term solution.
Does anyone have any thoughts on how this should be improved, or indeed whether there's a better way around this that I'm not seeing? If it doesn't end up being a huge task, I'm happy to work on this myself.