eamartin opened this issue 8 years ago (Open)
I don't think we had planned to add these types ourselves, as most of our llvmlite development is being driven by Numba needs. Right now we're relying on the autovectorization passes to convert scalars to vectors for us, which has obvious limitations.
I should say, if someone does want to contribute this to llvmlite, we would be interested.
I was looking at the masked vector intrinsics and thought about what would be needed to add vector types. I will write down some notes here:
- `VectorType`: can easily be made by copying `ArrayType`
- `insertelement` and `extractelement` instructions (basically a copy of `insertvalue` and `extractvalue`)
- `gep` for vector-of-pointers (see: http://blog.llvm.org/2011/12/llvm-31-vector-changes.html)
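To make the notes above concrete, here is a rough sketch of what building vector IR from Python could look like, assuming a `VectorType` modeled on the existing `ArrayType` plus `insert_element`/`extract_element` builder methods mirroring `insert_value`/`extract_value`. None of this exists in llvmlite today, so the names are illustrative only:

```python
from llvmlite import ir

f32 = ir.FloatType()
vec4 = ir.VectorType(f32, 4)          # hypothetical: <4 x float>

mod = ir.Module(name="vec_sketch")
fn = ir.Function(mod, ir.FunctionType(vec4, (f32,)), name="splat_and_double")
builder = ir.IRBuilder(fn.append_basic_block())

(x,) = fn.args
vec = ir.Constant(vec4, ir.Undefined)         # start from an undef vector
for lane in range(4):                         # insertelement, one lane at a time
    vec = builder.insert_element(vec, x, ir.Constant(ir.IntType(32), lane))
doubled = builder.fadd(vec, vec)              # a single <4 x float> fadd
# extract_element would read a single lane back out
builder.ret(doubled)

print(mod)
```

The appeal is that once the type and the two instructions exist, the rest of the builder machinery (binops, loads/stores, gep) should mostly work on vector values unchanged.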
For Numba: what would this look like? The easiest thing I can think of (or at least what I'd like) is that an inner loop of SIMD size (2, 4, 8, etc.) with no fancy control flow could be marked as a candidate for vectorization, perhaps similarly to `prange` (see the sketch below).
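For comparison, this is roughly what the existing `prange` marking looks like; a hypothetical SIMD-width marker could follow the same pattern. This is a sketch only, and `prange` here marks the loop as parallel, not as vectorized:

```python
import numpy as np
import numba
from numba import prange

@numba.njit(parallel=True)
def scale(a, out):
    # The marked loop tells Numba its iterations are independent;
    # a SIMD marker could communicate a vector width the same way.
    for i in prange(a.shape[0]):
        out[i] = a[i] * np.float32(2.0)

a = np.arange(16, dtype=np.float32)
out = np.empty_like(a)
scale(a, out)
```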
In simple cases, Numba already benefits from LLVM autovectorization passes. We're working with Intel to enable LLVM to use SVML for SIMD vector math functions (when SVML is available) in the autovectorizer. Explicitly doing SIMD vector operations at the Python level is likely to be pretty clunky, so we're mostly interested in making sure the autovectorizer in LLVM can do as much as possible. (And this doesn't require the introduction of SIMD vector intrinsics in llvmlite.)
I asked on this issue because autovectorization currently seems to work well from Clang but not from Numba, e.g.
```python
import numba
from numba import float32

@numba.jit
def loop(a, b, c, out):
    rec_b = float32(b / 10.0)
    rec_c = float32(c * b / 42.0)
    for i in range(1000):
        for j in range(8):
            out[i, j] = a + i * rec_b + i * rec_c
```
vs
```c
void loop(float a, float b, float c, float *out)
{
    float rec_b = b / 10.0;
    float rec_c = c * b / 42.0;
    for (int i = 0; i < 1000; i++)
        for (int j = 0; j < 8; j++)
            out[i * 8 + j] = a + i * rec_b + i * rec_c;
}
```
In the former, the optimized IR works on regular scalar floats, while in the latter the IR shows work being done on `<8 x float>` vectors (compiling with the `-march=core-avx2` flag). That is why I jumped on this issue: I suspect the difference actually comes from the Clang frontend rather than the LLVM autovectorization passes, so if vector types could be expressed from Numba, vectorization would be easier to guarantee instead of hoping that it happens automatically.
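One way to check this from the Numba side is to dump the IR and assembly the dispatcher actually generated. A minimal sketch, assuming the jitted `loop` above has been called once so a compiled signature exists:

```python
import numpy as np

out = np.zeros((1000, 8), dtype=np.float32)
loop(np.float32(1.0), np.float32(0.5), np.float32(2.0), out)   # trigger compilation

# Look for vector types / AVX registers in what Numba actually produced.
for sig, llvm_ir in loop.inspect_llvm().items():
    print(sig, "vectorized IR:", "<8 x float>" in llvm_ir)
for sig, asm in loop.inspect_asm().items():
    print(sig, "uses ymm registers:", "ymm" in asm)
```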
We have found in the past that subtle differences in the LLVM IR can determine whether LLVM optimization passes kick in or not. Since the Clang developers know these tricks, we frequently inspect the LLVM IR output from Clang to learn about undocumented or underdocumented features.
Can you open a Numba issue with the example you listed above? We should see if we can copy what Clang is doing.
Are there any plans to eventually support LLVM vector types?
I've not personally used LLVM vector types, but they seem like a useful abstraction to target SIMD instructions.