numba / llvmlite

A lightweight LLVM python binding for writing JIT compilers
https://llvmlite.pydata.org/
BSD 2-Clause "Simplified" License

LLVM IR vector type support #211

Open eamartin opened 8 years ago

eamartin commented 8 years ago

Are there any plans to eventually support LLVM vector types?

I've not personally used LLVM vector types, but they seem like a useful abstraction to target SIMD instructions.
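For context, LLVM vector types are first-class SIMD values such as `<4 x float>`, and arithmetic instructions like `fadd` apply lane-wise to them. A minimal sketch of the textual IR one would want llvmlite to be able to emit, built here as a plain string purely for illustration (llvmlite's IR builder is not used, since vector types are exactly what this issue asks for):

```python
def vec_fadd_ir(lanes=4):
    """Return textual LLVM IR for a lane-wise float add on a vector type."""
    vty = f"<{lanes} x float>"
    return "\n".join([
        # A function taking two vectors and returning their lane-wise sum.
        f"define {vty} @vadd({vty} %a, {vty} %b) {{",
        f"  %sum = fadd {vty} %a, %b",
        f"  ret {vty} %sum",
        "}",
    ])

print(vec_fadd_ir())
```

On an AVX2 target, LLVM lowers an 8-lane version of this to a single `vaddps` instruction.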

seibert commented 8 years ago

I don't think we had planned to add these types ourselves, as most of our llvmlite development is being driven by Numba needs. Right now we're relying on the autovectorization passes to convert scalars to vectors for us, which has obvious limitations.

seibert commented 8 years ago

I should say, if someone does want to contribute this to llvmlite, we would be interested.

sklam commented 8 years ago

I was looking at the masked vector intrinsics and thought about what is needed for adding vector types. I will write down some notes here:

For numba:

maedoc commented 7 years ago

What would this look like on the Numba side? The easiest thing I can think of (or at least what I'd like) is that an inner loop of SIMD width (2, 4, 8, etc.) without any fancy control flow could be marked as suitable for vectorization, perhaps similarly to prange?
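A purely hypothetical sketch of what such a marker might look like. Nothing here is real Numba API: `vrange` is an invented stand-in that behaves exactly like `range` in plain Python, while the idea is that a JIT could treat a loop over it as one SIMD operation over that many lanes:

```python
def vrange(n):
    # Hypothetical marker, not a Numba feature: in plain Python it is just
    # range(); a JIT seeing it could lower the marked loop to a single
    # n-lane vector operation instead of n scalar iterations.
    return range(n)

def saxpy8(a, x, y, out):
    # The marked inner loop is the candidate for an 8-lane SIMD body.
    for j in vrange(8):
        out[j] = a * x[j] + y[j]
```

In plain Python this computes the same values a scalar loop would; the marker only communicates intent to a compiler.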

seibert commented 7 years ago

In simple cases, Numba already benefits from LLVM autovectorization passes. We're working with Intel to enable LLVM to use SVML for SIMD vector math functions (when SVML is available) in the autovectorizer. Explicitly doing SIMD vector operations at the Python level is likely to be pretty clunky, so we're mostly interested in making sure the autovectorizer in LLVM can do as much as possible. (And this doesn't require the introduction of SIMD vector intrinsics in llvmlite.)

maedoc commented 7 years ago

I asked on this issue because autovectorization currently seems to work well from Clang but not from Numba, e.g.

import numba
from numba import float32

@numba.jit
def loop(a, b, c, out):
    # out is a 2-D float32 array of shape (1000, 8)
    rec_b = float32(b / 10.0)
    rec_c = float32(c * b / 42.0)
    for i in range(1000):
        for j in range(8):
            out[i, j] = a + i * rec_b + i * rec_c

vs

void loop(float a, float b, float c, float *out)
{
  float rec_b = b / 10.0;
  float rec_c = c * b / 42.0;
  for (int i=0; i<1000; i++)
    for (int j=0; j<8; j++)
      out[i*8 + j] = a + i * rec_b + i * rec_c;
}

In the former, the optimized IR uses regular scalar floats, while in the latter the IR shows work being done on <8 x float> values (compiling with the -march=core-avx2 flag). That is why I jumped on this issue: I was guessing that the difference actually lies in the Clang frontend rather than in the LLVM autovectorization passes, so if these operations could be expressed by Numba, vectorization could be guaranteed instead of hoped for.
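For reference, a plain-Python version of the computation both snippets above perform (no Numba, shapes as in the example). One thing worth noting: the loop body never reads j, so every iteration of the inner loop stores the same value, and a vectorizer can in principle collapse it into a single 8-lane broadcast store:

```python
def loop_ref(a, b, c):
    """Scalar reference for the Numba/C loops above: out[i][j] = a + i*rec_b + i*rec_c."""
    rec_b = b / 10.0
    rec_c = c * b / 42.0
    out = [[0.0] * 8 for _ in range(1000)]
    for i in range(1000):
        for j in range(8):
            # The value depends only on i, not j: all 8 lanes are identical.
            out[i][j] = a + i * rec_b + i * rec_c
    return out
```

This makes the example a best case for the vectorizer, which is why the gap between the Clang and Numba IR is surprising.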

seibert commented 7 years ago

We have found in the past that subtle differences in the LLVM IR can result in LLVM optimization passes working or not. Since the Clang developers know these tricks, we frequently will inspect the LLVM IR output from Clang to learn about undocumented or underdocumented features.

Can you open a Numba issue with the example you listed above? We should see if we can copy what Clang is doing.