(Sorry for the "user mailing list" question here; I took a look at the discussion group and it didn't appear to be very active, so I decided to post an issue instead.)

I'm trying to determine what the target use case of SIMDe is and whether it would fit my needs. I research and develop numerical algorithms. Where I can, I use BLAS, LAPACK, etc. to speed up dense matrix operations. Often, the code I want to write doesn't fit neatly into this paradigm, which means I end up writing a lot of bespoke C code that I want to vectorize. I've looked into OpenMP SIMD, Intel ISPC, and OpenCL as possible tools for the job. It appears that SIMDe would also be appropriate. Is it? Or is this not an intended use case? Especially since SIMDe uses OpenMP SIMD under the hood, I would be very interested to hear anyone's thoughts and suggestions. Thanks!
No problem; questions here are fine. I am going to go ahead and close this issue since it's not really a bug, but please feel free to keep chatting here :)
There are obviously trade-offs with all of the possibilities you mentioned, including SIMDe, but yes depending on exactly what you're doing I think it could be an appropriate target. You basically get all the benefits of writing an architecture-specific implementation using "normal" intrinsics, but better portability.
If you write your code using SIMDe's x86 functions, it will run at full speed if your target supports those functions natively (e.g., `simde_mm_add_ps` has exactly the same performance as `_mm_add_ps` if your target supports SSE). Assuming your implementation is good, you're going to get optimal performance when running on that platform... all SIMDe really does is make it possible to compile the code on other targets without a rewrite. OTOH, if you do target a different platform you're not generally going to get optimal performance there; `_mm_add_ps` is a bad example since everyone has a function to add 4 floats, but there are plenty of examples where the semantics don't match up perfectly and you may end up with some sub-optimal implementations.
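To make the drop-in usage concrete, here's a minimal sketch (assuming SIMDe's usual `simde/x86/sse.h` header layout; `add4` is just a made-up name):

```c
/* Minimal sketch: a portable 4-float add written against SIMDe's SSE API.
 * On an SSE target this compiles to the same instruction as the native
 * _mm_add_ps; on other targets SIMDe supplies an equivalent implementation. */
#include <simde/x86/sse.h>

simde__m128 add4(simde__m128 a, simde__m128 b) {
  return simde_mm_add_ps(a, b);
}
```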
For example, let's say you wanted the absolute value of a 32-bit integer, but only had SSE2 available. In that case, you might end up doing something like:
```c
simde__m128i m = simde_mm_cmpgt_epi32(simde_mm_setzero_si128(), a);
return simde_mm_sub_epi32(simde_mm_xor_si128(a, m), m);
```
In this case, if you're compiling for NEON, we have to map each function call to its NEON equivalent, which would be a bit like:
```c
int32x4_t m = vcgtq_s32(vdupq_n_s32(0), a);
return vsubq_s32(veorq_s32(a, m), m);
```
There is actually a bit missing in there since `vcgtq_s32` returns a `uint32x4_t`, but the conversion happens at the language level only; no instruction is generated.
If you were writing native NEON you would probably have just done:
```c
return vabsq_s32(a);
```
Obviously that would be better, but since SIMDe has to translate one function at a time we have no way of knowing that what you're really trying to do there is get an absolute value. This is really the biggest disadvantage of SIMDe in my mind.
Now, if you were using SIMDe you wouldn't really be constrained to SSE2 and could use the SSSE3 version:
```c
return simde_mm_abs_epi32(a);
```
Which SIMDe would be able to compile to a simple `vabsq_s32` call on NEON. On SSE2 we would use something like the first example above, so you're no worse off. Basically, I recommend you use newer ISA extensions if they're a better fit for what you're really trying to do, even if you don't expect to have them available.
I don't see a reason to ever use "raw" SIMD instrinsics instead of SIMDe... there is basically no advantage and the portability is worse. Even if you don't take advantage of the portability in a traditional way, you can just think of later ISA extensions as a convenience library with a lot of extra functionality.
Comparing with OpenMP SIMD, ISPC, and OpenCL is a bit more complicated. ISPC is a really cool option... the targets are a lot more limited and you're not likely to get quite as much performance as with intrinsics/SIMDe, but you're probably going to have an easier time writing the code.
OpenCL is generally used pretty differently... since it's designed to be able to operate on GPUs as well it's a bit more restricted (there are a lot of SIMD instructions which don't have OpenCL counterparts). Also, since it's expensive to hand data off to a GPU (or other coprocessor) you're really not supposed to be constantly jumping in and out of OpenCL code; you generally want to either do everything in OpenCL or nothing.
OpenMP SIMD is interesting. Yes, SIMDe uses OpenMP SIMD under the hood, but as a fallback... it's generally much faster to call the "right" intrinsics for whatever the compiler is targeting than to rely on OpenMP SIMD. It works very well for basic stuff, but it's not really expressive enough to take advantage of the more complex instructions. Going back to the absolute value example, you're probably going to get code like the slow version everywhere. For example, the OpenMP implementation for absolute value might be something like:
```c
#pragma omp simd
for (size_t i = 0; i < 4; i++) {
  r[i] = a[i] < 0 ? -a[i] : a[i];
}
```
Even assuming you're careful about alignment and everything, a given compiler may or may not be able to recognize that pattern as an absolute value operation. Using `simde_mm_abs_epi32` (or `simde_vabsq_s32`, etc.) is a much safer bet since SIMDe will know that what you're after is an absolute value operation. Only in the event SIMDe doesn't know of a really fast way to handle that on whatever your compiler is targeting will SIMDe fall back on OpenMP as a last-ditch attempt to get the compiler to do the work for us.
OpenMP SIMD does have some big advantages, though. For one thing, the code is often much easier to read to those who haven't memorized the behavior of the thousands of SIMD functions in modern SIMD ISA extensions. You'll generally leave a lot of performance on the table, but if you've already created a prototype using normal C code then adding OpenMP pragmas to a few loops is often a great way to get a significant performance boost with almost no effort.
Of all the options, I'd say that intrinsics/SIMDe is the most difficult to write but (assuming you know what you're doing, which is a high bar here) can provide the best performance gains. OpenMP is the easiest, but likely the poorest performing of the options. OpenCL is a bit awkward... TBH I'd really only use it if I were targeting GPUs but wanted to be able to fall back on the CPU. ISPC is sort of between OpenMP and intrinsics/SIMDe.
I should also mention one last interesting benefit of SIMDe: you can mix code for different architectures pretty easily. For example, you can start with an x86 implementation. When you compile for Arm, you can look for places where your code is spending a lot of time (either because it's particularly hot or the implementation is particularly slow) and rewrite only those sections using NEON intrinsics (with some ifdefs to choose the implementation). Eventually you can rewrite the whole thing if you want, but even if you do that at least you have working (if slow) code to start from, which makes debugging much easier.
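A rough sketch of what that mixing might look like (`hot_abs` is a made-up name; `__ARM_NEON` is the ACLE feature macro compilers define when NEON is available):

```c
/* Hypothetical example: most of the code stays on the SIMDe x86 API, but
 * one hot function gets a hand-written NEON path behind an ifdef. */
#if defined(__ARM_NEON)
  #include <arm_neon.h>
  static inline int32x4_t hot_abs(int32x4_t a) {
    return vabsq_s32(a);            /* hand-tuned NEON path */
  }
#else
  #include <simde/x86/ssse3.h>
  static inline simde__m128i hot_abs(simde__m128i a) {
    return simde_mm_abs_epi32(a);   /* portable SIMDe path */
  }
#endif
```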
Wow, thanks for the thorough response!
Your appraisal of the different targets I mentioned matches my own. My main goals are performance and making sure my code compiles quickly and with little user effort. It sounds like using SIMDe should address both these points.
Your point about using "future" intrinsics as an extension library is helpful. This will definitely streamline using SIMDe.
Thanks for clarifying the point about OpenMP SIMD. It's a nice looking feature, but good to know that its performance isn't there at this point. Makes sense. I do use OpenMP for parallelism as opposed to vectorization.
> You'll generally leave a lot of performance on the table, but if you've already created a prototype using normal C code then adding OpenMP pragmas to a few loops is often a great way to get a significant performance boost with almost no effort.
Assuming you're talking about OpenMP SIMD here... I'm a little surprised that slapping `#pragma omp simd` on a loop would do much more than a compiler's auto-vectorization, although I don't have a mental model for what OpenMP SIMD does under the hood.
> You'll generally leave a lot of performance on the table, but if you've already created a prototype using normal C code then adding OpenMP pragmas to a few loops is often a great way to get a significant performance boost with almost no effort.
>
> Assuming you're talking about OpenMP SIMD here...
Yes, sorry I should have been clearer there :(
> I'm a little surprised that slapping `#pragma omp simd` on a loop would do much more than a compiler's auto-vectorization, although I don't have a mental model for what OpenMP SIMD does under the hood.
Generally yes, `#pragma omp simd` alone doesn't do much. On some compilers it alters the calculation of when to vectorize a loop; if the cost model would otherwise make the compiler assume that vectorization wasn't worth it, `#pragma omp simd` will encourage the compiler to vectorize anyway. AFAIK on ICC it will effectively force vectorization, whereas my understanding is that it's more of a suggestion to GCC and clang. In isolation the compilers will generally do the right thing (vectorize) on their own, so it doesn't really help with micro-benchmarks. At a higher level, though, when the compiler has a lot more code to look at and it's a bit harder for a human to keep track of, it tends to have more of an effect. Unfortunately this isn't really documented anywhere AFAIK, so it's hard to say for certain.
Of course it's not always just `#pragma omp simd`. `#pragma omp simd reduction(...)` can be really important, as can `#pragma omp simd safelen(...)` (honestly `safelen` doesn't come up in SIMDe very often, but if you were writing your code with OpenMP SIMD annotations it is often pretty critical).
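For example, here's a sketch (a made-up `dot` function, not SIMDe code) of a loop where the `reduction(...)` clause matters:

```c
#include <stddef.h>

/* Hypothetical example: without reduction(+:sum), the loop-carried
 * dependence on sum can prevent the compiler from vectorizing at all. */
float dot(const float *a, const float *b, size_t n) {
  float sum = 0.0f;
  #pragma omp simd reduction(+:sum)
  for (size_t i = 0; i < n; i++) {
    sum += a[i] * b[i];
  }
  return sum;
}
```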
It's also worth noting that we don't just blindly use `#pragma omp simd` everywhere either... we actually test to see if OpenMP is enabled (see the OpenMP 4 SIMD section of the README), and if not we try to use compiler-specific pragmas like `#pragma clang loop vectorize(enable)`. OpenMP SIMD is the preferred option, though, so if your compiler supports it that's generally the way to go.
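The general shape of that kind of fallback is something like this (illustrative only; `MY_VECTORIZE_LOOP` is a made-up name, not SIMDe's actual internal macro):

```c
/* Illustrative sketch of pragma fallback selection: prefer OpenMP SIMD
 * when it's enabled, otherwise try a compiler-specific loop pragma. */
#if defined(_OPENMP)
  #define MY_VECTORIZE_LOOP _Pragma("omp simd")
#elif defined(__clang__)
  #define MY_VECTORIZE_LOOP _Pragma("clang loop vectorize(enable)")
#else
  #define MY_VECTORIZE_LOOP
#endif
```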
FWIW, the Performance Tuning page in the wiki is probably also worth a skim.
@nemequ, thanks for the elaborate post above!
> there are a lot of SIMD instructions which don't have OpenCL counterparts
>
> since it's expensive to hand data off to a GPU (or other coprocessor) you're really not supposed to be constantly jumping in and out of OpenCL code

Are those points also valid for Vulkan (and D3D12, Metal, WebGPU) and CUDA (and AMD ROCm, Intel oneAPI)?
> Are those points also valid for Vulkan (and D3D12, Metal, WebGPU) and CUDA (and AMD ROCm, Intel oneAPI)?

Yes