simd-everywhere / simde

Implementations of SIMD instruction sets for systems which don't natively support them.
https://simd-everywhere.github.io/blog/
MIT License

Related project #624

Closed · p0nce closed this 3 years ago

p0nce commented 3 years ago

intel-intrinsics is an implementation of the Intel intrinsics, but for the D programming language. It is surprisingly similar to simde in goals, except much less advanced, as I only do MMX/SSE/SSE2/SSE3 for x86 / x86_64 / ARMv7 / ARMv8 (so: about 400 intrinsics). It only supports the x86 Intel intrinsics as the API, not the NEON intrinsics. Like you, I've discovered it's a lot of work ^^.

Ref: https://github.com/AuburnSounds/intel-intrinsics

mr-c commented 3 years ago

@p0nce Cool! Is there anything we can do so that you all can reuse SIMDe?

SIMDe is MIT licensed, so it should be easy to bring code over (with attribution, please!)

p0nce commented 3 years ago

The only place where we used simde code was for _mm_madd_epi16 ARM version.

simde commit: https://github.com/simd-everywhere/simde/blob/2f88a0ce46325080957d78c7fe773aa2f55e1394/simde/x86/sse2.h#L1691
intel-intrinsics: https://github.com/AuburnSounds/intel-intrinsics/commit/10a93485f200c017e34c6d4d57849539a268e01e

I don't really feel like imposing binary redistribution licences such as MIT/BSD on dependent projects, so I will remove that piece of code (unless you can relicense this piece of code under Boost 1.0, or give explicit permission). @nemequ

As for Boost vs MIT, the compatibility of the two licences unfortunately seems more difficult than I was expecting: https://law.stackexchange.com/questions/91/is-there-any-difference-in-meaning-between-the-boost-and-mit-software-licenses

Sad, as simde source code is super useful.

nemequ commented 3 years ago

I can't really change the license; I don't own all the code. Some has been contributed by others, and they retain ownership of their code (SIMDe doesn't require copyright assignment). Also, some code was copied from SSE2NEON (also MIT, which is a large part of the reason I chose MIT for this project).

Yeah, feel free to use that under the Boost license; it looks like I was the original author, so I can relicense it.

Actually, looking at it, that could be improved on AArch64… a better NEON implementation would be:

    #if defined(SIMDE_ARM_NEON_A64V8_NATIVE)
      /* Widening 16x16 -> 32-bit multiplies of the low and high halves,
         then a single pairwise add (vpaddq_s32 is AArch64-only). */
      int32x4_t pl = vmull_s16(vget_low_s16(a_.neon_i16),  vget_low_s16(b_.neon_i16));
      int32x4_t ph = vmull_high_s16(a_.neon_i16, b_.neon_i16);
      r_.neon_i32 = vpaddq_s32(pl, ph);
    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
      /* ARMv7 lacks vmull_high_s16 and vpaddq_s32, so do the pairwise
         adds on 64-bit halves and recombine. */
      int32x4_t pl = vmull_s16(vget_low_s16(a_.neon_i16),  vget_low_s16(b_.neon_i16));
      int32x4_t ph = vmull_s16(vget_high_s16(a_.neon_i16), vget_high_s16(b_.neon_i16));
      int32x2_t rl = vpadd_s32(vget_low_s32(pl), vget_high_s32(pl));
      int32x2_t rh = vpadd_s32(vget_low_s32(ph), vget_high_s32(ph));
      r_.neon_i32 = vcombine_s32(rl, rh);
    #endif

Feel free to use that under Boost, too, of course. There is also a pretty trivial implementation (using vec_msum) on AltiVec, but AFAICT you don't care about that.
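
For reference, a minimal sketch of what that vec_msum version could look like, in the same style as the snippet above (the altivec_i16/altivec_i32 member names and the __ALTIVEC__ guard are assumptions for illustration, not necessarily SIMDe's actual internals):

    #if defined(__ALTIVEC__)
      /* vec_msum does the widening 16x16 -> 32-bit multiplies and the
         pairwise accumulation in a single instruction; start the
         accumulator at zero. */
      r_.altivec_i32 = vec_msum(a_.altivec_i16, b_.altivec_i16, vec_splats((int) 0));
    #endif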

p0nce commented 3 years ago

Thanks! Will use that second solution.

nemequ commented 3 years ago

I've added an entry to the Alternatives page on the Wiki; the description is a bit vague since I'm not really familiar with the state of SIMD in D, but feel free to add more info if you want.

nemequ commented 3 years ago

I was wondering about that; I didn't see any calls to native intrinsics, but I thought maybe that's because there was a different implementation. So the NEON functions aren't implemented using intrinsics, either, they're just wrappers for more portable code?

You say it targets x86 and ARM, but is there anything that actually ties you to those architectures? AFAICT everything is implemented using portable code.

I guess core.simd would be the equivalent of using intrinsics (specifically the functions which take an XMM opcode), but it seems like that only supports 128-bit vectors. Am I missing something?

p0nce commented 3 years ago

Haha it's complicated :)

D, without this library, provides a variety of SIMD support, depending on the compiler.

So we are using everything at our disposal to try to have something semantically equivalent to all the Intel intrinsics. It is by and large an attempt to simplify the mess that D SIMD support was before. In practice, top speed is only achieved with LDC for now, and that's where we pay attention to the Godbolt output the most.

ARM is quite difficult because the compilers don't provide many builtins.

The Intel intrinsics are really a portable API that C++ compilers support to let users ignore the compilers' differences. D compilers instead chose the less featureful core.simd. But I think the Intel intrinsics are a cultural staple in native programming somehow.

> You say it targets x86 and ARM, but is there anything that actually ties you to those architectures?

Yes, we have portable implementations of everything, but often __builtin_XXX functions are used, and with the LLVM and GCC backends those aren't portable to other architectures.

A slide from the intel-intrinsics talk discussed this.

nemequ commented 3 years ago

Eek, that does seem like quite a mess, sorry :(

I suppose there isn't anything we can do on the SIMDe side to make this any easier? You're right that our goals seem pretty similar, so it seems a shame to duplicate so much effort…

p0nce commented 3 years ago

I guess we won't change licences, so we can always exchange permissions in case a piece of code is interesting. :) As long as I have the copyright this will always be a yes from me, of course.

nemequ commented 3 years ago

Me too; I don't mind licensing code from SIMDe under Boost if I'm the original author.

If there is anything else we can do on the SIMDe side which would help you please let us know.

p0nce commented 3 years ago

Well, continue what you do ^^.

I have a question about the general efficiency of NEON versus SSE. Something I wonder is whether it's worth it to also implement the ARM NEON intrinsics as part of the public API, like simde does. I don't know if specially optimised ARM code goes much further than x86, or if you can stay close to it speed-wise using emulated x86 intrinsics. It's a bit more different than I was expecting.

nemequ commented 3 years ago

In C/C++ you can potentially see huge performance improvements by porting to another ISA extension instead of relying on a layer like SIMDe, or you might see no improvement at all.

It really comes down to which functions you are using; an emulated _mm_add_epi32 is basically going to be just as fast as a native call everywhere (vaddq_s32 on NEON, vec_add on AltiVec, wasm_i32x4_add on WebAssembly, etc.) because the semantics match up perfectly. On the other hand, most SIMD ISAs have functionality that others don't really have, which would need to be emulated using a sequence of several functions, sometimes a rather long and slow one. For example, if you want to get the sum of all elements in a vector, NEON's vaddv_* intrinsics are going to be extremely fast, but there is nothing similar on Intel so you have to string together several instructions (which ones depend on the element type).
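
To make that horizontal-sum example concrete, here is a minimal C sketch (the hsum name is just for illustration): on AArch64 it is a single intrinsic, while a common SSE2 approach needs two shuffle/add rounds.

    #include <stdint.h>
    #if defined(__aarch64__)
      #include <arm_neon.h>
      /* One instruction on AArch64: add across all lanes. */
      static int32_t hsum(int32x4_t v) { return vaddvq_s32(v); }
    #elif defined(__SSE2__)
      #include <emmintrin.h>
      /* No direct equivalent on x86; a common SSE2 sequence. */
      static int32_t hsum(__m128i v) {
        v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2))); /* add upper half */
        v = _mm_add_epi32(v, _mm_shuffle_epi32(v, _MM_SHUFFLE(2, 3, 0, 1))); /* add adjacent lanes */
        return _mm_cvtsi128_si32(v);
      }
    #endif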

When manually porting people are likely to tailor their approach to the ISA they are targeting. For example, maybe they'll use a horizontal sum operation on NEON, but on x86 they may take another approach that doesn't require a horizontal sum. APIs like SIMDe and intel-intrinsics are really at too low a level to have many options about which approach to take; by the time people call functions at that level they're basically already locked in to an approach.

An abstraction layer at a higher level can do a better job at choosing the fastest approach, but you'll also likely be leaving some performance on the table since the abstractions they use generally don't perfectly match the underlying hardware.

Another complication is that you seem to be mostly using portable implementations and relying on the compiler to select the instruction that corresponds to that implementation. That generally works well for simpler instructions like add, subtract, etc., but for more complicated operations there is a good chance the compiler won't be smart enough to recognize that the pattern you're using matches up with a specific instruction and instead it will output a sequence of instructions anyways. In that case, my guess is that performance on NEON and x86 would be closer because they would both be relatively slow. Actually, NEON hardware probably has an advantage there since it has much better coverage of basic operations.

You can use Compiler Explorer to play around with this idea a bit if you want. For example, _mm_add_epi32 vs. _mm_maddubs_epi16.
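
As a minimal C sketch of that difference (using GCC/Clang vector extensions, so it can be pasted straight into Compiler Explorer): the plain add below is reliably pattern-matched to a single vector add, while the portable expression of _mm_madd_epi16's semantics often ends up as a multi-instruction sequence rather than one pmaddwd.

    #include <stdint.h>
    typedef int32_t v4i32 __attribute__((vector_size(16)));
    typedef int16_t v8i16 __attribute__((vector_size(16)));

    /* Trivially pattern-matched to one vector add instruction. */
    v4i32 add(v4i32 a, v4i32 b) { return a + b; }

    /* _mm_madd_epi16 semantics: widening multiply, then pairwise add.
       Compilers frequently fail to collapse this into a single pmaddwd. */
    v4i32 madd(v8i16 a, v8i16 b) {
      v4i32 r;
      for (int i = 0; i < 4; i++)
        r[i] = (int32_t) a[2 * i] * b[2 * i] + (int32_t) a[2 * i + 1] * b[2 * i + 1];
      return r;
    }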

p0nce commented 3 years ago

> Another complication is that you seem to be mostly using portable implementations and relying on the compiler to select the instruction that corresponds to that implementation. That generally works well for simpler instructions like add, subtract, etc., but for more complicated operations there is a good chance the compiler won't be smart enough to recognize that the pattern you're using matches up with a specific instruction and instead it will output a sequence of instructions anyways.

Well, actually every single such intrinsic is tested in Godbolt with LDC; we often note from which -O optimization level and LDC version (LLVM version) the right instruction starts being generated. Our goal is to explicitly use the least amount of "builtins", since they can be removed from LLVM from time to time.

What we've found is that there is hardly ever a regression in how well LLVM can infer the right instruction.

Of course this breaks down once you have multiple different backends; you'll need separate versions, especially for GCC, which is less able to infer the right instruction and needs builtins.

Though we've had one such problem lately when the instruction is generated in isolation but not in a larger function: https://github.com/ldc-developers/ldc/issues/3587

All in all I don't think there is a problem; we use whatever generates the instruction, sometimes it's dumb code.

nemequ commented 3 years ago

Oops, sorry, I somehow missed the notification for this :(

> Our goal is to explicitly use the least amount of "builtins", since they can be removed from LLVM from time to time.

I assume this applies mostly to architecture-specific builtins, not stuff like __builtin_shufflevector / __builtin_convertvector (or whatever they're called in D)? In C/C++ we're mostly lucky enough to have an abstraction layer in the form of the official APIs (SSE, AVX, NEON, AltiVec, etc.) so we don't really call the underlying intrinsics directly. We just have to smooth over any differences between the official APIs and what the compilers actually provide (i.e., sometimes a function is missing, a type is wrong, an implementation is buggy, etc.).
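
For anyone unfamiliar with those backend-agnostic builtins, a minimal C sketch (these are Clang's spellings; GCC uses __builtin_shuffle instead of __builtin_shufflevector, and has supported __builtin_convertvector since GCC 9):

    #include <stdint.h>
    typedef int32_t v4i32 __attribute__((vector_size(16)));
    typedef float   v4f32 __attribute__((vector_size(16)));

    /* Lane permutation with no architecture-specific intrinsic. */
    v4i32 reverse(v4i32 v) { return __builtin_shufflevector(v, v, 3, 2, 1, 0); }

    /* Element-wise conversion, also architecture-independent. */
    v4f32 to_float(v4i32 v) { return __builtin_convertvector(v, v4f32); }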

> What we've found is that there is hardly ever a regression in how well LLVM can infer the right instruction.

That's awesome. LLVM tends to be excellent at autovectorization on the C/C++ side, too. It even seems to usually beat ICC when I test, but I think most people have the opposite experience.

What about other compilers? For us, GCC usually does fairly well, but MSVC is terrible, and I doubt D compilers are as well-resourced so unless they're using LLVM for codegen I'd be surprised to see them handle this very well. Of course I'm not sure there is much you can do about this, other than perhaps complain to the compiler people.

> Though we've had one such problem lately when the instruction is generated in isolation but not in a larger function: ldc-developers/ldc#3587

Yep, we run into those, too. A similar issue we see is when using 256- or 512-bit functions on a target which only supports 128-bit vectors: the compiler generates a lot of extra moves for each function call, but since everything is inlined they are mostly optimized out under real conditions.
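
For illustration, emulating a wider type on a narrower target generally looks something like this minimal C sketch (the emu_* names are made up for this example, not SIMDe's actual internals); every emulated operation is a real function call whose argument/result moves only disappear once everything is inlined.

    #include <emmintrin.h>

    /* A 256-bit vector emulated as two native 128-bit halves. */
    typedef struct { __m128i lo, hi; } emu__m256i;

    static inline emu__m256i emu_mm256_add_epi32(emu__m256i a, emu__m256i b) {
      emu__m256i r;
      r.lo = _mm_add_epi32(a.lo, b.lo); /* low 128 bits */
      r.hi = _mm_add_epi32(a.hi, b.hi); /* high 128 bits */
      return r;
    }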

> All in all I don't think there is a problem; we use whatever generates the instruction, sometimes it's dumb code.

Yeah, often the most important optimization is to find a pattern the compiler recognizes. There are lots of times where I've come up with a clever optimization which should make things faster only to realize that the compiler can't recognize it and generates slower code :(

In any case, it's awesome that you're testing every function like that :)

p0nce commented 3 years ago

> In C/C++ we're mostly lucky enough to have an abstraction layer in the form of the official APIs

Yes, I dearly miss that layer; probably because it's a lot of work that the C++ compiler people put into it? I think I will never get to implementing the NEON intrinsics as a public API, for this reason.

> What about other compilers? For us, GCC usually does fairly well, but MSVC is terrible, and I doubt D compilers are as well-resourced so unless they're using LLVM for codegen I'd be surprised to see them handle this very well. Of course I'm not sure there is much you can do about this, other than perhaps complain to the compiler people.

Good question. In D you have 3 compilers:

- DMD, the reference compiler, with its own backend
- GDC, built on the GCC backend
- LDC, built on the LLVM backend

When using DUB you can easily swap compilers and build anything with --compiler ldc2|dmd|gdc. A "modern" native language's biggest contribution is, ironically, the language package manager. ^^

In my experience ICC was more awesome as a backend, always impressive, but this was a while ago when I was doing codec optimization in C++. Since LLVM's advances and GCC 5, I'm not sure how close the gap really is. LLVM sure seems to have fewer regressions or outright codegen bugs than ICC.

nemequ commented 3 years ago

> Yes, I dearly miss that layer; probably because it's a lot of work that the C++ compiler people put into it?

Well, it's really the standard way of calling the functions on the C/C++ side; the manufacturer (Intel, Arm, etc.) defines the API, and functions are generally designed to map 1:1 to an instruction. The builtins defined by different compilers are really just implementation details of how they interface with the compiler to provide that API; they're not really meant to be called directly, except for the ones which aren't backend-specific (mostly documented here, but there are a few others).

Since there are no "official" APIs for D I guess you're stuck with unofficial, un-standardized APIs.

> In my experience ICC was more awesome as a backend, always impressive, but this was a while ago when I was doing codec optimization in C++.

I think the general consensus is that ICC is still the best, and benchmarks tend to bear that out. For example, one of the big reasons Clear Linux performs so well is that they use ICC to compile everything. I don't know why I seem to get better results from clang when I test; it's probably just an anomaly…

> Since LLVM's advances and GCC 5, I'm not sure how close the gap really is.

It's still there, but ICC does have one major cheat: it defaults to fast math (i.e., -ffast-math on clang and gcc). You have to pass a flag (-fp-model=precise, IIRC) to get "correct" results on ICC, but most benchmarks don't change the default. Interestingly, ICC can actually perform significantly worse in SIMDe since we can detect when fast math is enabled in gcc/clang (there is a preprocessor macro) and use faster implementations for some functions, but on ICC there is (AFAIK) no way to detect the FP model so we default to the correct mode and you have to define SIMDE_FAST_MATH manually to get the fast implementations.
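
A minimal sketch of the detection being described, reusing the SIMDE_FAST_MATH spelling from above (GCC and Clang predefine __FAST_MATH__ when -ffast-math is active; the exact macros SIMDe checks may differ):

    /* GCC and Clang predefine __FAST_MATH__ under -ffast-math; ICC
       (AFAIK) exposes no equivalent macro, hence the manual opt-in. */
    #if defined(__FAST_MATH__) || defined(SIMDE_FAST_MATH)
      /* faster, less strict implementations */
    #else
      /* IEEE-correct implementations */
    #endif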

> LLVM sure seems to have fewer regressions or outright codegen bugs than ICC.

Or GCC. At this point I think LLVM is pretty clearly the most correct, and ICC seems to be the buggiest of the three (though GCC does provide some stiff competition to ICC sometimes). I can't be sure, but I think a lot of the credit here goes to LLVM's tooling. They have really cool stuff like Alive and Alive2 which (AFAIK) just don't seem to have any analogues in other compilers.

Most of the bugs I've found in LLVM while developing SIMDe tend to be API issues (missing functions, incorrect types, etc.) not incorrect optimizations. ICC seems to be weak on the frontend, whereas GCC's problems seem to be largely backend or optimization-related. At least in my experience.

p0nce commented 3 years ago

> Since there are no "official" APIs for D I guess you're stuck with unofficial, un-standardized APIs.

Yes, I'm the one building the standardized API over the differences. :-|

> it defaults to fast math (i.e., -ffast-math on clang and gcc).

Ouch. I really dislike this as a default. For my audio products I tried --fast-math many times with LLVM and it was roughly 1% slower every time, I don't know why. If an optimization is both dangerous (audio changes that are difficult to measure and hear) and doesn't pay off... well... I hope it works out for ICC, but I find it odd.

> ICC seems to be weak on the frontend, whereas GCC's problems seem to be largely backend or optimization-related. At least in my experience.

From memory, it was mostly the backend too. We were having one ICC backend bug every six months: some codegen bugs, some backend ICEs; it wasn't all front-end. YMMV. I've yet to see a codegen regression in LLVM; perhaps the optimization level I ask for isn't as high.