Open workingjubilee opened 3 years ago
> vs C (clang) on a given architecture for equivalent code
What's equivalent? I think this might be tricky. I remember filing https://github.com/rust-lang/stdarch/pull/1155 because the generated code in a hot loop was about 2x slower than expected. This was because it implemented the behavior of the LLVM intrinsics, which took around 3x as many instructions as the native intrinsics.
This is disappointing, since I suspect it means in practice that "portable simd" will always have a cost, and you'll be better off with the architecture-specific instructions if you can afford to write them and know that you don't have the problem cases. (My hope is that some -ffast-math-style support can bridge the gap here, but I suspect it will be quite difficult to push all the way through.)
This is different from the inlining failures you're asking about. I expect those to happen at the -Oz or -Os optimization levels in some cases, which is unfortunate and kind of tricky to address even when we find them.
It would be nice if "target" defaulted to "native" as the current default for x86_64 is for a rather ancient architecture.
The best way to make code portable, it seems, is to use conditional compilation for avx2 and other features.
I was thinking of a "go_faster!" macro that could wrap high level code and use the best features available such as avx2, avx512, sve using a combination of conditional compilation and runtime switching.
We could also wrap some of the more terrible llvm SIMD multi-instruction generics in conditional compilation wrappers. For example, round and min/max are currently multi instruction sequences.
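A minimal sketch of that compile-time plus runtime combination on x86_64; the `sum_*` kernels here are hypothetical placeholders to show the dispatch pattern, not tuned implementations:

```rust
// Dispatch wrapper: checks for AVX2 at runtime and falls back to a
// scalar path. `sum`, `sum_avx2`, and `sum_scalar` are illustrative
// names, not part of any existing API.
#[cfg(target_arch = "x86_64")]
fn sum(xs: &[f32]) -> f32 {
    if is_x86_feature_detected!("avx2") {
        // SAFETY: we just verified at runtime that AVX2 is available.
        unsafe { sum_avx2(xs) }
    } else {
        sum_scalar(xs)
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn sum(xs: &[f32]) -> f32 {
    sum_scalar(xs)
}

// With the target feature enabled on this one function, LLVM is free
// to use 256-bit instructions when compiling its body.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

fn sum_scalar(xs: &[f32]) -> f32 {
    xs.iter().sum()
}

fn main() {
    let v: Vec<f32> = (1..=8).map(|i| i as f32).collect();
    println!("{}", sum(&v)); // prints "36"
}
```

A `go_faster!`-style macro would essentially generate this boilerplate: one clone of the wrapped code per feature level, plus the runtime switch.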
> It would be nice if "target" defaulted to "native" as the current default for x86_64 is for a rather ancient architecture.
This change would be much broader than SIMD, of course, and I think it is unlikely to ever happen: by default, your code is expected to run on other machines with a similar architecture. This problem only affects x86-64, which is why clang and friends have already added x86-64 microarchitecture levels (such as x86-64-v3); that is probably the best way to handle it, similar to how it's already handled for ARMv7.
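For reference, rustc exposes those same LLVM microarchitecture levels through `-C target-cpu`. A sketch of opting a whole project in via a (hypothetical) `.cargo/config.toml`, assuming a toolchain recent enough to know the `x86-64-v3` level:

```toml
# Hypothetical .cargo/config.toml: opt into the x86-64-v3
# microarchitecture level (AVX, AVX2, BMI2, FMA, ...) instead of
# the ancient x86-64 baseline. `target-cpu=native` works too, but
# ties the binary to the build machine.
[build]
rustflags = ["-C", "target-cpu=x86-64-v3"]
```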
> The best way to make code portable, it seems, is to use conditional compilation for avx2 and other features.
> I was thinking of a "go_faster!" macro that could wrap high level code and use the best features available such as avx2, avx512, sve using a combination of conditional compilation and runtime switching.
You may be interested in my multiversion crate.
> We could also wrap some of the more terrible llvm SIMD multi-instruction generics in conditional compilation wrappers. For example, round and min/max are currently multi instruction sequences.
All of the intrinsics we use right now generate code for the target feature level in the user's crate (they are all inline functions). If anything is resulting in suboptimal codegen either it's a limitation of your target features, or it may be a bug in LLVM.
These are some examples I found in old Rust issues that seem to qualify under this problem. Interestingly, C++ may be a bigger rival than C here.
IIRC some of the SIMD dialects, and certainly LLVM, allow immediates to describe some vector patterns, so we should check whether we actually emit that asm when it is in fact const-known:
> We could also wrap some of the more terrible llvm SIMD multi-instruction generics in conditional compilation wrappers. For example, round and min/max are currently multi instruction sequences.
> All of the intrinsics we use right now generate code for the target feature level in the user's crate (they are all inline functions). If anything is resulting in suboptimal codegen either it's a limitation of your target features, or it may be a bug in LLVM.
`simd_min()`/`simd_max()` generate something like this on x86:

```asm
vminps      ymm2, ymm1, ymm0
vcmpunordps ymm0, ymm0, ymm0
vblendvps   ymm0, ymm2, ymm1, ymm0
```

in order to have the right semantics if an argument is NaN. If you just want `minps` (because you don't care about NaN or you actually want those semantics), `x.simd_lt(y).select(x, y)` generates:

```asm
vminps ymm2, ymm1, ymm0
```
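The difference between the two lowerings is visible even in scalar code: Rust's `f32::min` follows IEEE-754 minNum semantics (a NaN input is ignored), while a bare compare-and-select, which is what a lone `vminps` computes, lets NaN fall through to one side. A scalar sketch of the distinction:

```rust
// `vminps a, b` computes `if a < b { a } else { b }`: every comparison
// involving NaN is false, so the second operand falls through.
fn min_like_vminps(a: f32, b: f32) -> f32 {
    if a < b { a } else { b }
}

fn main() {
    let nan = f32::NAN;

    // f32::min (like the three-instruction SIMD sequence) returns the
    // non-NaN operand regardless of which side the NaN is on.
    assert_eq!(f32::min(nan, 1.0), 1.0);
    assert_eq!(f32::min(1.0, nan), 1.0);

    // The bare compare-select only matches when NaN is on the left.
    assert_eq!(min_like_vminps(nan, 1.0), 1.0);
    assert!(min_like_vminps(1.0, nan).is_nan());

    println!("ok");
}
```

So the extra `vcmpunordps`/`vblendvps` pair exists purely to patch up the right-hand-NaN case; if your data can't contain NaN, the single-instruction form is semantically identical.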
One thing we could use to help check the library against is examples of Rust SIMD perf... and in particular, anything that is actually a regression, especially relative to expectations. In particular, it may help motivate a solution to https://github.com/rust-lang/rust/issues/64609 if we can find examples of bad or divergent SIMD performance for Rust on a given architecture vs C (clang) on a given architecture for equivalent code. I had a conversation with compiler devs who are more familiar with the inner workings of LLVM and the compiler's SIMD machinery, and they expect LLVM to see through and properly handle the "pass through memory" trick if things are inlined. So we're looking for examples where LLVM mysteriously fails or just enough ops are done that LLVM decides inlining them all isn't practical.
This obviously is not at all the case where we just completely scalarize things, so we're ignoring #76 for the purposes of this example, and it doesn't actually have to be related to our `core::simd` implementation. Rather, it's just an overall concern: if we can cough up examples we can compare against, it would help us bench, profile, and test possible solutions. I'm also not actually limiting this to just Rust vs. C; clang just happens to be there and is also LLVM-driven. Anything where our SIMD takes a beating vs. `${LANG}` is a good example. And things where we're only on par