Open Shnatsel opened 1 year ago
Ya, been waiting for these to show up, but probably just going to have to PR them myself like I'm doing for the F16C stabilization.
Did some digging. It looks like the holdup for ARM conversions is that the ARM code all relies on the simd_cast LLVM intrinsic rather than the ARM intrinsics directly, and the LLVM simd_cast relies on proper typing to use the right hardware intrinsics. Without a builtin f16 type, float16x4_t doesn't have a way to indicate the right cast type for simd_cast to implement the ARM intrinsics.
So it would be a completely different implementation that accessed the LLVM hardware intrinsics directly without going through the easier simd_cast. It looks like the relevant LLVM intrinsics are llvm.arm.vcvt[b,t].f32.f16 and llvm.arm.vcvt[b,t].f16.f32. Not sure of the difference between the b/t variants. But it's definitely not as straightforward as just doing the same as the other conversions.
I think the only reasonable option in the short term is to bypass the intrinsics and use inline assembly to get to these instructions.
Inline assembly on ARM has been stable since Rust 1.59, so this will not even require a nightly compiler.
is_aarch64_feature_detected has been stable since 1.60, so looks like with inline assembly this could be implemented in stable today, without waiting on any std or compiler features.
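To make the idea concrete, here is a minimal sketch of what an inline-assembly conversion with runtime detection could look like on stable Rust. The function names (f16_bits_to_f32, f16_bits_to_f32_asm, f16_bits_to_f32_soft) are hypothetical and not taken from the crate, and gating the asm path on the "fp16" feature is an assumption for illustration. Other targets, including 32-bit ARM, fall through to a plain software conversion.

```rust
// Hypothetical sketch: convert IEEE 754 half-precision bits to f32.
// On AArch64 the FCVT instruction is reached through inline assembly;
// every other target uses the software fallback below.

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "fp16")]
unsafe fn f16_bits_to_f32_asm(bits: u16) -> f32 {
    use std::arch::asm;
    let result: f32;
    // FCVT Sd, Hn widens one half-precision value to single precision.
    asm!(
        "fcvt {0:s}, {1:h}",
        out(vreg) result,
        in(vreg) bits,
        options(pure, nomem, nostack, preserves_flags)
    );
    result
}

/// Portable fallback that decodes the bit pattern by hand.
fn f16_bits_to_f32_soft(bits: u16) -> f32 {
    let sign = u32::from(bits >> 15) << 31;
    let exp = u32::from((bits >> 10) & 0x1f);
    let man = u32::from(bits & 0x3ff);
    match exp {
        // Zero or subnormal: the value is man * 2^-24, which is exact in f32.
        0 => {
            let magnitude = man as f32 / 16_777_216.0;
            if sign != 0 { -magnitude } else { magnitude }
        }
        // Infinity or NaN: all-ones exponent, payload shifted into place.
        0x1f => f32::from_bits(sign | 0x7f80_0000 | (man << 13)),
        // Normal number: rebias the exponent from 15 to 127.
        _ => f32::from_bits(sign | ((exp + 112) << 23) | (man << 13)),
    }
}

/// Dispatches to the hardware path when it is available at runtime.
pub fn f16_bits_to_f32(bits: u16) -> f32 {
    #[cfg(target_arch = "aarch64")]
    {
        // Assumption: the asm path is only taken when "fp16" is detected.
        if std::arch::is_aarch64_feature_detected!("fp16") {
            // SAFETY: the required target feature was just detected.
            return unsafe { f16_bits_to_f32_asm(bits) };
        }
    }
    f16_bits_to_f32_soft(bits)
}
```

Calling f16_bits_to_f32(0x3C00) should give 1.0 on either path. A real implementation would probably cache the detection result or batch the conversions rather than checking per value.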
Interesting, did not realize that. Though 32-bit ARM detection is NOT stable (so could not detect the "neon" feature set for this), so there would be some disparity, but it's definitely worth getting this working.
I've implemented the assembly conversions in the aarch64-intrinsics branch, and it's passing tests on CI, although I don't have access to an ARM machine to benchmark it myself. It's currently still nightly-only; I'll try to get it working on stable Rust before merging the branch.
Amazing! That was quick!
As for benchmarking, a number of public clouds provide ARM machines and free compute credits upon signup.
For example, Google Cloud (full disclosure: they're my employer, I may be biased) provides $300 in free credit, and they have ARM machines. That should be more than enough to test and benchmark this on real hardware.
Also, I have access to an ARM machine, and I could run the benchmarks you specify.
Note that the criterion benchmarks in the repository lack black-boxing, which may make them unrepresentative of real-world performance. Both the input and output should be wrapped in std::hint::black_box. You can find more info here.
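As a rough illustration of the black-boxing point, the sketch below wraps both the input and the output of the measured call in std::hint::black_box. It uses a plain std::time::Instant loop instead of criterion to stay self-contained, and time_black_boxed is a hypothetical helper, not something from the repository.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Times `iters` calls of `f` on `input`, hiding both the input and the
/// output from the optimizer so the work cannot be hoisted or discarded.
fn time_black_boxed<T: Copy, R>(
    input: T,
    iters: u32,
    mut f: impl FnMut(T) -> R,
) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        // black_box on the input stops constant folding of the call.
        let out = f(black_box(input));
        // black_box on the output stops dead-code elimination of the result.
        black_box(out);
    }
    start.elapsed()
}
```

In a criterion benchmark the same pattern would go inside the closure passed to the bencher, with black_box around the argument and around the returned value.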
Merged the aarch64 branch, so the main branch now has AArch64 hardware support on stable Rust. It also includes arithmetic hardware operations on f16. Because it requires an MSRV bump, it'll be a bit before crates.io gets a new release, given the MSRV policy for this crate requires a major version bump. Will probably wait until Rust 1.69, when x86 F16C support should be stabilized.
Leaving issue open for future possible 32-bit ARM support
I assume the upcoming release will also enable the use-intrinsics feature by default, or get rid of it entirely?
Feature is still there until the F16C stuff gets stabilized. Will probably remove it after that, yes.
the MSRV policy for this crate requires a major version bump.
This is unfortunate because all dependents that re-export the type or expose it in the public API at all, such as the exr crate, will also have to make a breaking change to get these improvements. So a major version bump in half will force many other dependents to make a major version bump themselves, which is undesirable.
Worse, the major version bumps would also have to be carried out in sync across all dependents, or they will lose interoperability with each other. The ecosystem had it already with futures crate versions 0.1 and 0.3, and it was Not Fun.
Since no breaking changes are actually made other than MSRV, may I instead suggest keeping the use-intrinsics feature off by default for a while longer, and turning it on by default a few releases after stabilization with a minor version bump? That gives control over the MSRV to the end user who's actually tuning these features, instead of forcing crate maintainers to pick either the version with or without intrinsics and splitting the ecosystem.
The MSRV policy is due to a previous minor version bump with an MSRV bump that caused compile failures in downstream crates. It's kind of tough either way, with either choice causing downstream issues. I'll see what I can do about phasing out use-intrinsics in the way you mentioned; that may work in this case. I also want to see if I can finagle a deprecation warning on the use-intrinsics feature somehow, which could help with this issue.
Now that 1.70 is out with the x86 f16c intrinsics stabilized, I think I'll just do the minor release with a full MSRV bump and all the hardware support automatically enabled, deprecating use-intrinsics, and just pull the bandage off all at once.
Not sure if this is part of this issue, but aarch64 has bf16/svebf16 features, which are currently not used by this crate?
ARM has provided intrinsics to convert from f32 to f16 since ARMv8; see e.g. VCVT-F16-F32.
Unfortunately the Rust standard library does not implement this intrinsic yet, even though it does implement lots of similar ones - e.g. vcvt_f64_f32.
Adding support for this intrinsic in the standard library should be fairly trivial, since all the groundwork is already laid out.
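For comparison, this is roughly how the already-available vcvt_f64_f32 intrinsic can be called from stable Rust; an f16 counterpart would presumably be used the same way once it exists. The wrapper name widen_pair is hypothetical, and the non-AArch64 branch is only there so the sketch builds on other targets.

```rust
/// Widens two f32 lanes to f64 using the NEON intrinsic on AArch64.
#[cfg(target_arch = "aarch64")]
fn widen_pair(input: [f32; 2]) -> [f64; 2] {
    use std::arch::aarch64::{vcvt_f64_f32, vld1_f32, vst1q_f64};
    let mut out = [0.0f64; 2];
    // SAFETY: assumes NEON is available, as it is on the common aarch64
    // targets; both pointers refer to arrays with exactly the lane count
    // that the load and store touch.
    unsafe {
        let lanes = vld1_f32(input.as_ptr());
        vst1q_f64(out.as_mut_ptr(), vcvt_f64_f32(lanes));
    }
    out
}

/// Scalar stand-in for targets without the AArch64 intrinsics.
#[cfg(not(target_arch = "aarch64"))]
fn widen_pair(input: [f32; 2]) -> [f64; 2] {
    [f64::from(input[0]), f64::from(input[1])]
}
```

Both branches should return the same values, since widening f32 to f64 is exact.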