Add support for ARM hardware intrinsics

Shnatsel commented 1 year ago

ARM provides intrinsics to convert from f32 to f16 since ARMv8, see e.g. VCVT-F16-F32

Unfortunately the Rust standard library does not implement this intrinsic yet, even though it does implement lots of similar ones - e.g. vcvt_f64_f32.

Adding support for this intrinsic in the standard library should be fairly trivial, since all the groundwork is already laid out.

starkat99 commented 1 year ago

Ya, been waiting for these to show up, but probably just going to have to PR them myself like I'm doing for the F16C stabilization.

starkat99 commented 1 year ago

Did some digging. It looks like the holdup for ARM conversions is that the ARM code all relies on simd_cast LLVM intrinsic rather than the ARM intrinsics directly, and the LLVM simd_cast relies on proper typing to use the right hardware intrinsics. Without a builtin f16 type, float16x4_t doesn't have a way to indicate the right cast type for simd_cast to implement the ARM intrinsics.

So it would be a completely different implementation that accessed the LLVM hardware intrinsics directly without going through the easier simd_cast. It looks like relevant LLVM intrinsics are llvm.arm.vcvt[b,t].f32.f16 and llvm.arm.vcvt[b,t].f16.f32. Not sure the difference between b/t variants. But it's definitely not as straightforward as just doing same as other conversions.

Shnatsel commented 1 year ago

I think the only reasonable option in the short term is to bypass the intrinsics and use inline assembly to get to these instructions.

Inline assembly on ARM has been stable since Rust 1.59, so this will not even require a nightly compiler.

Shnatsel commented 1 year ago

is_aarch64_feature_detected has been stable since 1.60, so looks like with inline assembly this could be implemented in stable today, without waiting on any std or compiler features.

starkat99 commented 1 year ago

Interesting, did not realize that. Though 32-bit ARM detection is NOT stable (so could not detect "neon" feature set for this) so there would be some disparity but definitely worth getting this working.

starkat99 commented 1 year ago

I've implemented the assembly conversions in the aarch64-intrinsics branch, and it's passing tests on CI, although I don't have access to an ARM machine to benchmark it myself. Currently still nightly toolchain only, I'll try to get it working on stable rust before merging the branch.

Shnatsel commented 1 year ago

Amazing! That was quick!

As for benchmarking, a number of public clouds provide ARM machines and free compute credits upon signup.

For example, Google Cloud (full disclosure: they're my employer, I may be biased) provides $300 in free credit, and they have ARM machines. That should be more than enough to test and benchmark this on real hardware.

Shnatsel commented 1 year ago

Also, I have access to an ARM machine, and I could run the benchmarks you specify.

Note that the criterion benchmarks in the repository lack black-boxing, which may make them not representative of the real-world performance. Both the input and output should be wrapped in std::hint::black_box. You can find more info here.

starkat99 commented 1 year ago

Merged the aarch64 branch, so main branch now has AArch64 hardware support on Stable Rust now. Also includes arithmetic hardware operations on f16 now too. Because it requires a MSRV bump, it'll be a bit before crates gets a new release, given the MSRV policy for this crate requires a major version bump. Will probably wait until the 1.69 when x86 F16C support should be stabilized.

Leaving issue open for future possible 32-bit ARM support

Shnatsel commented 1 year ago

I assume the upcoming release also enabled the use-intrinsics feature by default, or get rid of it entirely?

starkat99 commented 1 year ago

Feature is still there until the F16C stuff gets stabilized. Will probably remove after yes

Shnatsel commented 1 year ago

the MSRV policy for this crate requires a major version bump.

This is unfortunate because all dependents that re-export the type or expose it in the public API at all, such as the exr crate, will also have to make a breaking change to get these improvements. So a major version bump in half will force many other dependents to make a major version bump themselves, which is undesirable.

Worse, the major version bumps would also have to be carried out in sync across all dependents, or they will lose interoperability with each other. The ecosystem had it already with futures crate versions 0.1 and 0.3, and it was Not Fun.

Since no breaking changes are actually made other than MSRV, may I instead suggest keeping the use-intrinsics feature off by default for a while longer, and turning it on by default a few releases after stabilization with a minor version bump? That gives control over the MSRV to the end user who's actually tuning these features, instead of forcing crate maintainers to pick either the version with or without intrinstics and splitting the ecosystem.

starkat99 commented 1 year ago

The MSRV policy is due to a previous minor version bump with MSRV bump that caused compile failures in downstream creates. It's kind of tough either way, with either choice causing downstream issues. I'll see what I can do about phasing out use-intrinsics in the way you mentioned, that may work in this case. I also want to see if I can finangle a deprecation warning on the use-intrinsics feature somehow, which could help with this issue.

starkat99 commented 1 year ago

Now that 1.70 is out with the x86 f16c intrinsics stabilized, I think I'll just do the minor release with a full MSRV bump and all the hardware support automatically enabled, deprecating use-intrinsics, and just pull the bandage off all at once.

utkarshgupta137 commented 4 months ago

Not sure if this is part of this issue, but aarch64 has bf16/svebf16 features, which are currently not used by this crate?

starkat99 / half-rs

Add support for ARM hardware intrinsics #63