tarcieri / micromath

Embedded Rust arithmetic, 2D/3D vector, and statistics library
Apache License 2.0
402 stars · 21 forks

Add option to compile directly to llvm intrinsics #97

Closed Finomnis closed 1 year ago

Finomnis commented 1 year ago

Many targets, like thumbv7em-none-eabihf have a hardware floating point unit which supports operations like f32::abs() or f32::sqrt(). Using those would improve performance and reduce code footprint.

Sadly, in no-std, those are currently only obtainable through the unstable core::intrinsics API. A discussion about adding them to stable exists, but it has been stale for quite a while now. Still, for maximum performance, it would be worth adding an unstable feature flag (e.g. experimental_intrinsics or similar) which compiles directly to intrinsics.

This would work the following way: an LLVM intrinsic can be lowered in three different ways, depending on the circumstances. Taking sqrtf32() as an example, LLVM can fold it into a constant at compile time, lower it to a single hardware instruction (e.g. vsqrt.f32 on targets with an FPU), or fall back to a call to the corresponding libm symbol (sqrtf).

Advantages:

Disadvantages:

If this sparks interest from the crate developer, I might start an implementation and open a PR. If not, I'm fine with having this issue closed.
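A minimal sketch of what such an opt-in backend could look like (the feature name `intrinsics` and the function shape are assumptions, not micromath API; with the feature off, as here, a portable fallback is used, so the sketch compiles on stable):

```rust
// Hypothetical `intrinsics` cargo feature: nightly users would opt in to
// have LLVM lower sqrt directly (constant fold, hardware instruction,
// or libm call, depending on the target).
#[cfg(feature = "intrinsics")]
fn sqrt(x: f32) -> f32 {
    // Nightly only: requires #![feature(core_intrinsics)]
    unsafe { core::intrinsics::sqrtf32(x) }
}

// Portable fallback used when the feature is disabled. std's sqrt stands
// in here for illustration; micromath would use its own approximation.
#[cfg(not(feature = "intrinsics"))]
fn sqrt(x: f32) -> f32 {
    x.sqrt()
}

fn main() {
    assert!((sqrt(2.0) - core::f32::consts::SQRT_2).abs() < 1e-6);
    println!("sqrt(2.0) = {}", sqrt(2.0));
}
```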

tarcieri commented 1 year ago

We can use unstable core intrinsics (I'd prefer a nightly feature for that, though keeping up with nightly changes is always a bit painful), but if the goal is primarily optimizing thumbv7em-none-eabihf, it seems like inline assembly would be a stable alternative?
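For illustration, the inline-assembly route could look roughly like this on stable Rust (a sketch, not project code; the `hw_sqrt` name is made up, and the non-ARM fallback exists only so the example runs anywhere):

```rust
// On Cortex-M parts with an FPU, vsqrt.f32 computes the square root in a
// single instruction. core::arch::asm! is stable since Rust 1.59.
#[cfg(all(target_arch = "arm", target_feature = "vfp2"))]
fn hw_sqrt(x: f32) -> f32 {
    let out: f32;
    unsafe {
        core::arch::asm!("vsqrt.f32 {o}, {i}", o = out(sreg) out, i = in(sreg) x);
    }
    out
}

// Host fallback for illustration only.
#[cfg(not(all(target_arch = "arm", target_feature = "vfp2")))]
fn hw_sqrt(x: f32) -> f32 {
    x.sqrt()
}

fn main() {
    assert!((hw_sqrt(9.0) - 3.0).abs() < 1e-6);
    println!("ok");
}
```

As noted below, the drawback is that LLVM treats an asm block as opaque, so it cannot constant-fold or otherwise optimize through it.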

Finomnis commented 1 year ago

if the goal is primarily optimizing thumbv7em-none-eabihf, it seems like inline assembly would be a stable alternative?

Well yes, that is true, but implementing it via intrinsics would make this compatible with all boards, not just this one. And inline assembly isn't optimizable by LLVM, so things like compile-time evaluation wouldn't happen.

This is my current solution in a project:


```rust
pub trait F32Ext {
    /// Sine
    fn sin(self) -> Self;
    /// Cosine
    fn cos(self) -> Self;
}

// If LLVM lowers the intrinsic to a libm call instead of a hardware
// instruction, these symbols serve as the fallback, backed by micromath.
#[no_mangle]
extern "C" fn sinf(val: f32) -> f32 {
    micromath::F32(val).sin().0
}

#[no_mangle]
extern "C" fn cosf(val: f32) -> f32 {
    micromath::F32(val).cos().0
}

impl F32Ext for f32 {
    fn sin(self) -> f32 {
        // Nightly only: requires #![feature(core_intrinsics)]
        unsafe { core::intrinsics::sinf32(self) }
    }

    fn cos(self) -> f32 {
        unsafe { core::intrinsics::cosf32(self) }
    }
}
```

I figured surely I'm not the only person who has come across this issue, so maybe it's worth including something like this in the library :)

But if you think it's too much effort to maintain, I'll just keep it in my local project.

tarcieri commented 1 year ago

Well yes, that is true, but by implementing it via intrinsics would make this compatible with all boards, not just this one

Shouldn't the optimizations work on all boards with a thumbv7em-none-eabihf MCU? I don't see why the assembly would vary by board.

And inline assembly isn't optimizable via LLVM, so things like compile time evaluation wouldn't happen.

While that's a minor drawback, it's one I think is outweighed by the nightly-vs-stable tradeoff. Nightly is a big maintenance burden because it's a moving target, which means things may break from nightly release to release.

Finomnis commented 1 year ago

Shouldn't the optimizations work on all boards with a thumbv7em-none-eabihf MCU? I don't see why the assembly would vary by board.

I might have misworded it; I meant it's compatible with all targets then, not just thumbv7em-none-eabihf. Basically every other target would then also automatically choose either a hardware implementation or micromath as a fallback, instead of using the micromath implementation by default. Even other micromath functions that depend on each other would benefit if some of the calls were hardware accelerated.

While that's a minor drawback, it's one I think is outweighed by nightly vs stable. nightly is a big maintenance burden because it's a moving target, which means things may break from nightly-release-to-release.

And what if we add a description to the flag stating that we give zero stability guarantees? The rest of the project (if the feature isn't enabled) would of course stay compatible with stable.

Finomnis commented 1 year ago

Of course, if you say it's too much of a hassle, I might release it as a separate package.

tarcieri commented 1 year ago

I have nightly features on other projects and they can be quite annoying. Even if the toolchain is pinned, there can be sporadic breakages (e.g. an ICE).

This crate is low-maintenance enough, however, that I guess I'd be OK with it, so long as the nightly version were pinned and we don't get frequent requests to bump it and cut releases due to breakages, like I experience with other projects.

Finomnis commented 1 year ago

I started by doing some benchmarks, and it seems like only abs and sqrt map to hardware instructions on thumbv7em-none-eabihf. And sadly, it barely makes a difference.

(on Teensy 4.0)

```
===== Micromath Benchmark =====
Git Version: 9a7492b-modified

All values in ns/iter.

            micromath       libm intrinsics
abs              16.6       16.6       16.7
acos            178.0      120.3
asin            110.1      139.6
atan             85.0       78.5
atan_norm        76.8
ceil             48.4       58.1
cos             105.0     1668.0
exp             233.5      116.8
floor            38.5       61.4
fract            21.6
inv              16.6
invsqrt          16.6
ln              161.8      157.4
log2            166.9      177.6
log10           166.8      174.2
round            36.6
sin             110.0     1707.4
sqrt             53.4      491.8       38.4
tan             160.1     2560.6
trunc            26.3       45.0
```

There would still be the advantages of being compile-time optimizable and of being more accurate. But it seems micromath is already well optimized. Kudos.
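The on-target benchmark reads a cycle counter on the Teensy; a rough host-side analogue of such a ns/iter timing loop might look like this (an assumed sketch of the general shape, not the actual harness):

```rust
use std::time::Instant;

// Time one single-parameter f32 function, reporting average ns per call.
fn bench(name: &str, f: fn(f32) -> f32) {
    const ITERS: u32 = 1_000_000;
    let mut acc = 0.0f32; // accumulate results so the calls aren't optimized away
    let start = Instant::now();
    for i in 0..ITERS {
        // vary the input slightly so the call can't be hoisted out of the loop
        acc += f(1.0 + (i & 7) as f32 * 0.01);
    }
    let ns = start.elapsed().as_nanos() as f64 / ITERS as f64;
    println!("{name:>6} {ns:8.2} ns/iter (acc = {acc})");
}

fn main() {
    bench("sqrt", f32::sqrt);
    bench("sin", f32::sin);
}
```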

Finomnis commented 1 year ago

@tarcieri Should I create a PR with those benchmarks? It only includes the single-parameter functions so far, and I probably won't add more, since I won't be using intrinsics myself anymore now that I know how small the gain is.

Either way, here they are in case you are curious: https://github.com/Finomnis/micromath/tree/teensy40_benchmark/examples/benchmark_teensy40

tarcieri commented 1 year ago

Yes sure, that'd be great

Finomnis commented 1 year ago

@tarcieri #98

Finomnis commented 1 year ago

@tarcieri Btw, it seems that some of the libm implementations are faster than ours, at least on the Teensy 4.0. Might be worth investigating. The biggest difference seems to be acos, although I'm only running it repeatedly on a single input value, so I'm not sure how input-dependent the timing is.

Finomnis commented 1 year ago

Will close this for now as I won't continue working on it.

tarcieri commented 1 year ago

it seems that some of the libm implementations are faster than ours

Yeah, that's an interesting data point, although this library optimizes for code size over performance.

I can investigate whether there are any implementations which are both faster and relatively similar in terms of code size, and potentially adopt those.