rust-lang / portable-simd

The testing ground for the future of portable SIMD in Rust

Aligning std::simd and Rust on Arm v7 Neon float behavior #439

Open workingjubilee opened 2 months ago

workingjubilee commented 2 months ago

This is going to be a bit grisly: the Arm v7 Neon registers flush subnormals to zero, and Rust has defined its float semantics such that flushing subnormals is not a valid behavior. If we want std::simd to align with scalar ops here, we will unfortunately have to more or less throw out the vector ops for non-integer operations.
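A minimal sketch of the divergence, assuming the vector multiply really is lowered to a NEON instruction on an armv7 target (on such hardware the vector lane may come out as 0.0 while the scalar result must stay subnormal):

```rust
#![feature(portable_simd)]
use std::simd::Simd;

fn main() {
    // Smallest positive f32, a subnormal value.
    let tiny = f32::from_bits(1); // ~1.4e-45

    // Scalar semantics: Rust guarantees multiplication by 1.0 preserves the
    // subnormal, so `scalar` is still `tiny`.
    let scalar = tiny * 1.0;

    // Vector semantics: if this lowers to an Arm v7 NEON multiply, the
    // hardware flushes subnormal inputs/results to zero, so the lane can be 0.0.
    let vector = Simd::<f32, 4>::splat(tiny) * Simd::splat(1.0);

    println!("scalar = {scalar:e}, vector lane 0 = {:e}", vector[0]);
}
```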

Meta

rustc --version --verbose:

rustc 1.83.0-nightly (0ee7cb5e3 2024-09-10)
binary: rustc
commit-hash: 0ee7cb5e3633502d9a90a85c3c367eccd59a0aba
commit-date: 2024-09-10
host: x86_64-unknown-linux-gnu
release: 1.83.0-nightly
LLVM version: 19.1.0
RalfJung commented 2 months ago

This seems like basically the same issue as https://github.com/rust-lang/rust/issues/129880, but might be worth tracking in this repo as well I guess?

I guess stdarch is also affected, but arguably there it is okay to expose the underlying hardware behavior... that is, assuming we don't get unsoundness due to https://github.com/llvm/llvm-project/issues/89885.
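For illustration, a core::arch-level wrapper might look roughly like the sketch below; the intrinsic there is explicitly target-specific, so hardware flush-to-zero behavior is arguably part of its contract (note the 32-bit ARM NEON intrinsics are still unstable on nightly, so this is a sketch only):

```rust
// Illustrative only: on 32-bit ARM the NEON intrinsics in core::arch::arm
// are still behind an unstable feature gate.
#[cfg(all(target_arch = "arm", target_feature = "neon"))]
unsafe fn add(
    a: core::arch::arm::float32x4_t,
    b: core::arch::arm::float32x4_t,
) -> core::arch::arm::float32x4_t {
    // Target-specific intrinsic: whatever the hardware does with subnormals
    // (flushing them to zero on Arm v7) is simply what you get.
    core::arch::arm::vaddq_f32(a, b)
}
```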

workingjubilee commented 2 months ago

@RalfJung It has particular considerations for our API design, yes.

DemiMarie commented 1 week ago

I don’t think it makes sense to expect vector operations to have defined subnormal behavior. There is too much hardware where perfect IEEE conformance is either impossible or requires software support code. Making flushing subnormals to zero permissible behavior is the only approach that allows for predictable runtime performance and predictable lowering to target-specific assembly.

RalfJung commented 1 week ago

Unfortunately LLVM is unsound on hardware that flushes subnormals.

> predictable runtime performance and predictable lowering

And completely unpredictable runtime behavior. Great.

workingjubilee commented 1 week ago

@DemiMarie Easily done; all it needs is a small fix in LLVM IR and SelectionDAG: https://github.com/llvm/llvm-project/issues/30633

calebzulawski commented 1 week ago

Is it unpredictable because of reordering? I don't see what can be accomplished that doesn't make std::simd useless on armv7 or ppc, other than allowing ftz.

RalfJung commented 1 week ago

It is unpredictable in the sense of giving different results on different targets, and (depending on what semantics LLVM implements once they properly support NEON on 32-bit ARM, which they currently do not) at different optimization levels and for different ways of writing the same code.

calebzulawski commented 1 week ago

Considering these are old targets, I'm not expecting a huge push to fix the backends, but would simply disallowing certain optimizations be sufficient? We do note in the std::simd docs that ftz will happen on some targets. We could e.g. expose a cfg value if necessary.
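As a sketch of what consuming such a cfg could look like (the cfg name below is hypothetical; nothing like it exists today):

```rust
#![feature(portable_simd)]
use std::simd::{num::SimdFloat, Simd};

// `simd_flush_subnormals` is a made-up cfg, used only for illustration.
pub fn dot(a: [f32; 4], b: [f32; 4]) -> f32 {
    if cfg!(simd_flush_subnormals) {
        // On targets where vector ops may flush subnormals, fall back to
        // scalar arithmetic, which Rust guarantees never flushes.
        a.iter().zip(b).map(|(x, y)| x * y).sum()
    } else {
        (Simd::from_array(a) * Simd::from_array(b)).reduce_sum()
    }
}
```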

RalfJung commented 1 week ago

I mean we could try to disable the scalar evolution pass and hope that this suffices. But that's far from a robust solution, so it's not really aligned with Rust's values IMO.

RalfJung commented 1 week ago

Anyway I think portable-simd has a lot of things to resolve before this becomes a pressing question. Right now, not even the core::arch operations are stable on ARM32.

DemiMarie commented 1 week ago

> Unfortunately LLVM is unsound on hardware that flushes subnormals.
>
> > predictable runtime performance and predictable lowering
>
> And completely unpredictable runtime behavior. Great.

This can be worked around by implementing the relevant intrinsics using LLVM inline assembly instead.
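A rough sketch of that workaround, assuming an armv7 target with NEON; the instruction selection and register clobbers are illustrative and untested:

```rust
#[cfg(all(target_arch = "arm", target_feature = "neon"))]
unsafe fn add_f32x4_via_asm(a: *const f32, b: *const f32, out: *mut f32) {
    use core::arch::asm;
    // Load, add, and store four f32 lanes entirely inside the asm block, so
    // the optimizer cannot reassociate, fold, or otherwise rewrite the math.
    asm!(
        "vld1.32 {{d0, d1}}, [{a}]",
        "vld1.32 {{d2, d3}}, [{b}]",
        "vadd.f32 q0, q0, q1",
        "vst1.32 {{d0, d1}}, [{out}]",
        a = in(reg) a,
        b = in(reg) b,
        out = in(reg) out,
        out("d0") _, out("d1") _, out("d2") _, out("d3") _,
        options(nostack),
    );
}
```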

RalfJung commented 1 week ago

That would not achieve the "predictable runtime performance" part of your goals, as the optimizer would have to treat this like a black box.

And behavior would still be unpredictable in the sense of differing across architectures. So IMO it would also be reasonable to say that portable-simd is simply not supported on 32-bit ARM, and only provide core::arch primitives where people are hopefully aware of the semantic pitfalls.

But anyway as I said, we're likely years away from this being a high-priority question. First all of the rest of the portable-simd API needs to be worked out...

DemiMarie commented 1 week ago

> That would not achieve the "predictable runtime performance" part of your goals, as the optimizer would have to treat this like a black box.

Is the optimizer actually able to usefully reason about SIMD intrinsics anyway? The optimizer can (IIUC) be informed that the operations don’t access memory and can be elided if their result is not needed. My understanding is that SIMD programmers typically use the compiler as a glorified register allocator and so don’t particularly care about other optimizations. Is this accurate?
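As a side note on the "can be informed" point: with inline asm, that information is passed via options such as `pure` and `nomem`, which let LLVM delete an unused result while the arithmetic itself stays opaque to constant folding. A minimal scalar sketch, assuming a VFP-capable 32-bit ARM target (the same options would apply to a NEON version):

```rust
#[cfg(target_arch = "arm")]
fn opaque_mul(a: f32, b: f32) -> f32 {
    let out: f32;
    unsafe {
        // `pure` + `nomem`: no side effects, no memory access, so LLVM may
        // remove the block if `out` is unused, but it cannot const-fold it.
        core::arch::asm!(
            "vmul.f32 {out}, {a}, {b}",
            a = in(sreg) a,
            b = in(sreg) b,
            out = out(sreg) out,
            options(pure, nomem, nostack),
        );
    }
    out
}
```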

RalfJung commented 1 week ago

The simd_* intrinsics, which are used for everything in portable-simd, are fully understood by LLVM and can be optimized like scalar operations. I don't know how much that matters in practice, but const-folding does seem like a useful optimization even for SIMD.
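For instance, because LLVM understands the simd_mul intrinsic behind the `*` below, it should be able to fold the whole function to a constant vector, just as it would for scalar arithmetic:

```rust
#![feature(portable_simd)]
use std::simd::Simd;

// With constant inputs, the entire vector multiply can be const-folded at -O,
// exactly like scalar code.
pub fn folded() -> Simd<f32, 4> {
    let a = Simd::from_array([1.0, 2.0, 3.0, 4.0]);
    let b = Simd::from_array([0.5, 0.5, 0.5, 0.5]);
    a * b
}
```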

DemiMarie commented 1 week ago

I think it would be better to have SIMD that cannot be constant-folded than to not have SIMD at all.