
wasm32 reduce.fmin/fmax work incorrectly for <32 x float> vectors containing NaNs #37947

Open 54aefcd4-c07d-4252-8441-723563c8826f opened 6 years ago

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago
Bugzilla Link 38599
Version trunk
OS All
CC @aemerson,@dschuff,@tlively

Extended Description

This is an attempt at reporting rust-lang-nursery/packed_simd#91 (https://github.com/rust-lang-nursery/packed_simd/issues/91).

There we have two functions, min_element and max_element, that work on packed float vectors. For <4 x float> and min, we emit the following LLVM IR:

```llvm
define float @min_4x32(<4 x float> %a) {
  %b = tail call float @llvm.experimental.vector.reduce.fmin.f32.v4f32(<4 x float> %a)
  ret float %b
}

declare float @llvm.experimental.vector.reduce.fmin.f32.v4f32(<4 x float>)
```

We have a test that initializes the first element of the vector with NaN, and all other elements with 3.0, and checks that the result is 3.0.

This test passes on wasm32-unknown-unknown for all f32 and f64 vectors except for f32x16 (512-bit-wide) vectors, where it fails for both the min and max operations. Instead of returning 3.0, these functions return NaN (the input is <16 x float> <NaN, 3.0, ..., 3.0>).
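A minimal sketch of that check in IR for the <4 x float> case (hand-written from the description above; the lane values and expected result are from the test, the function name is made up):

```llvm
; NaN in lane 0, 3.0 in all other lanes; the test expects the reduction
; to return 3.0 rather than NaN.
define float @min_test() {
  %r = call float @llvm.experimental.vector.reduce.fmin.f32.v4f32(<4 x float> <float 0x7FF8000000000000, float 3.000000e+00, float 3.000000e+00, float 3.000000e+00>)
  ret float %r
}

declare float @llvm.experimental.vector.reduce.fmin.f32.v4f32(<4 x float>)
```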

tlively commented 6 years ago

Thanks, Amara!

Gonzalo, I looked into the different behavior you're seeing with different vector types, but in my setup they all lower to a tree reduction that uses WebAssembly's min instructions, which are all NaN-propagating. I'm interested in why you're seeing non-NaN-propagating behavior for the majority of your test cases. Could you provide the .wasm produced by rustc that exhibits this behavior?

aemerson commented 6 years ago

Well, they are named experimental, so out-of-tree users should be careful not to depend on them too much.

Other targets actually expand these intrinsics back to shuffle sequences in the CodeGen/ExpandReductions.cpp pass, which is why they appear to work. Only AArch64 has implemented direct code generation support for them in any form.

For out-of-tree users, a possible workaround is to modify AArch64TTI::shouldExpandReduction to return true if the NoNaN flag isn't present on the intrinsic call. This should result in those calls being replaced with the tree reduction implementation, as sketched below.
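For illustration, a shuffle-based tree reduction of the <4 x float> case from the report could look like the following hand-written sketch (the kind of IR the expansion produces, not the pass's literal output):

```llvm
define float @min_4x32_tree(<4 x float> %a) {
  ; Fold the high half onto the low half: lanes {2,3} against lanes {0,1}.
  %h1 = shufflevector <4 x float> %a, <4 x float> undef, <4 x i32> <i32 2, i32 3, i32 undef, i32 undef>
  %c1 = fcmp olt <4 x float> %a, %h1
  %m1 = select <4 x i1> %c1, <4 x float> %a, <4 x float> %h1
  ; Fold again: lane 1 against lane 0.
  %h2 = shufflevector <4 x float> %m1, <4 x float> undef, <4 x i32> <i32 1, i32 undef, i32 undef, i32 undef>
  %c2 = fcmp olt <4 x float> %m1, %h2
  %m2 = select <4 x i1> %c2, <4 x float> %m1, <4 x float> %h2
  ; The reduced minimum ends up in lane 0. Note that the NaN behavior still
  ; depends on the compare predicate chosen for each step.
  %r = extractelement <4 x float> %m2, i32 0
  ret float %r
}
```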

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

> By the last point, I of course mean that long term we want intrinsics without NoNaN flags to be code-generated to shuffles.

The problem is that many backends (x86, x86_64, ppc, ppc64, mips, mips64, wasm32, sparc64, ...) do already polyfill these intrinsics without the NoNaN flag, making whatever semantics they implement the "de facto semantics" of these intrinsics, because the LangRef does not state any of this.

Until now, I actually thought that the AArch64 backend was just extremely buggy, because LLVM just ICE'd when using these intrinsics without NoNaN while all other backends handle them "just fine" =/

aemerson commented 6 years ago

By the last point, I of course mean that long term we want intrinsics without NoNaN flags to be code-generated to shuffles.

aemerson commented 6 years ago

Support for code generating these intrinsics without the NoNaN flag present hasn't been implemented yet. There is no native reduction instruction for this operation, so for AArch64, TTI::useReductionIntrinsic() returns false if the NoNaN flag isn't present. If I run the IR test case with llc, I get an assertion failure:

```
Assertion failed: (Op->getFlags().hasNoNaNs() && "fmin vector reduction needs NoNaN flag"), function LowerVECREDUCE
```
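For reference, a sketch of the form that does lower today: the nnan fast-math flag on the call is what LowerVECREDUCE's assertion checks for.

```llvm
; With nnan present, AArch64 can select a native reduction; without it,
; llc hits the assertion quoted above.
define float @min_4x32_nnan(<4 x float> %a) {
  %b = call nnan float @llvm.experimental.vector.reduce.fmin.f32.v4f32(<4 x float> %a)
  ret float %b
}

declare float @llvm.experimental.vector.reduce.fmin.f32.v4f32(<4 x float>)
```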

The intended way to use these intrinsics at the moment is via the API in LoopUtils.cpp: the createTargetReduction() or createSimpleTargetReduction() helpers, depending on whether you have a reduction descriptor available.

Long term, we want these intrinsics to expand to shuffle sequences at codegen time, instead of having to ask the target which reduction representation it supports. I haven't gotten around to it yet, though.

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

Forgot to add: on most targets, the best way to implement a min/max reduction is a tree reduction. IIRC an fcmp+select just returns the first or second operand when one of them is a NaN, but the minimum and minimumNumber semantics require consistently returning either the NaN or the non-NaN operand, and because of that a tree reduction should be possible. See the sketch below.
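A minimal sketch of a single fcmp+select min step, showing why it is not symmetric in the presence of NaNs:

```llvm
; fcmp olt is an ordered comparison: it is false whenever either operand
; is a NaN, so the select then falls through to %y. Which operand a NaN
; reaches therefore decides whether it survives the reduction.
define float @min_step(float %x, float %y) {
  %cmp = fcmp olt float %x, %y
  %r = select i1 %cmp, float %x, float %y
  ret float %r
}
```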

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

> Gonzalo, can you elaborate on why you need these operations to be non-NaN-propagating?

I don't need these semantics per se. I do, however, need the same semantics on all architectures, independently of the floating-point type (f16 vs f32 vs f64) and the vector length; otherwise these intrinsics would be of little use, since their behavior would change per architecture, floating-point type, and/or vector length.

llvm.experimental.vector.reduce.fmin has fmin in its name. The fmin libm function is specified to implement IEEE 754-2018 9.6 minimumNumber:

> minimumNumber(x, y) is x if x < y, y if y < x, and the number if one operand is a number and the other is a NaN. [...]

That is, if either operand is a NaN, the non-NaN operand is returned.

IIRC this is the current behavior of this intrinsic on wasm32 for all vectors of float and double that have fewer than 16 elements, but the behavior for vectors with 16 and 32 elements appears to differ (hence this issue). IIRC this is also the current behavior on x86_64, some x86 targets, sparc64, mips32, and mips64.

Whether this is the intended behavior, I don't know; as you have discovered, the LangRef is really vague. AFAICT the alternative here is to implement the IEEE 754-2018 9.6 minimum semantics:

> minimum(x, y) is x if x < y, y if y < x, and a quiet NaN if either operand is a NaN. [...]

That is, if either operand is a NaN, a NaN is returned.

IIRC, LLVM has an llvm.minnum intrinsic that implements the minimumNumber semantics (or something close to them), and to implement the minimum semantics one has to use fcmp+select. A sketch of both follows.
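A sketch of the two semantics at the IR level; llvm.minnum exists as named, while the NaN-propagating variant below is an illustrative assumption built from fcmp+select:

```llvm
declare float @llvm.minnum.f32(float, float)

; minimumNumber-like: llvm.minnum returns the number when exactly one
; operand is a NaN.
define float @min_number(float %x, float %y) {
  %r = call float @llvm.minnum.f32(float %x, float %y)
  ret float %r
}

; minimum-like (NaN-propagating): detect the unordered case explicitly
; and return a quiet NaN instead.
define float @min_propagating(float %x, float %y) {
  %m = call float @llvm.minnum.f32(float %x, float %y)
  %unord = fcmp uno float %x, %y
  %r = select i1 %unord, float 0x7FF8000000000000, float %m
  ret float %r
}
```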

tlively commented 6 years ago

Gonzalo, can you elaborate on why you need these operations to be non-NaN-propagating? The LLVM language reference does not specify the NaN-propagation behavior of these intrinsics, and the NaN-propagating version lowers to far fewer wasm instructions than the non-NaN-propagating version would. It seems to me that this is working as intended.

tlively commented 6 years ago

Adding Amara Emerson, since this behavior is a result of how llvm.experimental.vector.reduce.{fmin,fmax} is lowered.

Amara, the LangRef says "If the intrinsic call has the nnan fast-math flag then the operation can assume that NaNs are not present in the input vector." But here there is no nnan fast-math flag. Can you clarify what the expected behavior is when the vector contains a NaN? From what I can tell, the current implementation assumes there are no NaNs. In particular, llvm::createMinMaxOp in LoopUtils.cpp always attaches all fast-math flags to the emitted comparison, whether the vector reduce intrinsic was called with fast-math flags or not; see the sketch below.
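To make that concrete, here is a hand-written sketch of one step of the tree reduction as createMinMaxOp emits it, with the fast flag (which implies nnan) on the compare even though the original reduce call carried no flags:

```llvm
; Sketch, not LoopUtils' literal output: the unconditional fast flag is
; what licenses the NaN-ignoring lowering observed in this bug.
define <4 x float> @reduce_step(<4 x float> %v, <4 x float> %hi) {
  %cmp = fcmp fast olt <4 x float> %v, %hi
  %min = select <4 x i1> %cmp, <4 x float> %v, <4 x float> %hi
  ret <4 x float> %min
}
```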

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

> Would you be able to provide the LLVM IR emitted for the f32x16 version of the function that fails the test?

This one:

```llvm
define float @min_16x32(<16 x float> %a) {
  %b = tail call float @llvm.experimental.vector.reduce.fmin.f32.v16f32(<16 x float> %a)
  ret float %b
}

declare float @llvm.experimental.vector.reduce.fmin.f32.v16f32(<16 x float>)
```

tlively commented 6 years ago

Thanks for the report! Would you be able to provide the LLVM IR emitted for the f32x16 version of the function that fails the test?

54aefcd4-c07d-4252-8441-723563c8826f commented 6 years ago

assigned to @tlively