rust-lang / stdarch

Rust's standard library vendor-specific APIs and run-time feature detection
https://doc.rust-lang.org/stable/core/arch/
Apache License 2.0

_mm512_reduce_add_ps and friends are setting fast-math flags they should not set #1533

Closed RalfJung closed 1 month ago

RalfJung commented 5 months ago

Today I learned about the existence of the simd_reduce_add_unordered intrinsic. When called on a float vector, this compiles to LLVM's llvm.vector.reduce.fadd with the "fast" flag set, which means that passing in NaN or infinity is UB, and optimizations are allowed "to treat the sign of a zero argument or zero result as insignificant" (which I take to mean the sign of input zeros may be non-deterministically swapped and returned zeros have non-deterministic sign).
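For context on why an "unordered" float reduction is observable at all: float addition is not associative, so a reassoc/fast reduction can legitimately return a different value than a reduction in the documented order. A minimal standalone illustration in plain Rust (no intrinsics involved):

```rust
fn main() {
    // Floating-point addition is not associative: reordering changes results.
    let (a, b, c) = (1.0e30_f32, -1.0e30_f32, 1.0_f32);
    let left_to_right = (a + b) + c; // (1e30 - 1e30) + 1.0 = 1.0
    let reassociated = a + (b + c);  // b + c rounds to -1e30 (c absorbed), so 0.0
    assert_eq!(left_to_right, 1.0);
    assert_eq!(reassociated, 0.0);
}
```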

This intrinsic is not used a lot in stdarch, but it has a total of 8 uses (all in avx512f.rs). 4 of these are integer intrinsics, where this should be entirely equivalent to simd_reduce_add; not sure why the "unordered" version is used. The other 4 are float intrinsics, _mm512_reduce_add_ps being the first:

https://github.com/rust-lang/stdarch/blob/4d9c0bb591336792c4c4baf293d0acc944e57e28/crates/core_arch/src/x86/avx512f.rs#L31262-L31270

/// Reduce the packed single-precision (32-bit) floating-point elements in a by addition. Returns the sum of all elements in a.
///
/// [Intel's documentation](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm512_reduce_add_ps&expand=4562)
#[inline]
#[target_feature(enable = "avx512f")]
#[unstable(feature = "stdarch_x86_avx512", issue = "111137")]
pub unsafe fn _mm512_reduce_add_ps(a: __m512) -> f32 {
    simd_reduce_add_unordered(a.as_f32x16())
}

Neither the docs here nor Intel's docs mention that this is UB on NaN or infinity, nor the concerns around signed zeros and performing the addition in an unspecified order. Given that the Intel docs should be the authoritative docs (since this is a vendor intrinsic), why is it even correct to use fast-math flags here? Either the docs need to be updated to state the fast-math preconditions, or the implementation needs to be updated to avoid the fast-math flag. Maybe it should only use "reassoc", not the full (and unsafe) "fast" flag? But even that should probably be mentioned in the docs.

Amanieu commented 5 months ago

It is definitely incorrect for these intrinsics to be using the fast-math flag.

Here is the LLVM IR that clang generates for this intrinsic:

%0 = tail call reassoc noundef float @llvm.vector.reduce.fadd.v16f32(float -0.000000e+00, <16 x float> %x)
RalfJung commented 5 months ago

That allows LLVM to add the elements in any order, and also do re-association optimizations when the result is fed into another reassoc function. I don't see how that matches the Intel docs which describe a very particular order of summation:

DEFINE REDUCE_ADD(src, len) {
    IF len == 2
        RETURN src[31:0] + src[63:32]
    FI
    len := len / 2
    FOR j:= 0 to (len-1)
        i := j*32
        src[i+31:i] := src[i+31:i] + src[i+32*len+31:i+32*len]
    ENDFOR
    RETURN REDUCE_ADD(src[32*len-1:0], len)
}
dst[31:0] := REDUCE_ADD(a, 16)

(It seems like Intel uses array[last_elem:first_elem] syntax for bitwise subslicing, which must be the least intuitive subslicing syntax I have ever seen.)
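For reference, the REDUCE_ADD pseudocode above describes a pairwise (tree) reduction: each round adds the upper half of the vector onto the lower half, then recurses on the lower half. A scalar Rust model of it, as an illustrative sketch only (not the stdarch implementation), might look like:

```rust
/// Scalar model of Intel's REDUCE_ADD pseudocode: fold the upper half
/// onto the lower half element-wise, then recurse on the lower half.
fn reduce_add(src: &[f32]) -> f32 {
    if src.len() == 2 {
        return src[0] + src[1];
    }
    let half = src.len() / 2;
    let mut lo = src[..half].to_vec();
    for j in 0..half {
        lo[j] = src[j] + src[j + half];
    }
    reduce_add(&lo)
}

fn main() {
    let a: Vec<f32> = (1..=16).map(|i| i as f32).collect();
    // 1 + 2 + ... + 16 = 136, exactly representable, so every order agrees here.
    assert_eq!(reduce_add(&a), 136.0);
}
```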

RalfJung commented 5 months ago

Specifically, if I were to do something like

let sum = _mm512_reduce_add_ps(a);
let vec = _mm512_set_ps(sum, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0,
                        8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0, 15.0);
let sum2 = _mm512_reduce_add_ps(vec);

then in my reading of reassoc, LLVM would be allowed to arbitrarily reorder all 31 leaf elements being summed up here (the 16 lanes of a plus the 15 constants) when computing sum2. There's nothing that constrains it to only reassociate "within" a single _mm512_reduce_add_ps.

I don't think this is a correct implementation of the Intel vendor intrinsic.

workingjubilee commented 5 months ago

I do not believe it is, no.

workingjubilee commented 5 months ago

@nikic has confirmed to me that Ralf's concerns about reassoc being allowed to potentially do "basically whatever" are accurate, with the caveat that all the relevant operations have to be tagged with reassoc. They can't "jump" between reassoc to non-reassoc to reassoc.

So yes, these are very not correct implementations.

workingjubilee commented 5 months ago

The problem is that dropping reassoc doesn't seem to be correct either: without any fast-math flags, vector.reduce.fadd performs a strictly serial (left-to-right) reduction, and Intel's specified ordering is not serial.

RalfJung commented 5 months ago

Intel specifies a very specific order of summation. It's not left-to-right, which is what the no-reassoc version would do. From what I understand, there is anyway no single hardware operation that actually performs this particular kind of summation, so either LLVM needs to gain support for this specific operation (and lower it to the best instruction sequence), or we need to do the lowering ourselves in the implementation of _mm512_reduce_add_ps.
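To make the ordering distinction concrete: here is a small scalar illustration (hypothetical helper names, nothing to do with the actual lowering) where a strict left-to-right sum and a tree-shaped sum of the same f32 values produce different results:

```rust
/// Strict left-to-right (serial) summation, as no-flag vector.reduce.fadd would do.
fn seq_sum(v: &[f32]) -> f32 {
    v.iter().copied().fold(0.0, |acc, x| acc + x)
}

/// Recursive pairwise (tree) summation, as in Intel's REDUCE_ADD pseudocode.
fn tree_sum(v: &[f32]) -> f32 {
    if v.len() == 1 {
        return v[0];
    }
    let mid = v.len() / 2;
    tree_sum(&v[..mid]) + tree_sum(&v[mid..])
}

fn main() {
    // The 1.0 terms are absorbed differently depending on when the big terms cancel.
    let v = [1.0_f32, 1.0e8, -1.0e8, 1.0];
    assert_eq!(seq_sum(&v), 1.0);  // ((1 + 1e8) rounds to 1e8) - 1e8 + 1 = 1.0
    assert_eq!(tree_sum(&v), 0.0); // (1 + 1e8) + (-1e8 + 1) = 1e8 - 1e8 = 0.0
}
```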

workingjubilee commented 5 months ago

@RalfJung Hmm. There are a few different possible sequences a compiler can use, but one of the "obvious" ones is a sequence that just repeatedly uses the "do one round of tree-reduction" instruction, which works like you might imagine from that description I just gave.

nikic commented 5 months ago

Yes, in practice reassoc on reductions produces a tree reduction. Of course, this is not guaranteed from a semantics perspective.

workingjubilee commented 5 months ago

@RalfJung fwiw, Niki mentions that "perform a tree reduction" was proposed in the past as a possible annotation for the reduces, so perhaps that's the tree we should be barking up this time.

RalfJung commented 5 months ago

If that's a possibility then that would make most sense, yes -- have an intrinsic that reduces in a well-defined order that matches what the Intel docs say (i.e., tree reduction).

RalfJung commented 5 months ago

My understanding is that clang generates the same IR, so we should probably file this as an LLVM issue as well. How does one call these intrinsics in C?

Amanieu commented 5 months ago

> My understanding is that clang generates the same IR, so we should probably file this as an LLVM issue as well. How does one call these intrinsics in C?

https://godbolt.org/z/7Wbhjeo7n

RalfJung commented 5 months ago

Thanks! Filed an issue: https://github.com/llvm/llvm-project/issues/82813

RalfJung commented 5 months ago

@nikic it seems I don't know how to talk to LLVM people, they don't seem to agree with me on what it even means to have a LangRef. :/ Maybe you can help move the discussion in https://github.com/llvm/llvm-project/issues/82813 somewhere productive?

sayantn commented 1 month ago

We also can't keep using simd_reduce_add_unordered, because as LLVM says (at least in the LangRef) there is no guarantee of associativity in vector.reduce.fadd, so we can do what GCC does and hand-code the reduce-add ourselves. I did a small implementation in Godbolt for _mm512_reduce_add_ps here. It seems like LLVM is doing a spurious zero addition; I am no expert on floating point, but I think that addition with +0.0 is a no-op.
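A hedged aside on that last point: adding +0.0 is not quite a no-op in IEEE 754 arithmetic, because it normalizes a negative zero (-0.0 + 0.0 == +0.0). This is presumably also why the clang-generated IR quoted earlier seeds the reduction with -0.0, the actual additive identity for fadd. A quick check in plain Rust:

```rust
fn main() {
    let x = -0.0_f32;
    // Adding +0.0 flips the sign of negative zero under round-to-nearest.
    assert!(x.is_sign_negative());
    assert!((x + 0.0).is_sign_positive());
    // For ordinary nonzero values, x + 0.0 does leave the value unchanged.
    assert_eq!(1.5_f32 + 0.0, 1.5);
}
```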

Amanieu commented 1 month ago

This is now fixed by https://github.com/rust-lang/stdarch/pull/1594: all of these are now implemented by explicitly expanding to a sequence of operations instead of using the LLVM intrinsics.

RalfJung commented 1 month ago

Should we remove the unordered intrinsics or are they still useful?

Amanieu commented 4 weeks ago

They may still be useful for generic simd (cc @workingjubilee)