Missed SSE optimizations: properties of math functions and more

geometrian commented 2 years ago

Consider the following (compile with -std=c++2b -O3 -fno-math-errno; all tests run with Clang 16.0.0 from trunk):

float test(float radicand,float sign) {
    float radical = std::sqrt(radicand);
    return std::copysign(radical,sign);
}

This produces, in substance:

sqrtss  xmm0, xmm0
andps   xmm1, xmmword ptr [rip + .LCPI0_0]   #0x80000000, signbit mask
andps   xmm0, xmmword ptr [rip + .LCPI0_1]   #0x7fffffff, rest-of-float mask
orps    xmm0, xmm1

As we know, the real-valued square-root requires, and produces, a non-negative number. And yet, the result is still andpsed to remove the sign bit. This appears to be pointless.

Now, for negative numbers, sqrtss returns a NaN—in fact, a -NaN (empirically: negative numbers produce 0xFFC00000, regardless of magnitude, NaN arguments and -0.0f are passed through unchanged). Perhaps Clang is trying to preserve such behavior?

We can try to add assume statements in the likely event we don't care. However, none of e.g. the following appear to have any effect:

//Note also ensures not NaN
//Note using `>` instead of `>=` to demonstrate not being tripped up by -0
__builtin_assume(radicand>0.0f);

__builtin_assume(radical>0.0f); //(similar note)

__builtin_assume(!std::isnan(radical));

The following don't work either (while also producing an erroneous -Wassume warning the argument to '__builtin_assume' has side effects that will be discarded [-Wassume]; that's probably something that should be looked at):

__builtin_assume( (std::bit_cast<unsigned>(radical)&0x7FFFFFFFu) == std::bit_cast<unsigned>(radical) );

__builtin_assume( (std::bit_cast<unsigned>(radical)&0x80000000u) == 0u );

In fact, all of these together still have no effect.

The impact in this particular case might appear low—a single useless bitop, adding a cycle if they can't execute superscalar (e.g. if the other andps can be moved up) and the cost of storing the mask.

The reason I'm reporting this is that I think this class of missed opportunity is likely to exist widespread throughout the implementation in analogous cases. Many functions in the math and broader standard library assume bounded inputs, and probably even more produce bounded outputs. The missed opportunity, considered with all the nonlinear ways small optimizations can exceed the effect of their individual contributions when working together, is probably quite significant, especially for math-heavy code (such as I am interested in).

The language-anointed solution to such will likely be contracts, but those don't exist yet and it's not like language design changes what math is. At the end of the day, the compiler changes are the same anyway, enforcing contracts on math functions and pumping that information into the optimizer.

llvmbot commented 2 years ago

@llvm/issue-subscribers-backend-x86

rotateright commented 2 years ago

Are you expecting different x86 asm in the base case (no __builtin_assume)?

geometrian commented 2 years ago

Yes. The C standard (to which the C++ standard punts for this function) says of std::sqrt(...) that an implementation-defined value is returned if an argument that is not a nonnegative number is supplied. Additionally, for IEEE FP in particular, that should be a NaN. There is nothing that says the signedness or contents of a NaN needs to be retained (in fact, AFAICT the signedness and contents of NaN are considered implementation-dependent, even when the standard specifies IEEE FP). Thus, removing the second andps instruction is standard-compliant. Since that is also just better, the compiler should indeed remove it, regardless of any programmer-provided assumptions.

It is conceivable that someone somewhere relies on the internals of NaNs despite the lack of any guarantee, and so Clang is trying to be nice, yet even specifying -ffast-math (implying signed zeros, infinities, and NaN in general don't need to be respected, or even used) doesn't get Clang to drop the redundant instruction. And, of course, the assumptions should make the programmer's intent clear.

Just to clarify again though, this is not about std::sqrt(...) per-se. This is about domains and ranges for functions, and especially math functions, throughout the standard library, that could enable this broad category of optimizations.

rotateright commented 2 years ago

But IEEE-754 says "squareRoot(−0) shall be −0." How do you propose to handle that corner case without masking the sign bit?

RKSimon commented 2 years ago

It seems a shame knownbits fails to determine when a float's sign bit is known to be zero (positive)

https://godbolt.org/z/TTPhb1G9x

float test(float radicand,float sign) {
    radicand = std::copysign(radicand, 1.0f);
    float radical = std::sqrt(radicand);
    return std::copysign(radical,sign);
}

llvm / llvm-project

Missed SSE optimizations: properties of math functions and more #58415