rust-lang / unsafe-code-guidelines

Forum for discussion about what unsafe code can and can't do
https://rust-lang.github.io/unsafe-code-guidelines
Apache License 2.0

Our floating point semantics were a mess #237

Closed RalfJung closed 2 months ago

RalfJung commented 4 years ago

Floating-point semantics are hard, in particular the NaN part, but this should describe them accurately -- except on x86-32, for more than one reason.

We still need to officially decide that those are our intended semantics though. https://github.com/rust-lang/rust/issues/73328 tracks that on the rustc side.

Historic info: There are several ways in which our de-facto FP semantics are currently broken:

* LLVM doesn't preserve NaN bits (which is likely incoherent). That's https://github.com/rust-lang/rust/issues/73328, which is a meta-issue. A specific issue: rust-lang/rust#55131 (on all platforms).
* LLVM uses the x87 instructions and registers on i686, which behave differently from how IEEE floats should behave. That's https://github.com/rust-lang/rust/issues/72327 (without SSE2) and https://github.com/rust-lang/rust/issues/73288 (with SSE2).
* LLVM does not distinguish sNaN and qNaN the way IEEE mandates: https://github.com/rust-lang/rust/issues/107247.

I'm adding this here because figuring out the semantics of Rust is part of the goal of the UCG (I think?) and we should have an overarching "FP semantics" tracker *somewhere*.
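As a concrete illustration of what "preserving NaN bits" means: `f32::from_bits` and `f32::to_bits` are documented as plain bit reinterpretations, so a NaN payload survives a round trip through them (modulo the x87 caveats on x86-32 discussed above). The payload value here is an arbitrary example; a sketch:

```rust
fn main() {
    // A quiet NaN with a nonzero, arbitrarily chosen payload.
    let bits: u32 = 0x7FC0_1234;
    let x = f32::from_bits(bits);
    assert!(x.is_nan());
    // from_bits/to_bits are bit reinterpretations, so the bits round-trip.
    assert_eq!(x.to_bits(), bits);
    // By contrast, arithmetic operations that *produce* a NaN make no such
    // promise about which payload comes out -- that is the issue above.
}
```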
workingjubilee commented 4 years ago

Worth noting: we can't expect floating point to be handled consistently on all architectures and over all datatypes that use floating points, yet it might be confusing if floating points have wildly different behavior based on the box they're in. https://github.com/rust-lang/rfcs/pull/2977#issuecomment-683443821

scottmcm commented 4 years ago

This also reminds me of https://github.com/rust-lang/rust/issues/75786 -- We should probably have an explicit accuracy promise documented somewhere, including for library functions.

(Hopefully we can at least promise ±1ULP everywhere, though some things really ought to be ±½ULP.)
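For readers unfamiliar with the unit: the distance in ULPs between two floats can be computed from their bit patterns, since IEEE 754 bit patterns of same-sign floats are ordered like integers. `ulp_diff` below is a hypothetical helper, not a std API:

```rust
// Hypothetical helper (not part of std): distance in ULPs between two
// finite f64 values.
fn ulp_diff(a: f64, b: f64) -> u64 {
    // Map a float's bits to an unsigned key that is monotone in numeric
    // order: flip all bits of negatives, set the top bit of non-negatives.
    fn key(x: f64) -> u64 {
        let bits = x.to_bits();
        if bits >> 63 == 1 { !bits } else { bits | (1 << 63) }
    }
    key(a).abs_diff(key(b))
}

fn main() {
    let next = 1.0 + f64::EPSILON; // the float immediately after 1.0
    assert_eq!(ulp_diff(1.0, next), 1); // "±1ULP" allows this much error
    assert_eq!(ulp_diff(1.0, 1.0), 0);
}
```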

RalfJung commented 4 years ago

Worth noting: we can't expect floating point to be handled consistently on all architectures and over all datatypes that use floating points

Usually, that's why specs can be under-specified -- they can leave room for implementations to differ, either via non-determinism (that's what wasm does for NaNs; would be interesting what they do for denormals) or via introducing unspecified target-specific "canonicalization" or similar mechanisms.

This also reminds me of rust-lang/rust#75786 -- We should probably have an explicit accuracy promise documented somewhere, including for library functions.

That's just trig (and other) functions being platform-specific (the division/multiplication in the title is a red herring). So it's related but the issue here is about how operations provided by the core language behave. We can worry about trig functions once we have that sorted out. ;)

workingjubilee commented 4 years ago

Adding some background data: I recently discovered that Intel recommends setting the denormals-are-zero and flush-to-zero flags on their C++ compiler unless floating point denormal accuracy is application critical: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/floating-point-operations/understanding-floating-point-operations/setting-the-ftz-and-daz-flags.html

In addition, the fact that a floating point op can vary in speed by 2 orders of magnitude is a rich source of data for timing attacks: https://cseweb.ucsd.edu/~dkohlbre/papers/subnormal.pdf

Also, explicit fused multiply-add ops are part of the IEEE754-2008 standard and usually can lead to increased precision by reducing the number of roundings that occur... but that can also reduce the accuracy of later operations that encode assumptions like "this is commutative, right?"
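Rust exposes the explicit fused operation as `f64::mul_add`, which performs the multiply and add with a single final rounding. A sketch of the precision difference, assuming the platform's `fma` is correctly rounded (which the standard ones are):

```rust
fn main() {
    let x = 1.0 + f64::EPSILON;    // 1 + 2^-52
    // Exact square is 1 + 2^-51 + 2^-104; the 2^-104 term is rounded away.
    let prod = x * x;
    // A fused multiply-add can recover that rounding error exactly,
    // because it rounds only once.
    let err = x.mul_add(x, -prod);
    assert_eq!(err, f64::EPSILON * f64::EPSILON); // 2^-104, nonzero
}
```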

RalfJung commented 4 years ago

In addition, the fact that a floating point op can vary in speed by 2 orders of magnitude is a rich source of data for timing attacks: https://cseweb.ucsd.edu/~dkohlbre/papers/subnormal.pdf

Speed is not currently part of the Rust specification, so that aspect, while certainly relevant in general, is unrelated to specifying floating-point behavior in Rust.

Also, explicit fused multiply-add ops are part of the IEEE754-2008 standard and usually can lead to increased precision by reducing the number of roundings that occur... but that can also reduce the accuracy of later operations that encode assumptions like "this is commutative, right?"

AFAIK LLVM does not introduce FMAs unless we explicitly set a flag, so right now I think there is no problem here. Unless you want to suggest we should introduce FMAs; that could complicate our floating-point story even further, but in ways that are orthogonal to the problems listed here. There was at least one attempt at an RFC here (https://github.com/rust-lang/rfcs/pull/2686), but since it is a hypothetical future (we don't introduce FMAs right now) I don't think it needs tracking here.

Adding some background data: I recently discovered that Intel recommends setting the denormals-are-zero and flush-to-zero flags on their C++ compiler unless floating point denormal accuracy is application critical: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/floating-point-operations/understanding-floating-point-operations/setting-the-ftz-and-daz-flags.html

What does this mean for Rust programs? This sounds a bit like a fast-math-style flag? Similar to the optimizer introducing FMAs, to my knowledge the complications around that are orthogonal to figuring out what the Rust semantics for completely standard IEEE floating-point operations are when NaNs or weird platforms (i686) are involved.

Fast-math is its own can of worms, and very little is known about how to specify it precisely. In contrast, the issue here is mostly about figuring out what LLVM (and, by extension, hardware) actually does; writing a reasonable spec is easy ("copy wasm"), but making sure LLVM conforms to that spec is hard because, as usual, LLVM IR behavior is scarcely documented.

Or is this related to the denormal problem around SSE that you raised earlier?

Lokathor commented 4 years ago

I think a separate issue is warranted for denormals and fast-math issues; as long as a fix to our NaN and x87 issues doesn't block potential avenues there, they're almost entirely separate.

antoyo commented 2 years ago

Not sure where to post this comment, but rustc_codegen_gcc produces a NaN with a different sign than Rust with LLVM does. I'm wondering what to do about this, but I guess the solution has not been figured out.

digama0 commented 2 years ago

@thomcc wrote a lengthy post about this in https://rust-lang.zulipchat.com/#narrow/stream/219381-t-libs/topic/float.20total_cmp/near/270506799, and the surrounding conversation is also relevant.

workingjubilee commented 2 years ago

I believe I have devised a path forward which may allow us to cut out the i686 issue: simply have Rust functions use the calling convention that LLVM annotates them with after optimizations, and guarantee that SSE2 is enabled and x87 is disabled outside extern "C" and the like.

workingjubilee commented 2 years ago

As far as the NaN thing:

Both GCC and LLVM incorrectly conflate "the NaN payload is implementation defined" with "the sign is undef": you can simply issue copysign on them all day, both to and from. That defines the sign, and it can then be recovered, and copysign is not supposed to change the NaN value otherwise. So -NaN and +NaN are separate domains in the IEEE754 "universe", both subdomains of NaN, and both compilers are wrong. The choices they have made may be technically legal under very strained interpretations of the standard, but they are not truly correct when used to justify optimizations.
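The copysign argument can be seen directly: the sign-manipulation operations (`neg`, `abs`, `copysign`) are defined to act on the bits, so a NaN's sign is fully observable and settable. A sketch using an explicitly constructed quiet NaN:

```rust
fn main() {
    let nan = f32::from_bits(0x7FC0_0000); // a positive quiet NaN
    // copysign is a bit operation, so it reliably sets the sign of a NaN...
    let neg_nan = nan.copysign(-1.0);
    assert!(neg_nan.is_nan() && neg_nan.is_sign_negative());
    // ...and the sign can be recovered and flipped back.
    assert!(neg_nan.copysign(1.0).is_sign_positive());
}
```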

RalfJung commented 2 years ago

Wait, does LLVM really treat any part of a NaN as undef? That would be a problem -- it would make f32::to_bits unsound, since one could use it to safely produce a (partially) undef u32.

workingjubilee commented 2 years ago

I think they treat producing a NaN value as logically originating from an undefined origin every time it is produced, and thus feel entitled to select an arbitrary value each time. So perhaps I should say they treat it as round-tripping through the whocares?(nan) -> nan function, which is, er, a non-injective function, to say the least. :^) I believe by the time they actually place the NaN in the operation they treat it as a defined value.

RalfJung commented 2 years ago

Ah okay. As long as whocares produces defined bits (not undef/poison), we are good. (For some values of "good"...)

RalfJung commented 1 year ago

Cc @thomcc into this thread in case they haven't seen it yet (not a new thread, more like a meta issue).

On the LLVM forum there's a discussion around their NaN semantics, if there are some questions we have to add there from the Rust side, now might be a good time. :)

thomcc commented 1 year ago

Ah, I'll try to comment in that thread tomorrow. I already see many incorrect comments -- I think they've been misled by the name canonicalization (every NaN for binary floating point is canonical; the only non-canonical ones are in decimal floating-point types...)

RalfJung commented 1 year ago

(For the purpose of this discussion, let us ignore x86-32 and subnormals on ARM -- I currently expect those targets will have to have notes added to them on the target support page stating that floating point support on those targets is incomplete. We can separately discuss what can be done to make these targets as close to conforming as possible, but that is another topic.)

@workingjubilee my impression is that the LLVM semantics are consistent with saying that every float operation (except for a few that are defined to act on the bits: copy, negate, abs, copySign, according to this), if it returns a NaN, non-deterministically picks the NaN payload and sign for that result. This is like freeze undef for those bits, but crucially, it is not like undef. Is that not correct?

The most contentious point here seems to be whether the sign bit of a NaN is truly non-deterministic like that. This seems to agree with that interpretation, as does this and also wasm says that the sign is non-det. But we also have @thomcc stating things like

LLVM's APFloat currently gets the sign bit of NaNs wrong on many architectures, including common ones like x86_64.

which indicates that there is such a thing as the 'right' sign of a NaN, so unless this was specifically referring to one of these bitwise operations it seems to contradict the other citations above.


I was about to say that this would be exactly what wasm does, but it seems they changed: looking at wasm again, it now says

If the payload of all NaN inputs to the operator is canonical (including the case that there are no NaN inputs), then the payload of the output is canonical as well.

They define 'canonical' themselves, so I don't think it is the same 'canonical' as what @thomcc refers to.

A canonical NaN is a floating-point value [with a] payload whose most significant bit is 1 while all others are 0.
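Under wasm's definition, "canonical" is a pure bit-pattern check. A sketch for f32, where the payload's most significant bit is fraction bit 22 (the helper name is ours, not from any spec):

```rust
// wasm's "canonical NaN" for f32: exponent all ones, payload MSB set,
// all other payload bits zero; either sign is allowed.
fn is_wasm_canonical_nan(x: f32) -> bool {
    (x.to_bits() & 0x7FFF_FFFF) == 0x7FC0_0000 // mask off the sign bit
}

fn main() {
    assert!(is_wasm_canonical_nan(f32::from_bits(0x7FC0_0000)));
    assert!(is_wasm_canonical_nan(f32::from_bits(0xFFC0_0000))); // negative sign is fine
    assert!(!is_wasm_canonical_nan(f32::from_bits(0x7FC0_0001))); // extra payload bit
}
```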

bjorn3 commented 1 year ago

LLVM's APFloat currently gets the sign bit of NaNs wrong on many architectures, including common ones like x86_64.

which indicates that there is such a thing as the 'right' sign of a NaN, so unless this was specifically referring to one of these bitwise operations it seems to contradict the other citations above.

I think that was meant as not matching what actual x86_64 hardware does.

RalfJung commented 1 year ago

Ah, fair. But it does represent an expectation that Rust actually promises to match the hardware here -- which neither wasm nor LLVM seem inclined to do. So I think we shouldn't do it in Rust, either. Which would mean whatever apfloat does with NaN signs is correct by definition (except on fneg, fabs, copysign where the output must match the spec).

thomcc commented 1 year ago

I think we should try to match the hardware for signs (and ideally payloads), on targets where it's deterministic. That doesn't seem like an undue burden to me (the main difficulties being that it requires an LLVM change, and around rustc_apfloat's license).

RalfJung commented 1 year ago

I'm not a big fan of importing target semantics into the language like that... and also LLVM seems entirely uninterested in providing that guarantee, so can this even be more than wishful thinking?

thomcc commented 1 year ago

I got the impression that they were mostly okay with fixing it for the "default nan" case (annoyingly called canonical in some places, despite nan canonicalization being a completely different thing), which is the main[^1] case where we get this wrong currently.

[^1]: The other case where we get it wrong is when there are multiple NaN inputs to an operation, and promising anything about this case does seem fairly reasonable to punt on.

RalfJung commented 1 year ago

At least on the wasm side, the following applies to the payload only, not the sign:

If the payload of all NaN inputs to the operator is canonical (including the case that there are no NaN inputs), then the payload of the output is canonical as well.

thomcc commented 1 year ago

I'm actually a lot less concerned about NaN payloads, since the only case we behave in a manner inconsistent with the target is on x86 with x87 floats, which there's not a lot we can really do about (short of changing the Rust ABI to reduce the likelihood of this, which would probably be good anyway).

The sign of NaNs is something we get wrong in a way that causes consistent confusion and bugs in user programs. I get that fixing it is annoying (especially given wasm), but it seems worth it to me.

thomcc commented 1 year ago

One footgun caused by the NaN sign-bit being non-deterministic is that APIs like https://doc.rust-lang.org/nightly/std/primitive.f32.html#method.total_cmp (and thus https://doc.rust-lang.org/nightly/std/primitive.slice.html#method.sort_floats) will sort negative NaN at the very front, and positive NaN at the very back, which is an extremely visible difference.

In practice, this non-determinism shows up as "behaves one way with optimizations on, and another with them off". This means that code that sorts slices containing NaN will likely pass tests that assume one behavior, and then behave incorrectly in production.
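The footgun can be demonstrated with explicitly constructed NaNs: `total_cmp` places negative NaN before everything and positive NaN after everything, so any sign flip in a NaN-producing operation reorders the slice. A sketch:

```rust
fn main() {
    let neg_nan = f32::from_bits(0xFFC0_0000); // quiet NaN, sign bit set
    let pos_nan = f32::from_bits(0x7FC0_0000); // quiet NaN, sign bit clear
    let mut v = [1.0f32, pos_nan, -1.0, neg_nan];
    v.sort_by(|a, b| a.total_cmp(b));
    assert!(v[0].is_nan() && v[0].is_sign_negative()); // -NaN sorts first
    assert!(v[3].is_nan() && v[3].is_sign_positive()); // +NaN sorts last
    assert_eq!(&v[1..3], &[-1.0, 1.0]);
    // If an operation's NaN sign flips between debug and release builds,
    // the sorted order of such a slice flips with it.
}
```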

RalfJung commented 1 year ago

The sign of NaNs is something we get wrong in a way that causes consistent confusion and bugs in user program, I get that fixing it is annoying (especially given wasm), but it seems worth it to me.

For wasm there is no fixing this, wasm makes no NaN sign guarantees. We can only fix this for some targets, not all, by introducing target-specific guarantees.

I get that non-determinism is annoying, though whether it is more annoying than making this target-specific, I am not convinced. For Miri, the non-deterministic option is certainly easier, we can just pick a random sign bit after each float operation. In the target-specific case we'd have to have a big page stating for each and every target what our float sign behavior is (with a fallback to "non-deterministic"), and Miri would need to have a machine-readable version of that same data.

I am also honestly not sure we have a choice here. The fact that wasm guarantees something for payloads but punts on the sign is a bad sign. Have you tried getting LLVM to acknowledge that 'NaN sign is different between debug and release builds' is even a bug? Because if LLVM doesn't want to provide this guarantee, I think there is naught we can do.

RalfJung commented 1 year ago

The sign of NaNs is something we get wrong in a way that causes consistent confusion and bugs in user program, I get that fixing it is annoying (especially given wasm), but it seems worth it to me.

To comment on this a bit more: programs that make assumptions here will be wrong on some targets. So is it really better to make them only wrong on wasm? That just makes it even less likely that people are aware of this.

RalfJung commented 1 year ago

I guess the only way to really avoid this would be to convince wasm to make this 'consistent non-determinism', where the NaN sign must be the same each time if the inputs are the same, or something like that. They might be open to that.

But honestly to me this sounds more like a case of "we specify the non-determinism now, and tell users they need to deal with that, and in parallel we work with LLVM and wasm towards being able to guarantee more". That seems better than leaving things entirely in limbo.

thomcc commented 1 year ago

But honestly to me this sounds more like a case of "we specify the non-determinism now, and tell users they need to deal with that, and in parallel we work with LLVM and wasm towards being able to guarantee more". That seems better than leaving things entirely in limbo.

This sounds very reasonable to me.

RalfJung commented 1 year ago

Okay. :) To be clear I won't have the time to push on that, I mostly care to fill this gaping hole in our semantics with something. So there is a risk that not much progress happens after we specify some basic non-deterministic overapproximation. The most extreme overapproximation would be to just say that sign and payload are non-det each time a NaN is generated; restricting the payload somewhat should also be possible but that depends on the outcome of the LLVM discussion.

Let's see how the LLVM discussion turns out, and what @workingjubilee thinks. Then this might be ready for an RFC? I feel like discussion splits into 3 branches here:

thomcc commented 1 year ago

To be clear I won't have the time to push on that

Yeah, I'll try, but it probably won't be until after the new year (too busy until then). I suspect the path is trying to get this fixed in LLVM first, then investigating the rest (and figuring out why it's still non-deterministic in WebAssembly). On the bright side, even 32-bit x86 preserves NaN signs (just not NaN payloads), so I'm somewhat optimistic this can be resolved.

So there is a risk that not much progress happens after we specify some basic non-deterministic overapproximation

It is what it is.

This probably should block stabilization of slice::sort_floats (https://github.com/rust-lang/rust/issues/93396), at least with its current implementation. If we're going to decide that the sign of NaN is non-deterministic, we should not stabilize a sorting function that puts NaNs as far apart as possible based on their sign.

I had brought this up with float::total_cmp before, but it seems I only did so on zulip, and not in its tracking issue. I had not expected us to decide this is expected behavior (rather than a bug), so I didn't push on it very hard.

conrad-watt commented 1 year ago

Just popping in from the Wasm side to say that I'm happy to have conversations here. Apologies if I've missed this being brought up earlier in this issue, but one of the motivations for our current semantics was a divergence of NaN bit patterns between x86 and Arm - see this comment. Presumably LLVM has to worry about something similar if it wants to give a semantics that abstracts over both?

workingjubilee commented 1 year ago

As I stated elsewhere, those are compile-time constants. Rust also happens to abstract over usize being different sizes on different targets: it's not actually that hard.

workingjubilee commented 1 year ago

LLVM does not necessarily abstract over compilation targets as much as you might think it does. I have several times asked LLVM developers about how LLVM would handle various somewhat unusual compilation requests, and been frequently told that the frontend should lower to operations the machine can specifically perform. There are some things that LLVM does abstract over but less than you might imagine when you get to actually pushing on the edges of the LangRef, so it is a "target independent" code generator only if you apply a huge series of asterisks to that, even before we introduce actual inline assembly to the mix.

RalfJung commented 1 year ago

I don't think we have a concept of target-specific constants currently, but it sounds like a reasonable enough thing to add.

sunfishcode commented 1 year ago

That's a fundamental difference between Rust on all platforms other than Wasm, and Wasm. Rust doesn't make any attempt to hide the target architecture; source code can explicitly ask what the architecture is. In Wasm, the target architecture is hidden; it's not known at all at source-code compile time, and at runtime it's only observable if you know what NaN bits to look for, or maybe if you look really closely at the behavior of memory shared between threads.

On everything except Wasm, Rust could be ok saying "the NaN generated by 0.0/0.0 is a target-specific constant". But Wasm fundamentally doesn't know what the target architecture is going to be until at least program startup time. So on Wasm, making it a compile-time constant would require taking a performance hit on at least one popular architecture. We don't have comprehensive data, but one benchmark suggests it could be in the 5%-15% range on floating-point code.

RalfJung commented 1 year ago

Oh right, so this is not really like usize.

A "value picked non-deterministically once at program startup" would still work though, if wasm would be willing to guarantee that. Though the wasm issue mentions migrating live code from x86 to ARM, so maybe wasm has to stay non-det... in which case the best Rust can do is say "compile-time constant on some targets, fully non-det on others".

Muon commented 1 year ago

I know that this involves fixing a lot of things with LLVM, just thought I'd put in my two cents before an RFC is written up (FWIW, my PhD topic is on floating-point arithmetic decision procedures). Happy to answer any questions.

RalfJung commented 1 year ago

The rounding mode must be changeable in some way.

That is a non-goal for now, as far as I am concerned.

Const evaluation must report failure if the result is not deterministic.

I don't think we want this; also see https://github.com/rust-lang/rfcs/pull/3352. I think we want to just say that const-eval produces some unspecified result; the result is deterministic for any given build of the compiler and target configuration but may otherwise vary.

antoyo commented 1 year ago
  • The sign and payload bits of a NaN result are unspecified.

Does that mean that it's OK if the GCC and LLVM backends produce a different result for 0.0 / 0.0?

digama0 commented 1 year ago

Isn't that already the case? LLVM will const-eval floats IIRC, and it does not necessarily use the same sign propagation conventions as the target, which means that unless the optimizations exactly line up you can get different results on different compilers for that (not 0.0 / 0.0 literally but something equivalent to it after passing the values through variables and functions).

Muon commented 1 year ago

The rounding mode must be changeable in some way.

That is a non-goal for now, as far as I am concerned.

That's okay, although it would be preferable if you don't close that off. It's really painful to emulate other rounding modes.

Const evaluation must report failure if the result is not deterministic.

I don't think we want this; also see rust-lang/rfcs#3352. I think we want to just say that const-eval produces some unspecified result; the result is deterministic for any given build of the compiler and target configuration but may otherwise vary.

It's not true that we need to relax the restrictions on const evaluation in order to allow floating-point to be used in it. We're currently assuming a fixed rounding mode already (round to nearest, ties to even), so the results of all floating-point computations that don't depend on the unspecified bits of a NaN are already fully determined. Any inconsistency or apparent nondeterminism in NaN-free code is a bug.

Among IEEE 754-conforming hardware implementations, the only differences may be in the production of NaNs. Picking an arbitrary NaN among the possible NaNs allowed by IEEE 754 is incorrect, since the hardware will actually produce specific NaNs; we just don't know which. It is surprising if const evaluation produces results which are unobtainable in a runtime evaluation. This is why we must overapproximate.

Note again that the only instances in which the result is not determined are those in which we directly or indirectly depend on the unspecified bits of a NaN.

  • The sign and payload bits of a NaN result are unspecified.

Does that mean that it's OK if the GCC and LLVM backends produce a different result for 0.0 / 0.0?

Yes. That expression does not produce a fully determined result. A NaN is encoded as having any sign bit, all exponent bits set, and at least one significand bit set. The result of that expression will be a quiet NaN, because all computations produce quiet NaNs. Those are the only restrictions on the result. However, IEEE 754 does not require particular encodings for quiet or signaling NaNs, and leaves it to the target. Furthermore, different ISAs just produce different NaNs. In that regard, it cannot be soundly constant-folded without knowing the details of the target (if the ISA even specifies which NaNs it produces in that instance).

RalfJung commented 1 year ago

It is surprising if const evaluation produces results which are unobtainable in a runtime evaluation.

That's exactly what is being discussed in https://github.com/rust-lang/rfcs/pull/3352. I agree it is surprising but I think it is better than the alternatives. (Your statement boils down to "runtime behavior must be superset of compiletime behavior", which is being discussed in the RFC thread, though not yet in the RFC text.) So let's discuss that question there and not here.

Also note that all your statements assume that Rust float semantics are exactly IEEE float semantics, which is not a given. We eventually might adopt something like the wasm semantics which makes more guarantees than IEEE when it comes to NaNs.

In that regard, it cannot be soundly constant-folded without knowing the details of the target (if the ISA even specifies which NaNs it produces in that instance).

This statement is in fact wrong if Rust uses IEEE float semantics. Since under those semantics NaNs bits are picked non-deterministically when the NaN is produced, Rust can constant-fold such operations to an arbitrary NaN.

IOW, Rust does not guarantee that NaNs behave the same way as they would if you were to write assembly by hand. It guarantees that they behave as in a valid IEEE implementation (and maybe even a particular variant of that, as in the wasm case), but it doesn't have to be the same IEEE implementation as what your hardware does.
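Concretely, under the "non-deterministic NaN bits" reading, the same NaN-producing expression evaluated at compile time and at run time must each yield some NaN, but nothing forces the two bit patterns to agree. A sketch (using `black_box` only to keep the optimizer from folding the runtime computation):

```rust
fn main() {
    const AT_COMPILE_TIME: f32 = 0.0 / 0.0; // const-evaluated by rustc
    let zero = std::hint::black_box(0.0f32);
    let at_run_time = zero / zero;           // produced by the target hardware
    assert!(AT_COMPILE_TIME.is_nan());
    assert!(at_run_time.is_nan());
    // The bits may or may not match; under these semantics neither
    // outcome is a bug.
    println!("{:#010x} vs {:#010x}", AT_COMPILE_TIME.to_bits(), at_run_time.to_bits());
}
```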

Muon commented 1 year ago

Also note that all your statements assume that Rust float semantics are exactly IEEE float semantics, which is not a given. We eventually might adopt something like the wasm semantics which makes more guarantees than IEEE when it comes to NaNs.

I am indeed expecting that Rust uses IEEE 754 semantics. After all, it's what almost all hardware implements, and that's fast.

This statement is in fact wrong if Rust uses IEEE float semantics. Since under those semantics NaNs bits are picked non-deterministically when the NaN is produced, Rust can constant-fold such operations to an arbitrary NaN.

An arbitrary quiet NaN, specifically. However, if we're using the hardware FPU, we don't determine which NaNs are quiet, so the result is still target-dependent.

RalfJung commented 1 year ago

I am indeed expecting that Rust uses IEEE 754 semantics. After all, it's what almost all hardware implements, and that's fast.

Hardware refines IEEE, making particular choices for the nondeterministic operations.

For nondeterministic specs it is important to differentiate "Rust implements the spec" and "Rust exactly matches the spec" (refinement vs equivalence / subset of behaviors vs equal set of behaviors). Those things are not the same. I was asking whether Rust exactly matches the spec and you answered a different question by saying that Rust refines/implements the spec. Everyone agrees on the latter but there is some demand for Rust to have less non-determinism than what IEEE offers.

The proposal was to start by saying that Rust fully exploits the non-determinism offered by IEEE without any regard for the current target. This means optimizations and const-eval are free to make choices that the hardware would never make.

Further down the road we might want to refine this, making guarantees not made by IEEE: either by saying that we exactly match hardware behavior (which as you said makes optimizations tricky, and means we have to make some choices regarding how closely const-eval matches runtime behavior) or by carefully making some guarantees that still overapproximate hardware (like what wasm does). FWIW LLVM considers it a non-goal to match hardware behavior so that option would require a bunch of LLVM work before we could even seriously consider it.

Muon commented 1 year ago

Ah, I see. So, exactly matching the spec is in a certain sense ill-posed. The spec requires that both quiet and signaling NaNs exist and that computational operations only return quiet NaNs. That is, there is more than just nondeterminism at play here. For a (nondeterministic) abstract machine to comply, it must still at minimum choose some NaNs to be quiet and some to be signaling. Notably, different machines can make different choices. However, if necessary, Rust can leave the exact bit patterns unspecified.

Implementation-wise, there are a few pitfalls with actually achieving compliance. Firstly, at some point, rustc has to output specific bits for NaNs, and there are real-world CPUs which disagree about signaling and quiet NaNs. Notoriously, MIPS does the opposite of x86 and ARM. Secondly, LLVM is still probably far away from spec compliance, even disregarding NaNs (https://github.com/llvm/llvm-project/issues/44497, https://github.com/llvm/llvm-project/issues/43070, https://github.com/llvm/llvm-project/issues/24913, https://github.com/llvm/llvm-project/issues/25233, https://github.com/llvm/llvm-project/issues/18362).
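The qNaN/sNaN distinction is a single significand bit whose meaning is target-defined, which is exactly why rustc cannot pick NaN bits target-independently. A sketch assuming the x86/ARM convention (MIPS, in its legacy NaN mode, inverts it):

```rust
// x86/ARM convention: the most significant significand bit set means "quiet".
// The helper name is ours; on MIPS legacy mode this check would be inverted.
fn is_quiet_nan_x86_arm(x: f64) -> bool {
    x.is_nan() && (x.to_bits() & (1 << 51)) != 0
}

fn main() {
    // The usual quiet NaN bit pattern.
    assert!(is_quiet_nan_x86_arm(f64::from_bits(0x7FF8_0000_0000_0000)));
    // Quiet bit clear, payload nonzero: a signaling NaN on x86/ARM.
    assert!(!is_quiet_nan_x86_arm(f64::from_bits(0x7FF0_0000_0000_0001)));
}
```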

RalfJung commented 1 year ago

TBH my inclination is to entirely ignore everything related to signalling NaNs for now. AFAIK that is what LLVM does, so some groundwork will be needed elsewhere before we can even attempt to do any better.

Muon commented 1 year ago

That's somewhat tenable, I think? We don't currently have any way of distinguishing sNaN from qNaN. The ordering given by total_cmp (our version of the totalOrder predicate) is based on the standard's recommendation for what should be a signaling NaN, not the underlying implementation. The documentation may need to be adjusted. We'd also need to document that our operations are not strictly compliant and can in fact return any NaN pattern. That said, I'm not sure this is simpler than saying we return unspecified qNaNs and leaving unspecified which patterns are qNaNs/sNaNs.

RalfJung commented 1 year ago

I have written a pre-RFC to document our current floating-point semantics. Please comment on Zulip or hackmd!

Muon commented 1 year ago

I'll have a look through and comment on it either tomorrow or when I get back home in a few days.

RalfJung commented 1 year ago

So... I think this is not so much of a mess any more?

For the x86-32-specific trouble, the FCP in https://github.com/rust-lang/rust/pull/113053 went through. These are unambiguously platform-specific bugs but are hard to fix. We now have https://github.com/rust-lang/rust/issues/114479 and https://github.com/rust-lang/rust/issues/115567 tracking this.

Most of the rest boils down to "we don't guarantee much about the bits you get out of a NaN-producing operation". That is tracked in https://github.com/rust-lang/rust/issues/73328. My Pre-RFC describes our current de-facto guarantees.

I assume that RFC will need approval by t-opsem and t-lang. Let's keep an issue open on our side as well until we have a team decision on this.