
floating point to integer casts can cause undefined behaviour #10184

Closed. thestinger closed this issue 4 years ago.

thestinger commented 10 years ago

Status as of 2020-04-18

We intend to stabilize the saturating-float-casts behavior for as, and have stabilized unsafe library functions that handle the previous behavior. See #71269 for the latest discussion on that stabilization process.

Status as of 2018-11-05

A flag has been implemented in the compiler, -Zsaturating-float-casts, which causes all float-to-integer casts to have "saturating" behavior: if the value is out of bounds, it is clamped to the nearest bound. A call for benchmarking of this change went out a while ago. The results, while positive in many projects, are quite negative for some projects, and indicate that we're not done here.

The next steps are figuring out how to recover performance for these cases:

Old status

UPDATE (by @nikomatsakis): After much discussion, we've got the rudiments of a plan for how to address this problem. But we need some help with actually investigating the performance impact and working out the final details!


ORIGINAL ISSUE FOLLOWS:

If the value cannot fit in ty2, the results are undefined.

1.04E+17 as u8
nikomatsakis commented 7 years ago

@CryZe

Wouldn't it make sense to make it behave like overflow? So in a debug build it would panic if you do a cast with undefined behaviour.

This is what I thought initially, but I was reminded that as conversions never panic at present (we don't do overflow checking with as, for better or worse). So the most analogous thing is for it to "do something defined".

comex commented 7 years ago

FWIW, the "kill undef" thing would, in fact, provide a way to fix the memory unsafety while leaving the result nondeterministic. One of the key components is:

3) Create a new instruction, '%y = freeze %x', that stops propagation of poison. If the input is poison, then it returns an arbitrary, but fixed, value. (like old undef, but each use gets the same value), otherwise it just returns its input value.

The reason undefs can be used to violate memory safety today is that they can magically change values between uses: in particular, between a bounds check and subsequent pointer arithmetic. If rustc added a freeze after every dangerous cast, you would just get an unknown but otherwise well-behaved value. Performance-wise, freeze is basically free here, since of course the machine instruction corresponding to the cast produces a single value, not a fluctuating one; even if the optimizer feels like duplicating the cast instruction for some reason, it should be safe to do so because the result for out-of-range inputs is usually deterministic on a given architecture.

...But not deterministic across architectures, if anyone was wondering. x86 returns 0x80000000 for all bad inputs; ARM saturates for out-of-range inputs and (if I'm reading this pseudocode right) returns 0 for NaN. So if the goal is to produce a deterministic and platform-independent result, it's not enough to just use the platform's fp-to-int intrinsic; at least on ARM you also need to check the status register for an exception. This may have some overhead in itself, and certainly prevents autovectorization in the unlikely event that using the intrinsic didn't already. Alternately, I guess you could explicitly test for in-range values using regular compare operations and then use a regular float-to-int. That sounds a lot nicer on the optimizer…
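
For concreteness, "test for in-range values, otherwise produce some fixed value" could look roughly like this for f32 to i32 (a sketch; here out-of-range values and NaN map to 0, but any fixed choice would do):

```rust
// "Explicitly test for in-range values using regular compare operations and
// then use a regular float-to-int." 2147483520.0 is the largest f32 <= i32::MAX
// (2^31 - 128), and -2147483648.0 (i32::MIN) is exactly representable, so both
// comparisons are exact. NaN fails both comparisons and takes the fallback arm.
fn cast_f32_to_i32_defined(x: f32) -> i32 {
    if x >= -2_147_483_648.0 && x <= 2_147_483_520.0 {
        x as i32 // in range: the plain cast is well-defined
    } else {
        0 // out of range or NaN: some fixed, defined value
    }
}
```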

SimonSapin commented 7 years ago

as conversions never panic at present

At some point we changed + to panic (in debug mode). I wouldn't be shocked to see as panic in cases that were previously UB.

glaebhoerl commented 7 years ago

If we care about checking (which we should), then we should either deprecate as (is there any use case where it's the only good option?) or at least advise against using it, and move people onto things like TryFrom and TryInto instead, which is what we said we were planning to do back when it was decided to leave as as-is. I don't feel that the cases under discussion are qualitatively different, in the abstract, from the cases where as is already defined not to do any checks. The difference is just that in practice the implementation for these cases is currently incomplete and has UB. A world where you can't rely on as doing checks (because for most types, it doesn't), and you can't rely on it not panicking (because for some types, it would), and it's not consistent, and we still haven't deprecated it seems like the worst of all of them to me.
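
For the integer-to-integer case, the checked style being advocated here is straightforward (a sketch using std::convert::TryFrom, which was later stabilized; note that std provides no TryFrom impls from floats to integers, so the float case would still need its own API):

```rust
use std::convert::TryFrom;

// Checked narrowing conversion: returns None instead of silently truncating.
fn to_u8(x: i64) -> Option<u8> {
    u8::try_from(x).ok() // Err (hence None) for values outside 0..=255
}
```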

nikomatsakis commented 7 years ago

So, I think at this point @jorendorff basically enumerated what seems to me to be the best plan:

He enumerated three possibilities. I think the remaining work is to investigate those possibilities -- or at least investigate one of them. That is, actually implement it, and try to get some feeling for how "slow" or "fast" it is.

Is there anyone out there who feels motivated to take a stab at that? I'm going to tag this as E-help-wanted in hopes of attracting someone. (@oli-obk?)

kornelski commented 7 years ago

Uh, I'd rather not pay a price for cross-platform consistency :/ It's garbage in, so I don't care what garbage comes out (however, a debug assertion would be super helpful).

Currently all rounding/truncating functions in Rust are very slow (function calls with painstakingly precise implementations), so as is my last-resort hack for fast float rounding.

If you're going to make as anything more than bare cvttss2si, please also add a stable alternative that is just that.

est31 commented 7 years ago

@pornel this isn't just UB of the theoretical kind where stuff is okay if you ignore that it is UB; it has real-world implications. I've extracted #41799 from a real-world code example.

kornelski commented 7 years ago

@est31 I agree that leaving it as UB is wrong, but I've seen freeze proposed as a solution to UB. AFAIK that makes it a defined deterministic value, you just don't get to say which. That behavior is fine with me.

So I'd be fine if e.g. u128::MAX as f32 deterministically produced 17.5 on x86, and 999.0 on x86-64, and -555 on ARM.

hanna-kruppe commented 7 years ago

freeze would not produce a defined, deterministic, unspecified value. Its result is still "any bit pattern the compiler likes", and it is consistent only across uses of the same operation. This may sidestep the UB-producing examples people have collected above, but it would not give this:

u128::MAX as f32 deterministically produced 17.5 on x86, and 999.0 on x86-64, and -555 on ARM.

For example, if LLVM notices that u128::MAX as f32 overflows and replaces it with freeze poison, a valid lowering of fn foo() -> f32 { u128::MAX as f32 } on x86_64 might be this:

foo:
  ret

(that is, just return whatever was last stored in the return register)

kornelski commented 7 years ago

I see. That's still acceptable for my uses. (For cases where I expect out-of-range values, I do clamping beforehand. Where I expect values in range but they aren't, I'm not going to get a correct result no matter what.)

retep998 commented 7 years ago

I have no problem with out of range float casts returning arbitrary values as long as the values are frozen so they can't cause further undefined behavior.

nikomatsakis commented 7 years ago

Is something like freeze available on LLVM? I thought that was a purely theoretical construct.

eddyb commented 7 years ago

@nikomatsakis I've never seen it used like that (unlike poison) - it's a planned revamp of poison/undef.

hanna-kruppe commented 7 years ago

freeze does not exist at all in LLVM today. It's only been proposed (this PLDI paper is a self-contained version, but it's also been discussed a lot on the mailing list). The proposal seems to have considerable buy-in, but of course that is no guarantee it will be adopted, much less adopted in a timely manner. (Removing the pointee types from pointer types has been accepted for years and it's still not done.)

bstrie commented 7 years ago

Do we want to open up an RFC to get more widespread discussion on the changes being proposed here? IMO, anything that potentially impacts the performance of as is going to be contentious, but it will be doubly contentious if we don't give people the chance to make their voice heard.

simonbyrne commented 7 years ago

I'm a Julia developer, and I've been following this issue for a while, as we share the same LLVM backend and so have similar issues. In case it's of interest, here is what we have settled on (with approximate timings for a single function on my machine):

Also, I have asked on the mailing list about making the undefined behaviour a little more defined, but did not receive a very promising response.

nikomatsakis commented 7 years ago

@bstrie I'm fine with an RFC, but I think it'd definitely be useful to have data! @simonbyrne's comment is super helpful in that regard, however.

alexcrichton commented 7 years ago

I've toyed around with JS semantics (the modulo @jorendorff mentioned) and the Java semantics which appear to be the "saturation" column. In case those links expire it's JS and Java.

I also whipped up a quick implementation of saturation in Rust which I think (?) is correct. And got some benchmark numbers as well. Interestingly, I'm seeing that the saturating implementation is 2-3x slower than the intrinsic, whereas @simonbyrne found it only about 2x slower.
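
For reference, the general shape of the saturating conversion for f32 to i32 (a sketch, not necessarily the exact code linked above nor the fastest formulation):

```rust
// Saturating f32 -> i32: NaN maps to 0, out-of-range values clamp to
// i32::MIN / i32::MAX, and in-range values use the plain (well-defined) cast.
fn f32_to_i32_saturating(x: f32) -> i32 {
    const LOWER: f32 = -2_147_483_648.0; // i32::MIN, exactly representable
    const UPPER: f32 = 2_147_483_520.0;  // largest f32 <= i32::MAX (2^31 - 128)
    if x.is_nan() {
        0
    } else if x < LOWER {
        i32::MIN
    } else if x > UPPER {
        i32::MAX
    } else {
        x as i32
    }
}
```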

I'm not entirely sure how to implement the "mod" semantics in Rust...

To me, though, it seems clear that we'll need a slew of f32::as_u32_unchecked() methods and such for those who need the performance.

eddyb commented 7 years ago

it seems clear that we'll need a slew of f32::as_u32_unchecked() methods and such for those who need the performance.

That's a bummer - or do you mean a safe but implementation-defined variant?

Manishearth commented 7 years ago

Is there no option for an implementation defined fast default?

alexcrichton commented 7 years ago

@eddyb I was thinking that we'd just have unsafe fn as_u32_unchecked(self) -> u32 on f32 and such which is a direct analog to what as is today.

I'm certainly not going to claim that the Rust implementation I wrote is optimal, but my impression from reading this thread was that determinism and safety matter more than speed in this context most of the time. The unsafe escape hatch is for those on the other side of the fence.

eddyb commented 7 years ago

So there's no cheap platform-dependent variant? I want something that's fast, gives an unspecified value when outside of bounds and is safe. I don't want UB for some inputs and I think that's too dangerous for common use, if we can do better.

hanna-kruppe commented 7 years ago

As far as I am aware, on most if not all platforms the canonical way to implement this conversion does something to out-of-range inputs that isn't UB. But LLVM does not seem to have any way to pick that option (whatever it may be) over UB. If we could convince LLVM developers to introduce an intrinsic that yields an "unspecified but not undef/poison" result on out-of-range inputs, we could use that.

But I'd estimate that someone in this thread would have to write a convincing RFC (on the llvm-dev list), get buy-in, and implement it (in backends we care about, and with a fall-back implementation for other targets). Probably easier than convincing llvm-dev to make the existing casts not-UB (because it side-steps questions like "will this make any C and C++ programs slower"), but still not very easy.

sp-1234 commented 6 years ago

Just in case you will be choosing between these:

- Saturation (out-of-range values become IntType::max_value()/min_value())
- Modulo (out-of-range values are treated as if by converting to a bigint first, then truncating)

IMO only saturation would make sense here, because absolute precision of floating point quickly drops as values get large, so at some point the modulo would be something useless like all zeroes.
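
A small demonstration of the "all zeroes" point (the cast to u128 is only used here to inspect the integer value; the modulo semantics themselves are the hypothetical being discussed):

```rust
// Every f32 with magnitude >= 2^55 is a multiple of 2^32: the value is
// mantissa * 2^(exponent - 23) with a 24-bit mantissa, so exponent - 23 >= 32.
// Truncating such a value modulo 2^32 would therefore always give 0.
fn main() {
    let x: f32 = 1.0e20;                   // large, but nowhere near f32::MAX
    let as_int = x as u128;                // in range for u128, well-defined
    assert_eq!(as_int % (1u128 << 32), 0); // the low 32 bits are all zero
}
```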

nikomatsakis commented 6 years ago

I marked this as E-needs-mentor and tagged it with WG-compiler-middle since it seems the impl period might be a great time to investigate this further! My existing notes on the plan of record are pretty sparse though, so it'd be great if someone from @rust-lang/compiler wanted to help elaborate those a bit further!

arielb1 commented 6 years ago

@nikomatsakis

IIRC LLVM is planning to eventually implement freeze, which should allow us to deal with the UB by inserting a freeze.

ghost commented 6 years ago

My results so far: https://gist.github.com/s3bk/4bdfbe2acca30fcf587006ebb4811744

The _array variants run a loop of 1024 values.

_cast: x as i32
_clip: x.min(MAX).max(MIN) as i32
_panic: panics if x is out of bounds
_zero: sets the result to zero if out of bounds

test bench_array_cast       ... bench:       1,840 ns/iter (+/- 37)
test bench_array_cast_clip  ... bench:       2,657 ns/iter (+/- 13)
test bench_array_cast_panic ... bench:       2,397 ns/iter (+/- 20)
test bench_array_cast_zero  ... bench:       2,671 ns/iter (+/- 19)
test bench_cast             ... bench:           2 ns/iter (+/- 0)
test bench_cast_clip        ... bench:           2 ns/iter (+/- 0)
test bench_cast_panic       ... bench:           2 ns/iter (+/- 0)
test bench_cast_zero        ... bench:           2 ns/iter (+/- 0)
sp-1234 commented 6 years ago

Perhaps the results should not be rounded to integer nanoseconds for individual operations. Clearly there must be some difference hiding behind these 2 ns/iter. Or is it really exactly 2 ns for all four variants?

eddyb commented 6 years ago

@sp-1234 I wonder if it's partially optimized out.

ghost commented 6 years ago

@sp-1234 It is too fast to measure. The non-array benchmarks are basically useless. If you force the single-value functions to be functions via #[inline(never)], you get 2ns vs 3ns.

Amanieu commented 6 years ago

@arielb1 I have some reservations regarding freeze. If I understand correctly, a frozen undef can still contain any arbitrary value, it just won't change between uses. In practice, the compiler will probably reuse a register or stack slot.

However this means that we can now read uninitialized memory from safe code. This could lead to secret data being leaked, somewhat like Heartbleed. It's debatable whether this is truly considered UB from the point of view of Rust, but it clearly seems undesirable.

hanna-kruppe commented 6 years ago

I ran @s3bk's benchmark locally. I can confirm the scalar versions are optimized out completely, and the asm for the array variants also looks suspiciously well-optimized: for example, the loops are vectorized, which is nice but makes it hard to extrapolate performance to scalar code.

Unfortunately, spamming black_box doesn't seem to help. I do see the asm doing useful work, but running the benchmark still consistently gives 0ns for the scalar benchmarks (except cast_zero, which shows 1ns). I see that @alexcrichton performed the comparison 100 times in their benchmarks, so I adopted the same hack. I'm now seeing these numbers (source code):

test bench_cast             ... bench:          53 ns/iter (+/- 0)
test bench_cast_clip        ... bench:         164 ns/iter (+/- 1)
test bench_cast_panic       ... bench:         172 ns/iter (+/- 2)
test bench_cast_zero        ... bench:         100 ns/iter (+/- 0)

The array benchmarks vary too much for me to trust them. Well, truth be told, I'm skeptical of the test benchmarking infrastructure anyway, especially after seeing the above numbers compared to the flat 0 ns I got previously. Furthermore, even just 100 iterations of black_box(x); (as a baseline) take 34 ns, which makes it even harder to interpret those numbers reliably.
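
For reference, the shape of that hack (a sketch with an illustrative body, not the exact source linked above):

```rust
#![feature(test)] // the built-in bench harness requires nightly
extern crate test;
use test::{black_box, Bencher};

// Repeat the conversion 100 times per iteration and route values through
// black_box so the optimizer can neither hoist nor delete the cast.
#[bench]
fn bench_cast(b: &mut Bencher) {
    b.iter(|| {
        for _ in 0..100 {
            let x = black_box(42.5_f32); // in-range input
            black_box(x as i32);
        }
    });
}
```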

Two points worth noting:

For the record, I've measured with rustc 1.21.0-nightly (d692a91fa 2017-08-04), -C opt-level=3, on an i7-6700K under light load.


In conclusion: I don't have reliable data so far, and getting more reliable data seems hard. Furthermore, I strongly doubt that any real application spends even 1% of its wall-clock time on this operation. Therefore, I would suggest moving forward by implementing saturating as casts in rustc, behind a -Z flag, and then running some non-artificial benchmarks with and without this flag to determine the impact on realistic applications.

Edit: I would also recommend running such benchmarks on a variety of architectures (e.g., including ARM) and microarchitectures, if at all possible.

simonbyrne commented 6 years ago

I admit I'm not that familiar with Rust, but I think this line is subtly incorrect: std::i32::MAX (2^31 - 1) is not exactly representable as an f32, so std::i32::MAX as f32 will be rounded to the nearest representable value (2^31). If this value is used as the argument x, the result is technically undefined. Replacing the comparison with a strict inequality should fix this case.
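
A small self-contained check of that rounding, written from the description above:

```rust
// i32::MAX (2147483647) is not representable in f32; the nearest candidates
// are 2147483520 (2^31 - 128) and 2147483648 (2^31), and round-to-nearest
// picks 2^31. A bound written as `x <= i32::MAX as f32` therefore also admits
// 2147483648.0, whose conversion to i32 is exactly the problematic case.
fn main() {
    assert_eq!(i32::MAX as f32, 2_147_483_648.0_f32);
}
```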

Manishearth commented 6 years ago

Yeah, we had exactly that problem in Servo before. The final solution was to cast to f64 and then clamp.

There are other solutions, but they're pretty tricky, and Rust doesn't expose nice APIs for dealing with this well.

ghost commented 6 years ago

Using 0x7FFF_FF80i32 as the upper limit and -0x8000_0000i32 as the lower limit should solve this without casting to f64. (edit: use the correct value)

simonbyrne commented 6 years ago

I think you mean 0x7fff_ff80, but simply using a strict inequality would probably make the intention of the code clearer.

ghost commented 6 years ago

As in x < 0x8000_0000u32 as f32? That would probably be a good idea.

ActuallyaDeviloper commented 6 years ago

I think that, of all the suggested deterministic options, clamping is generally the most useful one, because I think it's done often anyway. If the conversion were actually documented to be saturating, manual clamping would become unnecessary.

I am only a little worried about the suggested implementation because it doesn't translate neatly to machine instructions and relies heavily on branching. Branching makes the performance dependent on specific data patterns. In the test cases given above everything looks (comparatively) fast because the same branch is always taken and the processor has good branch prediction data from many previous loop iterations. The real world will probably not look like that. Additionally, branching hurts the compiler's ability to vectorize the code. I disagree with @rkruppe's opinion that the operation shouldn't also be tested in combination with vectorization. Vectorization is important in high-performance code, and being able to vectorize simple casts on common architectures should be a crucial requirement.

For the reasons given above, I played around with an alternative branchless and data flow oriented version of @alexcrichton 's cast with saturation semantics and @simonbyrne 's fix. I implemented it for u16, i16 and i32 since they all have to cover slightly different cases which result in varying performance.

The results:

test i16_bench_array_cast       ... bench:          99 ns/iter (+/- 2)
test i16_bench_array_cast_clip  ... bench:         197 ns/iter (+/- 3)
test i16_bench_array_cast_clip2 ... bench:         113 ns/iter (+/- 3)
test i16_bench_cast             ... bench:          76 ns/iter (+/- 1)
test i16_bench_cast_clip        ... bench:         218 ns/iter (+/- 25)
test i16_bench_cast_clip2       ... bench:         148 ns/iter (+/- 4)
test i16_bench_rng_cast         ... bench:       1,181 ns/iter (+/- 17)
test i16_bench_rng_cast_clip    ... bench:       1,952 ns/iter (+/- 27)
test i16_bench_rng_cast_clip2   ... bench:       1,287 ns/iter (+/- 19)

test i32_bench_array_cast       ... bench:         114 ns/iter (+/- 1)
test i32_bench_array_cast_clip  ... bench:         200 ns/iter (+/- 3)
test i32_bench_array_cast_clip2 ... bench:         128 ns/iter (+/- 3)
test i32_bench_cast             ... bench:          74 ns/iter (+/- 1)
test i32_bench_cast_clip        ... bench:         168 ns/iter (+/- 3)
test i32_bench_cast_clip2       ... bench:         189 ns/iter (+/- 3)
test i32_bench_rng_cast         ... bench:       1,184 ns/iter (+/- 13)
test i32_bench_rng_cast_clip    ... bench:       2,398 ns/iter (+/- 41)
test i32_bench_rng_cast_clip2   ... bench:       1,349 ns/iter (+/- 19)

test u16_bench_array_cast       ... bench:          99 ns/iter (+/- 1)
test u16_bench_array_cast_clip  ... bench:         136 ns/iter (+/- 3)
test u16_bench_array_cast_clip2 ... bench:         105 ns/iter (+/- 3)
test u16_bench_cast             ... bench:          76 ns/iter (+/- 2)
test u16_bench_cast_clip        ... bench:         184 ns/iter (+/- 7)
test u16_bench_cast_clip2       ... bench:         110 ns/iter (+/- 0)
test u16_bench_rng_cast         ... bench:       1,178 ns/iter (+/- 22)
test u16_bench_rng_cast_clip    ... bench:       1,336 ns/iter (+/- 26)
test u16_bench_rng_cast_clip2   ... bench:       1,207 ns/iter (+/- 21)

The test was run on an Intel Haswell i5-4570 CPU and Rust 1.22.0-nightly. clip2 is the new branchless implementation. It agrees with clip on all 2^32 possible f32 input values.

For the rng benchmarks, random input values are used that hit the different cases often. This uncovers the extreme performance cost (roughly 10 times the normal cost!) that occurs when branch prediction fails. I think it is very important to consider this. It's not the average real-world performance either, but it's still a possible case and some applications will hit it. People will expect an f32 cast to have consistent performance.

Assembly comparison on x86: https://godbolt.org/g/AhdF71 The branchless version maps very nicely to the minss/maxss instructions.

Unfortunately I wasn't able to make godbolt generate ARM assembly from Rust, but here is an ARM comparison of the methods with Clang: https://godbolt.org/g/s7ronw Without being able to test the code, and not knowing much about ARM: the code size seems smaller too, and LLVM mostly generates vmax/vmin, which looks promising. Maybe LLVM could eventually be taught to fold most of the code into a single instruction?

hanna-kruppe commented 6 years ago

@ActuallyaDeviloper The asm and the benchmark results look very good! Furthermore, branchless code like yours is probably easier to generate in rustc than the nested conditionals of other solutions (for the record, I am assuming we want to generate inline IR instead of calling a lang item function). Thank you so much for writing this.

I have a question about u16_cast_clip2: it doesn't seem to handle NaN?! There is a comment talking about NaN, but I believe the function will pass a NaN through unmodified and attempt to cast it to u16 (and even if it didn't, it would produce one of the boundary values rather than 0).

PS: To be clear, I was not trying to imply that it's unimportant whether the cast can be vectorized. It clearly is important if the surrounding code is otherwise vectorizable. But scalar performance is also important, as vectorization is often not applicable, and the benchmarks I was commenting on were not making any statement about scalar performance. Out of interest, have you checked the asm of the *array* benchmarks to see if they're still vectorized with your implementation?

ActuallyaDeviloper commented 6 years ago

@rkruppe You are right, I accidentally swapped the sides of the if and forgot about that. f32 as u16 happened to do the right thing by truncating the upper 0x8000 away, so the tests didn't catch it either. I fixed the problem now by swapping the branches again and testing all methods with if y.is_nan() { panic!("NaN"); } this time.

I updated my previous post. The x86 code did not change significantly, but unfortunately the change stops LLVM from generating vmax in the u16 ARM case for some reason. I assume this has to do with some detail of that ARM instruction's NaN handling, or maybe it's an LLVM limitation.

For why it works, notice that the lower boundary value is actually 0 for unsigned values. So NaN and the lower bound can be caught at the same time.
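
A minimal sketch of that shape for f32 to u16 (illustrative, not the exact benchmarked code):

```rust
// The lower clamp bound is 0.0, and since every comparison with NaN is false,
// NaN takes the same path as "below the lower bound" and also ends up as 0,
// so no separate NaN check is needed. The conditionals typically lower to
// selects / min-max instructions rather than branches.
fn f32_to_u16_clip2(x: f32) -> u16 {
    const UPPER: f32 = 65_535.0; // u16::MAX, exactly representable in f32
    let low = if x > 0.0 { x } else { 0.0 };             // NaN -> 0.0 as well
    let clamped = if low < UPPER { low } else { UPPER }; // upper saturation
    clamped as u16 // now in [0, 65535], so the cast is well-defined
}
```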

The array versions are vectorized. Godbolt: https://godbolt.org/g/HnmsSV

hanna-kruppe commented 6 years ago

Re: the ARM asm, I believe the reason vmax is no longer used is that it returns NaN if either operand is NaN. The code is still branchless, though; it just uses predicated moves (vmovgt, referring to the result of the earlier vcmp with 0).

For why it works, notice that the lower boundary value is actually 0 for unsigned values. So NaN and the lower bound can be caught at the same time.

Ohhh, right. Nice.

hanna-kruppe commented 6 years ago

I would suggest to move forward by implementing saturating as casts in rustc, behind a -Z flag

I have implemented this and will file a PR once I've also fixed #41799 and written a lot more tests.

hanna-kruppe commented 6 years ago

#45134 has pointed out a code path that I missed (generation of LLVM constant expressions – this is separate from rustc's own constant evaluation). I'll roll a fix for that into the same PR, but it will take a little while longer.

eddyb commented 6 years ago

@rkruppe You should coordinate with @oli-obk so that miri ends up with the same changes.

hanna-kruppe commented 6 years ago

Pull request is up: #45205

hanna-kruppe commented 6 years ago

#45205 has been merged, so anyone can now (well, starting with the next nightly) measure the performance impact of saturation by passing -Z saturating-float-casts via RUSTFLAGS. [1] Such measurements would be very valuable for deciding how to proceed with this issue.

[1] Strictly speaking, this won't affect the non-generic, non-#[inline] portions of the standard library, so to be 100% accurate you'd want to locally build std with Xargo. However, I don't expect that there will be a lot of code affected by this (the various conversion trait impls are #[inline], for example).

Gankra commented 6 years ago

@rkruppe I suggest starting an internals/users page to collect data, in the same vein as https://internals.rust-lang.org/t/help-us-benchmark-incremental-compilation/6153/ (we can then also link people to that, rather than some random comments in our issue tracker)

est31 commented 6 years ago

@rkruppe you should create a tracking issue. This discussion is split up into two issues already. That's not good!

hanna-kruppe commented 6 years ago

@Gankro Yeah I agree, but it may be a few days before I find the time to write that post properly, so I figured I'd solicit feedback from the people subscribed to this issue in the mean time.

@est31 Hmm. Although the -Z flag covers both cast directions (which may have been a mistake, in retrospect), it seems unlikely that we'll flip the switch on both at the same time, and there's little overlap between the two in terms of what must be discussed (e.g., this issue hinges on the performance of saturation, while in #41799 it's agreed upon what the right solution is). It is a bit silly that benchmarks primarily targeted at this issue would also measure the impact of the fix to #41799, but that can at most lead to overreporting of performance regressions, so I'm sort of okay with that. (But if anyone is motivated to split the -Z flag into two, go ahead.)

I've considered a tracking issue for the task of removing the flag once it has outlived its usefulness, but I don't see the need to merge the discussions occurring here and in #41799.

Gankra commented 6 years ago

I have drafted up an internals post: https://gist.github.com/Gankro/feab9fb0c42881984caf93c7ad494ebd

Feel free to copy that, or just give me notes so I can post it. (note I'm a bit confused about the const fn behaviour)