Closed thestinger closed 4 years ago
@CryZe
Wouldn't it make sense to make it behave like overflow? So in a debug build it would panic if you do a cast with undefined behaviour.
This is what I thought initially, but I was reminded that `as` conversions never panic at present (we don't do overflow checking with `as`, for better or worse). So the most analogous thing is for it to "do something defined".
FWIW, the "kill undef" thing would, in fact, provide a way to fix the memory unsafety, but leaving the result nondeterministic. One of the key components is:
3) Create a new instruction, '%y = freeze %x', that stops propagation of poison. If the input is poison, then it returns an arbitrary, but fixed, value. (like old undef, but each use gets the same value), otherwise it just returns its input value.
The reason undefs can be used to violate memory safety today is that they can magically change values between uses: in particular, between a bounds check and subsequent pointer arithmetic. If rustc added a freeze after every dangerous cast, you would just get an unknown but otherwise well-behaved value. Performance-wise, freeze is basically free here, since of course the machine instruction corresponding to the cast produces a single value, not a fluctuating one; even if the optimizer feels like duplicating the cast instruction for some reason, it should be safe to do so because the result for out-of-range inputs is usually deterministic on a given architecture.
...But not deterministic across architectures, if anyone was wondering. x86 returns 0x80000000 for all bad inputs; ARM saturates for out-of-range inputs and (if I'm reading this pseudocode right) returns 0 for NaN. So if the goal is to produce a deterministic and platform-independent result, it's not enough to just use the platform's fp-to-int intrinsic; at least on ARM you also need to check the status register for an exception. This may have some overhead in itself, and certainly prevents autovectorization in the unlikely event that using the intrinsic didn't already. Alternately, I guess you could explicitly test for in-range values using regular compare operations and then use a regular float-to-int. That sounds a lot nicer on the optimizer…
`as` conversions never panic at present
At some point we changed `+` to panic (in debug mode). I wouldn't be shocked to see `as` panic in cases that were previously UB.
If we care about checking (which we should), then we should either deprecate `as` (is there any use case where it's the only good option?) or at least advise against using it, and move people onto things like `TryFrom` and `TryInto` instead, which is what we said we were planning to do back when it was decided to leave `as` as-is. I don't feel that the cases under discussion are qualitatively different, in the abstract, from the cases where `as` is already defined not to do any checks. The difference is just that in practice the implementation for these cases is currently incomplete and has UB. A world where you can't rely on `as` doing checks (because for most types, it doesn't), and you can't rely on it not panicking (because for some types, it would), and it's not consistent, and we still haven't deprecated it, seems like the worst of all worlds to me.
So, I think at this point @jorendorff basically enumerated what seems to me to be the best plan: `as` will have some deterministic behavior. He enumerated three possibilities. I think the remaining work is to investigate those possibilities -- or at least one of them. That is, actually implement it, and try to get some feeling for how "slow" or "fast" it is.
Is there anyone out there who feels motivated to take a stab at that? I'm going to tag this as E-help-wanted in hopes of attracting someone. (@oli-obk?)
Uh, I'd rather not pay a price for cross-platform consistency :/ It's garbage in, and I don't care what garbage comes out (however, a debug assertion would be super helpful).
Currently all rounding/truncating functions in Rust are very slow (function calls with painstakingly precise implementations), so `as` is my last-resort hack for fast float rounding.
If you're going to make `as` anything more than a bare `cvttss2si`, please also add a stable alternative that is just that.
@pornel this isn't just UB of the theoretical kind where stuff is okay if you ignore that it is UB; it has real-world implications. I've extracted #41799 from a real-world code example.
@est31 I agree that leaving it as UB is wrong, but I've seen `freeze` proposed as a solution to the UB. AFAIK that makes it a defined, deterministic value; you just don't get to say which. That behavior is fine with me.
So I'd be fine if e.g. `u128::MAX as f32` deterministically produced `17.5` on x86, `999.0` on x86-64, and `-555` on ARM.
`freeze` would not produce a defined, deterministic, unspecified value. Its result is still "any bit pattern the compiler likes", and it is consistent only across uses of the same operation. This may sidestep the UB-producing examples people have collected above, but it would not give this:
u128::MAX as f32 deterministically produced 17.5 on x86, and 999.0 on x86-64, and -555 on ARM.
For example, if LLVM notices that `u128::MAX as f32` overflows and replaces it with `freeze poison`, a valid lowering of `fn foo() -> f32 { u128::MAX as f32 }` on x86_64 might be this:
foo:
ret
(that is, just return whatever was last stored in the return register)
I see. That's still acceptable for my uses (for cases where I expect out of range values, I do clamping beforehand. Where I expect values in range, but they aren't, then I'm not going to get a correct result no matter what).
I have no problem with out of range float casts returning arbitrary values as long as the values are frozen so they can't cause further undefined behavior.
Is something like `freeze` available in LLVM? I thought that was a purely theoretical construct.
@nikomatsakis I've never seen it used like that (unlike `poison`) - it's a planned revamp of poison/undef.
`freeze` does not exist at all in LLVM today. It's only been proposed (this PLDI paper is a self-contained version, but it's also been discussed a lot on the mailing list). The proposal seems to have considerable buy-in, but of course that is no guarantee it will be adopted, much less adopted in a timely manner. (Removing the pointee types from pointer types has been accepted for years and it's still not done.)
Do we want to open an RFC to get more widespread discussion on the changes being proposed here? IMO, anything that potentially impacts the performance of `as` is going to be contentious, but it will be doubly contentious if we don't give people the chance to make their voices heard.
I'm a Julia developer, and I've been following this issue for a while, as we share the same LLVM backend and so have similar issues. In case it's of interest, here is what we have settled on (with approximate timings for a single function on my machine):
- `unsafe_trunc(Int64, x)` maps directly to the corresponding LLVM intrinsic `fptosi` (1.5 ns)
- `trunc(Int64, x)` throws an exception for out-of-range values (3 ns)
- `convert(Int64, x)` throws an exception for out-of-range or non-integer values (6 ns)

Also, I have asked on the mailing list about making the undefined behaviour a little more defined, but did not receive a very promising response.
@bstrie I'm fine with an RFC, but I think it'd definitely be useful to have data! @simonbyrne's comment is super helpful in that regard, however.
I've toyed around with JS semantics (the modulo @jorendorff mentioned) and the Java semantics which appear to be the "saturation" column. In case those links expire it's JS and Java.
I also whipped up a quick implementation of saturation in Rust, which I think (?) is correct. I got some benchmark numbers as well. Interestingly, I'm seeing that the saturating implementation is 2-3x slower than the intrinsic, which differs from what @simonbyrne found (only 2x slower).
I'm not entirely sure how to implement the "mod" semantics in Rust...
To me, though, it seems clear that we'll need a slew of `f32::as_u32_unchecked()` methods and such for those who need the performance.
it seems clear that we'll need a slew of `f32::as_u32_unchecked()` methods and such for those who need the performance.
That's a bummer - or do you mean a safe but implementation-defined variant?
Is there no option for an implementation defined fast default?
@eddyb I was thinking that we'd just have `unsafe fn as_u32_unchecked(self) -> u32` on `f32` and such, which is a direct analog to what `as` is today.
I'm certainly not going to claim that the Rust implementation I wrote is optimal, but my impression from reading this thread was that determinism and safety are more important than speed in this context most of the time. The `unsafe` escape hatch is for those on the other side of the fence.
So there's no cheap platform-dependent variant? I want something that's fast, gives an unspecified value when outside of bounds and is safe. I don't want UB for some inputs and I think that's too dangerous for common use, if we can do better.
As far as I am aware, on most if not all platforms the canonical way to implement this conversion does something to out-of-range inputs that isn't UB. But LLVM does not seem to have any way to pick that option (whatever it may be) over UB. If we could convince LLVM developers to introduce an intrinsic that yields an "unspecified but not `undef`/`poison`" result on out-of-range inputs, we could use that.
But I'd estimate that someone in this thread would have to write a convincing RFC (on the llvm-dev list), get buy-in, and implement it (in backends we care about, and with a fall-back implementation for other targets). Probably easier than convincing llvm-dev to make the existing casts not-UB (because it side-steps questions like "will this make any C and C++ programs slower"), but still not very easy.
Just in case you will be choosing between these:
- Saturation (out-of-range values become `IntType::max_value()`/`min_value()`)
- Modulo (out-of-range values are treated as if by converting to a bigint first, then truncating)

IMO only saturation would make sense here, because the absolute precision of floating point drops quickly as values get large, so at some point the modulo result would be something useless like all zeroes.
I marked this as E-needs-mentor and tagged it with WG-compiler-middle, since it seems the impl period might be a great time to investigate this further! My existing notes on the plan of record are pretty sparse, though, so it'd be great if someone from @rust-lang/compiler wanted to help elaborate on those a bit further!
@nikomatsakis IIRC LLVM is planning to eventually implement `freeze`, which should allow us to deal with the UB by doing a `freeze`.
My results so far: https://gist.github.com/s3bk/4bdfbe2acca30fcf587006ebb4811744
The _array variants run a loop of 1024 values.
_cast: x as i32
_clip: x.min(MAX).max(MIN) as i32
_panic: panics if x is out of bounds
_zero: sets the result to zero if out of bounds
test bench_array_cast ... bench: 1,840 ns/iter (+/- 37)
test bench_array_cast_clip ... bench: 2,657 ns/iter (+/- 13)
test bench_array_cast_panic ... bench: 2,397 ns/iter (+/- 20)
test bench_array_cast_zero ... bench: 2,671 ns/iter (+/- 19)
test bench_cast ... bench: 2 ns/iter (+/- 0)
test bench_cast_clip ... bench: 2 ns/iter (+/- 0)
test bench_cast_panic ... bench: 2 ns/iter (+/- 0)
test bench_cast_zero ... bench: 2 ns/iter (+/- 0)
Perhaps the results for the individual operations shouldn't be rounded to whole nanoseconds. Clearly there must be some difference hiding behind these 2 ns/iter. Or is it really like this, exactly 2 ns for all 4 variants?
@sp-1234 I wonder if it's partially optimized out.
@sp-1234 It is too fast to measure. The non-array benchmarks are basically useless.
If you force the single-value functions to remain actual function calls via `#[inline(never)]`, you get 2ns vs 3ns.
@arielb1 I have some reservations regarding `freeze`. If I understand correctly, a frozen `undef` can still contain any arbitrary value; it just won't change between uses. In practice, the compiler will probably reuse a register or stack slot.
However, this means that we can now read uninitialized memory from safe code. This could lead to secret data being leaked, somewhat like Heartbleed. It's debatable whether this is truly considered UB from the point of view of Rust, but it clearly seems undesirable.
I ran @s3bk's benchmark locally. I can confirm the scalar versions are optimized out completely, and the asm for the array variants also looks suspiciously well-optimized: for example, the loops are vectorized, which is nice but makes it hard to extrapolate performance to scalar code.
Unfortunately, spamming `black_box` doesn't seem to help. I do see the asm doing useful work, but running the benchmark still consistently gives 0ns for the scalar benchmarks (except `cast_zero`, which shows 1ns). I see that @alexcrichton performed the comparison 100 times in their benchmarks, so I adopted the same hack. I'm now seeing these numbers (source code):
test bench_cast ... bench: 53 ns/iter (+/- 0)
test bench_cast_clip ... bench: 164 ns/iter (+/- 1)
test bench_cast_panic ... bench: 172 ns/iter (+/- 2)
test bench_cast_zero ... bench: 100 ns/iter (+/- 0)
The array benchmarks vary too much for me to trust them. Well, truth be told, I'm skeptical of the `test` benchmarking infrastructure anyway, especially after seeing the above numbers compared to the flat 0ns I got previously. Furthermore, even just 100 iterations of `black_box(x);` (as a baseline) takes 34ns, which makes it even harder to reliably interpret those numbers.
Two points worth noting:
- The `cast_clip` implementation appears to be slower than @alexcrichton's saturating cast (note that their run and mine have roughly the same timing for `as` casts, 53-54ns).
- `cast_panic` is slower than the other checked casts, and I see an even greater slowdown on the array benchmarks. Maybe these things are just highly dependent on microarchitectural details and/or optimizer mood?

For the record, I've measured with rustc 1.21.0-nightly (d692a91fa 2017-08-04), `-C opt-level=3`, on an i7-6700K under light load.
In conclusion: I don't have reliable data so far, and getting more reliable data seems hard. Furthermore, I strongly doubt that any real application spends even 1% of its wall-clock time on this operation. Therefore, I would suggest moving forward by implementing saturating `as` casts in rustc, behind a `-Z` flag, and then running some non-artificial benchmarks with and without this flag to determine the impact on realistic applications.
Edit: I would also recommend running such benchmarks on a variety of architectures (e.g., including ARM) and microarchitectures, if at all possible.
I admit I'm not that familiar with Rust, but I think this line is subtly incorrect: `std::i32::MAX` (2^31-1) is not exactly representable as a Float32, so `std::i32::MAX as f32` will be rounded to the nearest representable value (2^31). If this value is used as the argument `x`, the result is technically undefined. Replacing the comparison with a strict inequality should fix this case.
Yeah, we had exactly that problem in Servo before. The final solution was to cast to f64 and then clamp.
There are other solutions, but they're pretty tricky, and Rust doesn't expose nice APIs for dealing with this well.
Using `0x7FFF_FF80i32` as the upper limit and `-0x8000_0000i32` as the lower limit should solve this without casting to f64. (edit: use the correct value)
I think you mean `0x7fff_ff80`, but simply using a strict inequality would probably make the intention of the code clearer.
As in `x < 0x8000_0000u32 as f32`? That would probably be a good idea.
I think that of all the suggested deterministic options, clamping is generally the most useful one, because I think it's often done manually anyway. If the conversion were actually documented to be saturating, manual clamping would become unnecessary.
I am only a little worried about the suggested implementation because it doesn't translate cleanly to machine instructions and relies heavily on branching. Branching makes the performance dependent on specific data patterns. In the test cases given above, everything looks (comparatively) fast because the same branch is always taken and the processor has good branch-prediction data from many previous loop iterations. The real world will probably not look like that. Additionally, branching hurts the compiler's ability to vectorize the code. I disagree with @rkruppe's opinion that the operation needn't also be tested in combination with vectorization. Vectorization is important in high-performance code, and being able to vectorize simple casts on common architectures should be a crucial requirement.
For the reasons given above, I played around with an alternative branchless, data-flow-oriented version of @alexcrichton's cast with saturation semantics and @simonbyrne's fix. I implemented it for u16, i16, and i32, since they all have to cover slightly different cases, which results in varying performance.
The results:
test i16_bench_array_cast ... bench: 99 ns/iter (+/- 2)
test i16_bench_array_cast_clip ... bench: 197 ns/iter (+/- 3)
test i16_bench_array_cast_clip2 ... bench: 113 ns/iter (+/- 3)
test i16_bench_cast ... bench: 76 ns/iter (+/- 1)
test i16_bench_cast_clip ... bench: 218 ns/iter (+/- 25)
test i16_bench_cast_clip2 ... bench: 148 ns/iter (+/- 4)
test i16_bench_rng_cast ... bench: 1,181 ns/iter (+/- 17)
test i16_bench_rng_cast_clip ... bench: 1,952 ns/iter (+/- 27)
test i16_bench_rng_cast_clip2 ... bench: 1,287 ns/iter (+/- 19)
test i32_bench_array_cast ... bench: 114 ns/iter (+/- 1)
test i32_bench_array_cast_clip ... bench: 200 ns/iter (+/- 3)
test i32_bench_array_cast_clip2 ... bench: 128 ns/iter (+/- 3)
test i32_bench_cast ... bench: 74 ns/iter (+/- 1)
test i32_bench_cast_clip ... bench: 168 ns/iter (+/- 3)
test i32_bench_cast_clip2 ... bench: 189 ns/iter (+/- 3)
test i32_bench_rng_cast ... bench: 1,184 ns/iter (+/- 13)
test i32_bench_rng_cast_clip ... bench: 2,398 ns/iter (+/- 41)
test i32_bench_rng_cast_clip2 ... bench: 1,349 ns/iter (+/- 19)
test u16_bench_array_cast ... bench: 99 ns/iter (+/- 1)
test u16_bench_array_cast_clip ... bench: 136 ns/iter (+/- 3)
test u16_bench_array_cast_clip2 ... bench: 105 ns/iter (+/- 3)
test u16_bench_cast ... bench: 76 ns/iter (+/- 2)
test u16_bench_cast_clip ... bench: 184 ns/iter (+/- 7)
test u16_bench_cast_clip2 ... bench: 110 ns/iter (+/- 0)
test u16_bench_rng_cast ... bench: 1,178 ns/iter (+/- 22)
test u16_bench_rng_cast_clip ... bench: 1,336 ns/iter (+/- 26)
test u16_bench_rng_cast_clip2 ... bench: 1,207 ns/iter (+/- 21)
The test was run on an Intel Haswell i5-4570 CPU and Rust 1.22.0-nightly.
`clip2` is the new branchless implementation. It agrees with `clip` on all 2^32 possible f32 input values.
For the `rng` benchmarks, random input values are used that hit the different cases often. This uncovers the extreme performance cost (roughly 10 times the normal cost!) that occurs when branch prediction fails. I think it is very important to consider this. It's not the average real-world performance either, but it's still a possible case, and some applications will hit it. People will expect an f32 cast to have consistent performance.
Assembly comparison on x86: https://godbolt.org/g/AhdF71 The branchless version maps very nicely to the minss/maxss instructions.
Unfortunately, I wasn't able to make godbolt generate ARM assembly from Rust, but here is an ARM comparison of the methods with Clang: https://godbolt.org/g/s7ronw Without being able to test the code or knowing much about ARM: the code size seems smaller too, and LLVM mostly generates vmax/vmin, which looks promising. Maybe LLVM could eventually be taught to fold most of the code into a single instruction?
@ActuallyaDeviloper The asm and the benchmark results look very good! Furthermore, branchless code like yours is probably easier to generate in rustc than the nested conditionals of other solutions (for the record, I am assuming we want to generate inline IR instead of calling a lang item function). Thank you so much for writing this.
I have a question about `u16_cast_clip2`: it doesn't seem to handle NaN?! There is a comment talking about NaN, but I believe the function will pass NaN through unmodified and attempt to cast it to `f32` (and even if it didn't, it would produce one of the boundary values rather than 0).
PS: To be clear, I was not trying to imply that it's unimportant whether the cast can be vectorized. It clearly is important if the surrounding code is otherwise vectorizable. But scalar performance is also important, as vectorization is often not applicable, and the benchmarks I was commenting on were not making any statement about scalar performance. Out of interest, have you checked the asm of the `*array*` benchmarks to see if they're still vectorized with your implementation?
@rkruppe You are right, I accidentally swapped the sides of the `if` and forgot about that. `f32 as u16` happened to do the right thing by truncating the upper 0x8000 away, so the tests didn't catch it either. I fixed the problem now by swapping the branches again, and this time I tested all methods with `if (y.is_nan()) { panic!("NaN"); }`.
I updated my previous post. The x86 code did not change significantly at all, but unfortunately the change stops LLVM from generating `vmax` in the `u16` ARM case for some reason. I assume this has to do with some details of that ARM instruction's NaN handling, or maybe it's an LLVM limitation.
For why it works: notice that the lower boundary value is actually 0 for unsigned values, so NaN and the lower bound can be caught at the same time.
The array versions are vectorized. Godbolt: https://godbolt.org/g/HnmsSV
Re: the ARM asm, I believe the reason `vmax` is not used any more is that it returns NaN if either operand is NaN. The code is still branchless, though; it just uses predicated moves (`vmovgt`, referring to the result of the earlier `vcmp` with 0).
For why it works: notice that the lower boundary value is actually 0 for unsigned values, so NaN and the lower bound can be caught at the same time.
Ohhh, right. Nice.
I would suggest to move forward by implementing saturating as casts in rustc, behind a -Z flag
I have implemented this and will file a PR once I also fixed #41799 and have a lot more tests.
@rkruppe You should coordinate with @oli-obk so that miri ends up with the same changes.
Pull request is up: #45205
`-Z saturating-float-casts` via `RUSTFLAGS`. [1] Such measurements would be very valuable for deciding how to proceed with this issue.
[1] Strictly speaking, this won't affect the non-generic, non-`#[inline]` portions of the standard library, so to be 100% accurate you'd want to locally build std with Xargo. However, I don't expect that there will be a lot of code affected by this (the various conversion trait impls are `#[inline]`, for example).
@rkruppe I suggest starting an internals/users page to collect data, in the same vein as https://internals.rust-lang.org/t/help-us-benchmark-incremental-compilation/6153/ (we can then also link people to that, rather than some random comments in our issue tracker)
@rkruppe you should create a tracking issue. This discussion is split up into two issues already. That's not good!
@Gankro Yeah I agree, but it may be a few days before I find the time to write that post properly, so I figured I'd solicit feedback from the people subscribed to this issue in the mean time.
@est31 Hmm. Although the -Z flag covers both cast directions (which may have been a mistake, in retrospect), it seems unlikely that we'll flip the switch on both at the same time, and there's little overlap between the two in terms of what must be discussed (e.g., this issue hinges on the performance of saturation, while in #41799 it's agreed what the right solution is). It is a bit silly that benchmarks primarily targeted at this issue would also measure the impact of the fix to #41799, but that can at most lead to over-reporting of performance regressions, so I'm sort of okay with that. (But if anyone is motivated to split the -Z flag into two, go ahead.)
I've considered a tracking issue for the task of removing the flag once it has outlived its usefulness, but I don't see the need to merge the discussions occurring here and in #41799.
I have drafted an internals post: https://gist.github.com/Gankro/feab9fb0c42881984caf93c7ad494ebd
Feel free to copy that, or just give me notes so I can post it. (Note: I'm a bit confused about the `const fn` behaviour.)
Status as of 2020-04-18

We intend to stabilize the saturating-float-casts behavior for `as`, and have stabilized unsafe library functions that handle the previous behavior. See #71269 for the latest discussion on that stabilization process.

Status as of 2018-11-05

A flag has been implemented in the compiler, `-Zsaturating-float-casts`, which causes all float-to-integer casts to have "saturating" behavior: if the value is out of bounds, it is clamped to the nearest bound. A call for benchmarking of this change went out a while ago. The results, while positive in many projects, are quite negative for some projects, and indicate that we're not done here. The next steps are figuring out how to recover performance for these cases:
- Keep the old `as` cast behavior (which is UB in some cases) and add `unsafe` functions for the relevant types and such.
- Use LLVM's proposed `freeze` concept, which means that we get a garbage bit pattern, but it's at least not UB.

Old status
UPDATE (by @nikomatsakis): After much discussion, we've got the rudiments of a plan for how to address this problem. But we need some help with actually investigating the performance impact and working out the final details!
ORIGINAL ISSUE FOLLOWS: