I believe the main implementation work remaining is to enhance the documentation of hint::black_box
as well as to rename it.
An idea I wish I had earlier (but we can still try it out):
1. keep the no-op fallback implementation (e.g. for wasm/miri support)
2. warn when that fallback is the one being codegen'd outside a --test crate
Note that because this is a warning, and not an error, it doesn't add to the dreaded post-monomorphization errors. We could also just do 2. but not 1., if we need to.
#[inline]
and/or generics could then be used to delay the codegen until the --test
crate, so it can do its intended purpose.
Would be interesting to know what the benchmarking frameworks out there use, in terms of test runners, and if they end up in a rustc --test
crate (i.e. via cargo test
) or just a plain executable.
The term "fence" may also be appropriate, as it doesn't prevent optimisation of the value inside it, but rather, doesn't allow optimisations to cross the barrier.
For example, vec.push(bench_black_box(convoluted_way_to_compute_zero()))
could still be optimised to vec.push(bench_black_box(0))
, but vec.push
couldn't use the knowledge that the value is zero to optimise itself.
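To make the fence behaviour concrete, here is a small runnable sketch (using std::hint::black_box, the name the function was eventually stabilised under; the thread below discusses it as bench_black_box):

```rust
use std::hint::black_box;

// A computation the optimizer can easily fold down to the constant 0.
fn convoluted_way_to_compute_zero() -> i32 {
    (0..10).map(|i| i * 0).sum()
}

fn main() {
    let mut vec = Vec::new();
    // The hint acts like a fence: the argument itself may still be
    // simplified to 0, but vec.push cannot use knowledge of the value
    // behind the hint to optimise itself.
    vec.push(black_box(convoluted_way_to_compute_zero()));
    assert_eq!(vec[0], 0);
}
```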
Another argument in favor of black_box
: not only is it a common term in computer programming, but already used a lot in all kinds of documentation/resources about Rust. For example, benchmarking libraries such as Criterion use black_box
and plenty of StackOverflow questions and answers use black_box
. Also, I doubt that anyone would really think of Box
when reading black_box
, and even if, this confusion should be resolved quickly and I can't see how it would lead to bugs. I don't really see a good reason to change an established name for a questionable benefit.
I don't want to retreat ground that has been covered to exhaustion and beyond in the RFC discussion, but since all discussion here including the issue text is not even mentioning the key point that was blocking the black_box
name it seems necessary to quote it again:
I'm strongly opposed to adding a function to the language called black_box that doesn't guarantee the expected semantics of a black_box function in other languages and systems. There are a number of important unsafe operations that rely on the ability to create a "black box" from the compiler to show that e.g. a pointer escapes or some other operation is not elided.
The semantics this concern was in reference to is (roughly) what's in the accepted RFC: this function is not actually guaranteed to be a "black box" to anyone, least of all the optimizer -- more explicitly: it is allowed to be implemented as completely transparent identity function, and behaves as such for purposes of e.g. whether a program is UB. Assuming nobody wants to re-litigate these semantics (doesn't seem like it so far, thank gods), anyone who wants the black_box
name to be accepted has to convincingly argue that the aforementioned concern is actually a non-issue for some reason. You're invited to do that, but be aware that quibbling over e.g. the potential for confusion with Box
isn't going to help with that.
By the way, the same concern probably applies to most names mentioned as alternatives so far, as they all stress the "blocks optimizations / blinds the compiler" angle without simultaneously highlighting that it is only suitable for benchmarks, where surprisingly aggressive optimizations only lead to confusing results rather than UB or other serious wrongness.
const fn: it is unclear whether bench_black_box should be a const fn. If it were, that would hint that it cannot have any side-effects, or that it cannot do anything that const fns cannot do.
We can implement this using the approach that kind of got consensus in https://github.com/rust-lang/rust/pull/64683:
fn bench_black_box_rt<T>(x: T) -> T { unsafe { asm!("...."); } x }
const fn bench_black_box_ct<T>(x: T) -> T { x }
pub const fn bench_black_box<T>(x: T) -> T {
    const_eval_select((x,), bench_black_box_ct, bench_black_box_rt)
}
cc @Ralf @oli-obk
(Hello. I ended up reading this issue because I tried to solve the problem black_box
is aimed at in a different way, and found some to-me-surprising compiler behaviour - #74476. @the8472 helpfully referred me to black_box
.)
Assuming nobody wants to re-litigate these semantics (doesn't seem like it so far, thank gods),
Sorry to pick up on this. I definitely do not want to re-litigate a decision which has already been made. However:
I skimread a lot of these discussions including https://github.com/rust-lang/rust-memory-model/issues/45 . Much of the compiler internals went over my head; and as a result I wasn't able to see why this had been decided this way. Of course I looked in the RFC which ought to mention this under Alternatives, but it doesn't seem to discuss it at all.
It seems to me that regardless of whether the semantics question is regarded as settled, it needs to be addressed in the RFC text. Especially if it was controversial! I.e. the Alternatives should have a section explaining why the proposal does not provide a useful semantic guarantee. The best description of the most desirable guarantee seems to me to be in https://github.com/rust-lang/rfcs/issues/1484#issuecomment-182392175 or https://github.com/rust-lang/rust-memory-model/issues/45#issuecomment-374369870. Therefore it would probably be best for the RFC to quote one of those, or to paraphrase them, acknowledge that that would be the most useful thing to provide, and then explain why the RFC proposes something different.
Since the existing RFC was merged, maybe this would need to be done as part of an attempt at stabilisation.
With the current very weak semantics, I strongly agree with @hanna-kruppe's concerns over the misleading name.
I also agree with @hanna-kruppe that black_box
is probably a misleading name for a function with the semantics in the RFC. To try and move discussion forward, here are some alternative suggestions that focus specifically on what the function does, rather than what it's "used for" or what it might do:
assume_used
mock_use
use_and
use_then
emulate_use
mark_used
I particularly like std::hint::assume_used
. For the RFC example, this (I think) reads quite well (and accurate):
fn push_cap(v: &mut Vec<i32>) {
for i in 0..4 {
assume_used(v.as_ptr());
v.push(bench_black_box(i));
assume_used(v.as_ptr());
}
}
It also works even when used as an identity function:
let x = assume_used(x);
is still fairly clear, and evidently doesn't change the value in any way.
I went ahead and created a PR that renames the method to assume_used
in https://github.com/rust-lang/rust/pull/74853 to encourage further directed discussion.
@jonhoo I wonder how it would go along with other assume-like functions in
the standard library. There is an assume
intrinsic and a family of assume_init
functions; for those, it is the responsibility of the caller to guarantee the thing
being assumed. In the proposed assume_used
, that would be the role of the callee.
@tmiasko I replied to your comment over in https://github.com/rust-lang/rust/pull/74853#issuecomment-665019065. Figured it'd be good to keep all the discussion about this particular name suggestion to one PR.
I was seeing some weird effects from using black_box
in conjunction with counting instructions, and then I remembered that black_box
's current incarnation has to place the value on the stack.
So I tried making my own, and ended up with this, which is sadly limited to values that fit in a register (godbolt):
pub fn better_black_box(mut x: u64) -> u64 {
unsafe { asm!("/* {x} */", x = inout(reg) x); }
x
}
From my brief testing, it seems to result in 3 fewer instructions than std::hint::black_box
, at this time, as counted by the CPU (what perf stat
calls instructions:u
, although I'm using rdpmc
directly to get my data).
Might be interesting to see if we could specialize the version in libstd (based on the type) so that it's cheaper, for at least integers, but maybe there's not really any point to that. cc @Amanieu
You don't even need specialization, you can just check size_of::<T>()
to see if it fits in a usize
and then transmute_copy
it. (godbolt)
You need to avoid the case when T
may have destructors, as transmute_copy
will unsafely duplicate the value, but other than that, seems like it could work great, thanks! (we could presumably also have cases for types smaller than usize
, or types the size of two usize
s, like &[T]
, as long as we don't introduce extra work beyond forcing the values into registers)
Adding options(nostack)
removes the push
and pop
instructions as well:
asm!("/* {x} */", x = inout(reg) int_x, options(nostack));
example::better_black_box:
mov rax, rdi
#APP
#NO_APP
ret
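As a hedged sketch of the "two usizes" case mentioned above (assuming a 64-bit target and the de-facto layout of fat pointers as two words, which is not formally guaranteed; the function name is made up for illustration), both halves can be passed through registers:

```rust
use core::arch::asm;
use core::mem::{size_of, transmute_copy, ManuallyDrop};

// Sketch only: launder a two-word value (e.g. a fat pointer like &[T])
// through two general-purpose registers. ManuallyDrop avoids the
// double-drop that transmute_copy's duplication would otherwise cause.
fn black_box_two_words<T>(x: T) -> T {
    assert_eq!(size_of::<T>(), 2 * size_of::<usize>());
    unsafe {
        let x = ManuallyDrop::new(x);
        let [mut a, mut b]: [usize; 2] = transmute_copy(&x);
        asm!("/* {0} {1} */", inout(reg) a, inout(reg) b, options(nostack));
        transmute_copy(&ManuallyDrop::new([a, b]))
    }
}
```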
My latest suggestion from #74853 was consume_and_produce
. The idea being that it is a function that consumes the given value in some unspecified way, and produces a value in some unspecified way (that happens to be identical to the consumed value).
This hasn't seen any activity in a while. Has an acceptable default implementation for this been found? It looks like the latest "best implementation" would be a combination of the proposals by @eddyb, @Amanieu, and @m-ou-se
I made some modifications and created the following godbolt:
fn better_black_box<T>(mut x: T) -> T {
unsafe {
if mem::size_of::<T>() <= mem::size_of::<usize>() {
let man_x = mem::ManuallyDrop::new(x);
let mut int_x: usize = mem::transmute_copy(&man_x);
asm!("/* {x} */", x = inout(reg) int_x, options(nostack));
x = mem::transmute_copy(&int_x);
} else {
asm!("/* {x} */", x = in(reg) &mut x);
}
}
x
}
It's a small modification from their code that makes it work for anything that'll fit in a usize
(not sure if it's completely sound), and also fixes the destructor issue that @eddyb brought up, I think.
Has any consensus been come to on the name? If not, I'd like to humbly propose my own option: bench_identity
. I think this name works well as it describes the actual behavior of the function (it behaves as an identity) and the purpose of the function (its use in benchmarking).
let mut int_x: usize = mem::transmute_copy(&man_x);
This line is unsound if x
is smaller than usize
, see the documentation for transmute_copy
.
@Amanieu that's what I figured. Would we then need to do a separate branch for each integer size? Is there any way to get it to work for 3,5,6,7-byte types? I tried using a byte-array but you can't pass that directly to asm. Perhaps I could
I'll have to try that out. Would that be sound?
I'd just use the fallback path if it doesn't fit in a usize
exactly.
Well as an exercise I created a version which works for any size up to 8. godbolt
For 3,5,6,7 it only saves 1 instruction, but it saves 5 stack operations. Not sure how much of a difference that makes.
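The godbolt link shows the idea; a hedged reimplementation of the same trick (copying exactly size_of::<T>() bytes into a u64, which avoids the out-of-bounds read that transmute_copy would do for small types; names are illustrative and a 64-bit target is assumed) might look like:

```rust
use core::arch::asm;
use core::mem::{size_of, ManuallyDrop, MaybeUninit};
use core::ptr;

// Sketch only: launder any value of size <= 8 bytes through one register.
// ManuallyDrop ensures the original is never dropped, so reconstructing
// the value from the register leaves exactly one live copy.
fn small_black_box<T>(x: T) -> T {
    assert!(size_of::<T>() <= size_of::<u64>());
    unsafe {
        let x = ManuallyDrop::new(x);
        let mut wide: u64 = 0;
        // Copy only the bytes the value actually occupies.
        ptr::copy_nonoverlapping(
            &x as *const ManuallyDrop<T> as *const u8,
            &mut wide as *mut u64 as *mut u8,
            size_of::<T>(),
        );
        asm!("/* {0} */", inout(reg) wide, options(nostack));
        let mut out = MaybeUninit::<T>::uninit();
        ptr::copy_nonoverlapping(
            &wide as *const u64 as *const u8,
            out.as_mut_ptr() as *mut u8,
            size_of::<T>(),
        );
        out.assume_init()
    }
}
```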
Keep in mind that your code is a very artificial benchmark. In practice, the different components of a struct are likely to be in separate registers. In that case it is usually faster to store everything to memory than to do bit shuffling to fit everything in one register.
Oh absolutely. It would certainly require some more realistic benchmarks to confirm this as a viable optimization. It was just fun to implement as an exercise.
Would there be a problem with making black_box
always take its argument by reference? That way, it could always generate optimal code (https://github.com/rust-lang/rust/issues/64102#issuecomment-673921803) and would have fewer performance pitfalls (for example, passing an array to it currently results in a memcpy
call).
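A minimal sketch of such a by-reference variant (name hypothetical): because an asm block without options(nomem) is assumed to read and write memory, handing it the pointer forces the compiler to treat the pointee as escaped, with no copy of the value itself:

```rust
use core::arch::asm;

// Hypothetical by-reference variant. The compiler must assume the asm
// block may read or write through the pointer, so it can neither
// const-fold the pointee nor eliminate stores to it, and no memcpy of
// the value is needed regardless of its size.
fn black_box_by_ref<T>(x: &mut T) {
    unsafe {
        asm!("/* {0} */", in(reg) x as *mut T, options(nostack));
    }
}
```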
The rust cryptography working group were saying that they could achieve C-level performance with their algorithms if a black box function was stabilized. That seemed to be the MVP. ( https://www.youtube.com/watch?v=uSqXIPWr4-4 )
That doesn't quite sound right... this black box function being tracked here is purely advisory, it does not force the compiler to do anything -- so relying on it in a crypto implementation seems problematic?
That doesn't quite sound right... this black box function being tracked here is purely advisory, it does not force the compiler to do anything -- so relying on it in a crypto implementation seems problematic?
Yes. This is a weakness with the current definition. IMO the right answer is to give the crypto folks what they are asking for. I don't think there is any difficulty with specifying what that would be. The difficulty seems to be with LLVM, and that may be too hard to surmount :-/.
I think the questions are:
That doesn't quite sound right... this black box function being tracked here is purely advisory, it does not force the compiler to do anything
Even benchmarking people don't want an "advisory" function. They want one that does what is requested, reliably, just like the crypto people. (I am in both sets of people.) That doesn't mean that a single solution needs to work for both benchmarking and crypto though.
The difficulty seems to be with LLVM, and that may be too hard to surmount :-/.
My understanding is that the LLVM people within Google and the crypto people within Google do some work to ensure that clang compiles BoringSSL's "black box" function in a way that works:
crypto_word_t
is usize
. For crypto, a fn (usize) -> usize
black-box function is sufficient to implement everything else. IMO the best solution is to implement it as an intrinsic in LLVM so that the LLVM team keeps it working, and then expose that intrinsic through an always-inlined function in libcore.
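To illustrate why a fn(usize) -> usize barrier is enough to build the rest, here is a hedged sketch in the style of BoringSSL's value_barrier: the inline asm stands in for the hypothetical intrinsic, and the names are illustrative, not an existing API.

```rust
use core::arch::asm;

// Stand-in for the proposed intrinsic: a usize-to-usize value barrier.
fn value_barrier(mut w: usize) -> usize {
    unsafe { asm!("/* {0} */", inout(reg) w, options(nostack, nomem)); }
    w
}

/// Branchless select built on the barrier: returns `a` if `mask` is
/// all-ones, `b` if `mask` is zero. The barrier keeps the optimizer from
/// turning the mask arithmetic back into a branch on a known value.
fn constant_time_select(mask: usize, a: usize, b: usize) -> usize {
    let mask = value_barrier(mask);
    (mask & a) | (!mask & b)
}
```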
I suggest we split out the talk of crypto stuff into its own RFC. I think a crypto black box function would be more limited in scope (e.g. less generic w.r.t. which types it operates on) and would need a more rigorous implementation and testing strategy. Even if in the end hint::bench_black_box
and any crypto black box were to end up being implemented the same way under the covers, having them as separate APIs makes the users' intents more clear.
I don't expect that a programmer ever wants the advisory-only version.
Even benchmarking people don't want an "advisory" function. They want one that does what is requested, reliably, just like the crypto people.
I disagree. For microbenchmarks it would certainly be nice to have a function that's guaranteed to work. But practically speaking it's not necessary. At the end of the day when results look wonky I run objdump
or look at the assembly in perf
to see what's happening and when entire functions which I want to benchmark get replaced with a constant I know that I have to adjust the benchmark. If black_box
doesn't work in that case it would be quite visible.
And my understanding is that the "best effort" aspect is mostly a specification issue because it can't be guaranteed for all possible backends and future versions thereof. Today it works 100% reliably (as far as I know) with current LLVM versions. Which is good enough for running all benchmarks today. I don't need a guarantee that the benchmarks will continue to do exactly the same thing next year because they're developer tooling, not part of the API. After all LLVM might start optimizing away different parts of the benchmark next year anyway, places which I didn't need to black_box
today.
Benchmark and cryptography requirements are not the same.
IMO the best solution is to implement it as an intrinsic in LLVM so that LLVM team keeps it working
That wouldn't necessarily help with optimizing wasm runtimes. Or other backends such as GCC and cranelift. To be part of core it would have to be something that can be guaranteed for all backends.
That wouldn't necessarily help with optimizing wasm runtimes. Or other backends such as GCC and cranelift. To be part of core it would have to be something that can be guaranteed for all backends.
Each backend would need to provide such an intrinsic, in order for these things to work reliably. A backend might not do a good job implementing the black box but that would be a bug in that backend. Perhaps not a high-priority one, and not a high severity one (for benchmarking) but a bug nonetheless.
I don't think we can meaningfully distinguish between a guarantee that we already know might end up broken and where fixes may not be a priority and a guarantee that crypto libs can rely on. We have to document it as best-effort, which is the status quo.
Instead of providing something which is documented to maybe not do anything, perhaps it would be better to provide something which is documented to work, but only on the platforms where we can make it work.
[edited to add missing "not"]
And, I should say, then we could perhaps stabilise it when it's available for supported platforms.
LLVM's profile-guided optimization has support for value profiling which records the range of values that a variable holds. Even if you hide a constant behind inline assembly, the optimizer can still use the profiling information to create a "fast path" if the runtime value coming out of the inline assembly matches a known constant.
My point is, the current black_box
intrinsic does not provide the guarantees that you think it does and neither does BoringSSL's value_barrier
. I don't think it is possible to guarantee constant-time execution in LLVM's programming model, you have to write assembly directly for such a guarantee.
@Amanieu those are good points. In the case of bench_black_box
I think the issue is, should bench_black_box
be stabilized before any/all built-in support for making it work reliably is implemented in back-ends. I don't have strong feelings on that, but I'd bias towards having something that sometimes doesn't work, and then incrementally improving on it. I think things would be different for a security-oriented variant. Again, more reason to treat these two uses cases separately.
Yes. This is a weakness with the current definition. IMO the right answer is to give the crypto folks what they are asking for. I don't think there is any difficulty with specifying what that would be. The difficulty seems to be with LLVM, and that may be too hard to surmount :-/.
I think you are underestimating how difficult this is to specify. I literally don't know how to do it, and I do not think it has ever been done, in academia or otherwise. The current definition is a compromise because so far no proposal for a stronger specification has surfaced, at least nothing I would consider a proper formal spec.
See the RFC thread for more details -- it's not worth rehashing this discussion, I think.
I have some kind of an idea of what crypto people do want (assuming this is about what is called "constant time" in the crypto context). I also have ideas for how to achieve that properly, with a formal spec. But it literally requires changes to all layers from the assembly language (having CPUs guarantee that certain instructions are constant time) through LLVM IR and MIR to the surface language. Using black_box for this purpose is a bad hack that happens to work in many cases, but a proper solution to this problem will look very differently (a proper solution involves the compiler knowing which values are secret so it can make sure not to branch on secrets or do memory accesses based on them; any "solution" that involves hacking around the compiler instead of working with it is not fundamentally solving anything). I am happy to discuss issues around constant-time, but please not in this thread.
Picking up an earlier conversation:
By the way, the same concern probably applies to most names mentioned as alternatives so far, as they all stress the "blocks optimizations / blinds the compiler" angle without simultaneously highlighting that it is only suitable for benchmarks, where surprisingly aggressive optimizations only lead to confusing results rather than UB or other serious wrongness.
Would adding a hedge word do the job? maybe_black_box
, try_pessimize
or similar.
And my understanding is that the "best effort" aspect is mostly a specification issue because it can't be guaranteed for all possible backends and future versions thereof. Today it works 100% reliably (as far as I know) with current LLVM versions. Which is good enough for running all benchmarks today. I don't need a guarantee that the benchmarks will continue to do exactly the same thing next year because they're developer tooling, not part of the API. After all LLVM might start optimizing away different parts of the benchmark next year anyway, places which I didn't need to black_box today.
The issue is mostly that it is really hard to capture what black_box is supposed to do. Like most programming languages, the Rust spec uses an "as-if" rule -- if two programs are observably doing the same, they are considered equivalent. And the definition of "observation" here is solely about system calls (and volatile accesses). So it is entirely meaningless to claim guarantees about how long a function takes to run, or whether some memory access depends on a secret -- those things are not observable, and hence nothing is guaranteed about them. It is fundamentally impossible to overcome this limitation in a local way, affecting just the spec of a single function. Changing this would require starting at the bottom of the Rust spec to make more things observable, but of course most of the time we do not want these things to be observable. So we need to introduce new distinctions here to make just enough things observable such that we can make statements like "no memory access depends on a secret".
To try to use an analogy: all of Rust is, by design, color blind. Every part of the compiler may change the color of the output arbitrarily, as long as the brightness stays the same. You can't just make a single function color-aware here, since the rest of the compiler doesn't even have the concept of caring about colors. This would be like talking about the colors in a black-and-white image, it is entirely meaningless. Instead you have to selectively introduce color-awareness on every stage of the pipeline, and then you can start talking about colors in the spec of a function like black_box.
My comment there was about the needs of benchmarking, not crypto. In benchmarks we don't really care about constant time execution or branching on secret memory. It's mostly about hiding value-range information or code being dead from the compiler.
But even for this more lax case we neither have a spec nor commitment from backends to support it in the future.
Optimizations are likewise not part of what is considered "observable" -- that's the only reason why the compiler is even allowed to just change your code to do something else (and hopefully faster): as long as the sequence of syscalls issued by the program stays the same, everything is fair game.
So, neither the presence nor absence of any optimization is actually "observable" in this sense. Hence black_box has no observable effect. What would it even mean to "not optimize" some code? Do we guarantee a particular (naive) mapping from MIR to assembly instructions? I doubt that's what anyone wants. But we can still say that we do our best to shut down optimizations, we just can't really phrase this as a guarantee that one could, say, use in a formal proof. This is exactly what the "best effort" definition attempts to do.
Does it require syscalls though? Shared memory is also externally observable and I hope LLVM provides explicit guarantees around that. So instead of the asm!
approach we could also allocate a shared memory region and launder values through that (atomic? volatile?), although that might have more overhead than passing through registers.
I assume you mean memory shared across process boundaries? (For many people including me, "shared memory" means memory shared across thread boundaries.)
Well technically I think non-volatile accesses are not observable no matter the underlying memory, and compilers certainly optimize like that. But it turns out that in practice, if you have to keep other threads inside the same process happy (which you do anyway no matter whether memory is externally observable or not), you also do everything you need to do to keep other processes happy. I am not aware of any proper treatment of this, though.
And anyway, none of this makes "laundering" meaningful in a formal way.
This is basically the same discussion as for N2682 memset_explicit
.
In the C case, the two main avenues discussed to do so have been "intent semantics" (i.e. say what is expected, without any particular guarantee) and "implementation-defined semantics" (i.e. read your vendor manual to see if this is supposed to work).
And yes, formalizing unobservable events w.r.t. the abstract machine is not what the C/C++/Rust specs are good at, and doing it properly may require another model.
Nevertheless, a solution is needed. Leaving it up to users to do whatever hack they end up with is strictly worse than letting the compiler know what is going on -- even if the compiler implements the very same thing users would have done anyway, and even if the compiler still fails to do what the user wants in some cases.
Instead, providing a way to express that gives a chance to the compiler to do the right thing. Plus it ends up being more portable, intent is clear for readers, users don't need a hack that may break in a future compiler version only, the compiler may get better in the future even if it does not work for all cases today, etc.
In fact, by providing a way to inform the compiler about it, then the vendor can ignore users' support requests about other hacks saying "well, you did not use magic_function
; please use that one and see if it works for you"... ;)
Using volatile can ensure that a value gets a real address at some point, but that's about as much as it can help with black boxing. The compiler could still pre-compute a full expression and just const fold it into a single value that then gets an address which is accessed with volatile.
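For concreteness, the volatile "laundering" pattern under discussion could be sketched like this (illustrative only; per the limitation just described, it does not stop the expression producing the value from being const-folded first):

```rust
use core::ptr;

// Illustrative only: force the value through a memory slot with volatile
// accesses. The accesses themselves cannot be elided, but the computation
// of `x` can still be folded to a constant before it reaches the slot.
fn volatile_launder<T: Copy>(x: T) -> T {
    let mut slot = x;
    unsafe {
        ptr::write_volatile(&mut slot, x);
        ptr::read_volatile(&slot)
    }
}
```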
I assume you mean memory shared across process boundaries? (For many people including me, "shared memory" means memory shared across thread boundaries.)
Yes.
Well technically I think non-volatile accesses are not observable no matter the underlying memory, and compilers certainly optimize like that. But it turns out that in practice, if you have to keep other threads inside the same process happy (which you do anyway no matter whether memory is externally observable or not), you also do everything you need to do to keep other processes happy. I am not aware of any proper treatment of this, though.
AIUI a compiler could try to reason about whether other threads can access the memory location used for atomic accesses and if not treat it like local accesses, i.e. a sufficiently smart compiler could turn atomics into plain accesses which then get optimized further, defeating the black box.
It can't do that for memory shared across process boundaries, otherwise IPC can't work. So there has to be something inhibiting that optimization.
Using volatile can ensure that a value gets a real address at some point, but that's about as much as it can help with black boxing. The compiler could still pre-compute a full expression and just const fold it into a single value that then gets an address which is accessed with volatile.
I meant first writing to shared memory and then reading back from it, so the compiler would have to assume that it could have been externally modified in the meantime even though in practice the hardware would see that the memory is uncontended and can do store buffer forwarding.
I've no idea how you'd specify shared memory in core
, though I'm sure you could get some in alloc
and/or std
.
This is a tracking issue for the RFC std::hint::bench_black_box. Original RFC: RFC 2360.
Public API:
Steps:
Unresolved questions:
- [ ] const fn: it is unclear whether bench_black_box should be a const fn. If it were, that would hint that it cannot have any side-effects, or that it cannot do anything that const fns cannot do.
- [ ] Naming: during the RFC discussion it was unclear whether black_box is the right name for this primitive but we settled on bench_black_box for the time being. We should resolve the naming before stabilization. Also, we might want to add other benchmarking hints in the future, like bench_input and bench_output, so we might want to put all of this into a bench sub-module within the core::hint module. That might be a good place to explain how the benchmarking hints should be used holistically. Some arguments in favor or against using "black box" are that:
  - _box has nothing to do with Box or box-syntax, which might be confusing

Alternative names suggested: pessimize, unoptimize, unprocessed, unknown, do_not_optimize (Google Benchmark).