Closed briansmith closed 3 months ago
Maybe it's better to just write the desired assembly?
Maybe it's better to just write the desired assembly?
This pretty much does what we want. There is a longstanding rust-lang/rust issue about controlling unrolling, but with these changes the unrolling isn't much of an issue anymore. There is a place for inline assembly but I think loops are one thing where I'd like to avoid doing them in inline assembly.
Assembly with loop looks simple enough: https://rust.godbolt.org/z/1r87Gc6Kb It's just a quick draft, so it's better to double-check the code. (Also, I am not sure why it inserts seemingly useless push rax
and pop rcx
)
I did some experiments looking at the size of rdrand_exact at different optimization levels. I wanted to exclude the initialization code, as #443 is a better place to discuss inlining of the initialization code. |
Optimization Level | Current Impl | This PR | ASM Loop |
---|---|---|---|---|
opt-level=z |
186 bytes |
197 bytes |
159 bytes |
|
opt-level=s |
147 bytes |
152 bytes |
147 bytes |
|
opt-level=3 |
211 bytes |
194 bytes (131 excluding retry ) |
132 bytes |
So this does help at opt-level=3
, but I'm wondering if it's worth it. All these implementations will have essentially identical performance, and 80 bytes of icache pressure is probably irrelevant compared to the latancy of rdrand
. The main benefit to me is that when we have #288, we will be able to see if the retry logic is covered by tests.
Maybe it's better to just write the desired assembly?
I think we should stick to the intrinsics, they allow LLVM to reason about the carry flag, and avoid error-prone inline assembly.
TBH, I don't have any urgent need for this. I mostly wrote it so I could review the generated assembly easier. At some point I think it would be good to expose the RDRAND implementation on all x86/x86_64 targets via a new API, so that a user can mix getrandom()
and RDRAND
, which is pretty common to do. This is mostly a step towards that; the hope is that the linker will be able to put the non-cold parts of getrandom[_uninit]
and the non-cold parts of rdrand
right next to each other in the same page.
I found a much better approach: https://rust.godbolt.org/z/xW4sWr8hW
It uses the fact that the rdrand
function can not be inlined because of the enabled target feature, so number of retries becomes opaque for the optimizer. We also could mark it with #[inline(never)]
or even #[no_mangle]
to be extra safe. It also may be worth to return Result<Word, Error>
instead of Option
to shave off several instructions.
It uses the fact that the
rdrand
function can not be inlined because of the enabled target feature, so number of retries becomes opaque for the optimizer.
I don't think we need to spend too much more time on this. But I just want to point out that in targets where rdrand
is actually used, at lesat SGX, have rdrand as a statically-enabled feature, so everything gets inlined into getrandom_inner
.
Yes, this is why I mentioned #[inline(never)]
and #[no_mangle]
. From playing in godbolt, the latter looks more robust, but we would need to be careful with the function name to prevent potential collisions.
Feel free to close this PR, if you do not plan to move forward with it.
@newpavlov Your idea of removing the #[target_feature(enable = "rdrand")]
from the callers does seem better than my idea. However, in https://github.com/rust-random/getrandom/issues/230#issuecomment-2143138898 it was suggested that maybe we would offer a public RDRAND API. I think we should have a discussion about whether we want to have such an API, what the design of that public API would be, see if we can implement getrandom_uninit
in terms of that API, and then see if we can avoid the weird codegen in the implementation of that API.
With that in mind, I'm going to close this. If we end up not adding a public rdrand API then we can implement your idea. Or if you want to do it in the interim, I'm happy to review it.
The rdrand implementation contains three calls to rdrand():
In the first case, the loop is unrolled into:
The second case is similar, except it isn't a loop.
In the third case, the self-test loop, the same unrolling happens, but then the self-test loop is also unrolled, so the result is a sequence of 160 instructions.
With this change, the generated code for the loop looks like this:
The generated code for the tail now looks like this:
This is much better because we're no longer jumping over the uselessly- unrolled loops.
The loop in
retry()
still gets unrolled though, but the compiler will put it in the cold function section.Since rdrand will basically never fail, the
jb <success>
in each call is going to be predicted as succeeding, so the number of instructions doesn't change. But, instruction cache pressure should be reduced.