rdrand: Avoid inlining unrolled retry loops.

briansmith commented 4 months ago

The rdrand implementation contains three calls to rdrand():

One in the main loop, for full words of output.
One after the main loop, for the potential partial word of output.
One inside the self-test loop.

In the first case, the loop is unrolled into:

loop:
   ....

   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop
   rdrand <register>
   jb loop

The second case is similar, except it isn't a loop.

In the third case, the self-test loop, the same unrolling happens, but then the self-test loop is also unrolled, so the result is a sequence of 160 instructions.

With this change, the generated code for the loop looks like this:

loop:
        ...

    rdrand <register>
    jb loop
    call retry
    test rax, rax
    jne loop
    jmp fail

The generated code for the tail now looks like this:

        rdrand rdx
    jae call_retry
        ...

This is much better because we're no longer jumping over the uselessly- unrolled loops.

The loop in retry() still gets unrolled though, but the compiler will put it in the cold function section.

Since rdrand will basically never fail, the jb <success> in each call is going to be predicted as succeeding, so the number of instructions doesn't change. But, instruction cache pressure should be reduced.

newpavlov commented 4 months ago

Maybe it's better to just write the desired assembly?

briansmith commented 4 months ago

Maybe it's better to just write the desired assembly?

This pretty much does what we want. There is a longstanding rust-lang/rust issue about controlling unrolling, but with these changes the unrolling isn't much of an issue anymore. There is a place for inline assembly but I think loops are one thing where I'd like to avoid doing them in inline assembly.

newpavlov commented 4 months ago

Assembly with loop looks simple enough: https://rust.godbolt.org/z/1r87Gc6Kb It's just a quick draft, so it's better to double-check the code. (Also, I am not sure why it inserts seemingly useless push rax and pop rcx)

josephlr commented 4 months ago

I did some experiments looking at the size of `rdrand_exact` at different optimization levels. I wanted to exclude the initialization code, as #443 is a better place to discuss inlining of the initialization code.	Optimization Level	Current Impl	This PR
`opt-level=z`	`186` bytes	`197` bytes	`159` bytes
`opt-level=s`	`147` bytes	`152` bytes	`147` bytes
`opt-level=3`	`211` bytes	`194` bytes (`131` excluding `retry`)	`132` bytes

So this does help at opt-level=3, but I'm wondering if it's worth it. All these implementations will have essentially identical performance, and 80 bytes of icache pressure is probably irrelevant compared to the latancy of rdrand. The main benefit to me is that when we have #288, we will be able to see if the retry logic is covered by tests.

Maybe it's better to just write the desired assembly?

I think we should stick to the intrinsics, they allow LLVM to reason about the carry flag, and avoid error-prone inline assembly.

briansmith commented 4 months ago

TBH, I don't have any urgent need for this. I mostly wrote it so I could review the generated assembly easier. At some point I think it would be good to expose the RDRAND implementation on all x86/x86_64 targets via a new API, so that a user can mix getrandom() and RDRAND, which is pretty common to do. This is mostly a step towards that; the hope is that the linker will be able to put the non-cold parts of getrandom[_uninit] and the non-cold parts of rdrand right next to each other in the same page.

newpavlov commented 3 months ago

I found a much better approach: https://rust.godbolt.org/z/xW4sWr8hW

It uses the fact that the rdrand function can not be inlined because of the enabled target feature, so number of retries becomes opaque for the optimizer. We also could mark it with #[inline(never)] or even #[no_mangle] to be extra safe. It also may be worth to return Result<Word, Error> instead of Option to shave off several instructions.

briansmith commented 3 months ago

It uses the fact that the rdrand function can not be inlined because of the enabled target feature, so number of retries becomes opaque for the optimizer.

I don't think we need to spend too much more time on this. But I just want to point out that in targets where rdrand is actually used, at lesat SGX, have rdrand as a statically-enabled feature, so everything gets inlined into getrandom_inner.

newpavlov commented 3 months ago

Yes, this is why I mentioned #[inline(never)] and #[no_mangle]. From playing in godbolt, the latter looks more robust, but we would need to be careful with the function name to prevent potential collisions.

Feel free to close this PR, if you do not plan to move forward with it.

briansmith commented 3 months ago

@newpavlov Your idea of removing the #[target_feature(enable = "rdrand")] from the callers does seem better than my idea. However, in https://github.com/rust-random/getrandom/issues/230#issuecomment-2143138898 it was suggested that maybe we would offer a public RDRAND API. I think we should have a discussion about whether we want to have such an API, what the design of that public API would be, see if we can implement getrandom_uninit in terms of that API, and then see if we can avoid the weird codegen in the implementation of that API.

With that in mind, I'm going to close this. If we end up not adding a public rdrand API then we can implement your idea. Or if you want to do it in the interim, I'm happy to review it.

rust-random / getrandom

rdrand: Avoid inlining unrolled retry loops. #444