Closed: nstilt1 closed this 10 months ago
CPU-specific optimizations can largely eliminate the need to use pointers for increased performance, so I would say this isn't necessary. Plus, a "safe" `generate` method can still be used to achieve the compile-time variable-sized buffer.
If anybody is still interested in having a `fill_bytes()` method that fills arrays using pointers on generic targets, it might be feasible without breaking changes by adding an extra trait, but I don't know exactly what that would look like yet. That seems like it belongs in a different issue, though, since the name and original post here aren't quite aligned with it.
Personally I'm not really convinced by arguments of 5% extra performance in benchmarks anyway; in many cases the real-world benefit is less significant than the costs of an unsafe API. (Similarly, we've had a couple of requests to support `MaybeUninit` in `fill_bytes` / `generate`, which we never implemented.)

So while the above proposal might still have merit, substantial argument/benefit would be required to justify the added complexity and usage of `unsafe`.
Summary
Changing `BlockRngCore` wouldn't cause any noticeable API changes for the end user, but it would require a little bit of `unsafe` code for libraries to adjust to the change.

Details
`BlockRngCore` would likely need to change to something like the sketch below. Then `fill_bytes()` and `generate_and_set()` would need to be updated to use the new method. Adding `get_buffer_size()` might help when a larger buffer is allocated at compile time: its return value can just be used as the maximum.
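As a rough sketch of what that could look like (the `generate` signature here is the one given in the Motivation section below; the `get_buffer_size()` semantics are just one possible reading, and the associated types are kept as-is from the current trait):

```rust
pub trait BlockRngCore {
    // Kept from the current trait; the wrapper still needs a buffer type for
    // partial-block reads.
    type Item;
    type Results: AsRef<[Self::Item]> + AsMut<[Self::Item]> + Default;

    /// Replaces `fn generate(&mut self, results: &mut Self::Results)`:
    /// writes `num_blocks` whole blocks directly through `dest_ptr`.
    ///
    /// # Safety
    /// `dest_ptr` must be valid for writes of `num_blocks` whole blocks
    /// (e.g. `num_blocks * 64` bytes for ChaCha).
    unsafe fn generate(&mut self, dest_ptr: *mut u8, num_blocks: usize);

    /// One possible reading: the maximum number of blocks the backend will
    /// write per call, usable as the cap when a larger buffer is allocated
    /// at compile time.
    fn get_buffer_size(&self) -> usize;
}
```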
Motivation

For BlockRngs that use SIMD backends, such as ChaCha20, there was a 5%-7% increase in performance on AVX2 when using `unsafe fn generate(&mut self, dest_ptr: *mut u8, num_blocks: usize)`. The performance increase seems to be mostly caused by the generate function running its AVX2 setup code only once to write `num_blocks` blocks, instead of running it once every 4 blocks. There is a smaller performance gain from using pointers, but in order to use the `num_blocks` parameter, it seems that pointers might be a prerequisite; a rough sketch of the resulting control flow is below.
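This is a stand-in, not the actual `chacha20` backend: `setup()` and `block()` are placeholders for loading state into SIMD registers and running the ChaCha rounds, but the call structure shows where the amortization comes from.

```rust
use core::ptr;

const BLOCK_SIZE: usize = 64;

struct Backend {
    state: [u32; 16],
}

impl Backend {
    /// Placeholder for the per-call SIMD setup (loading state into registers, etc.).
    fn setup(&self) -> [u32; 16] {
        self.state
    }

    /// Placeholder for producing one 64-byte block from the working state
    /// (the real backend would run the ChaCha rounds here).
    fn block(&self, working: &[u32; 16], counter: u32) -> [u8; BLOCK_SIZE] {
        let mut out = [0u8; BLOCK_SIZE];
        for (i, word) in working.iter().enumerate() {
            out[i * 4..i * 4 + 4].copy_from_slice(&(word ^ counter).to_le_bytes());
        }
        out
    }

    /// # Safety
    /// `dest_ptr` must be valid for writes of `num_blocks * BLOCK_SIZE` bytes.
    unsafe fn generate(&mut self, dest_ptr: *mut u8, num_blocks: usize) {
        // The setup cost is paid once per call and amortized over `num_blocks`,
        // instead of once per fixed 4-block buffer refill.
        let working = self.setup();
        for i in 0..num_blocks {
            let block = self.block(&working, self.state[12]);
            // Write directly to the destination, skipping the internal buffer.
            unsafe {
                ptr::copy_nonoverlapping(block.as_ptr(), dest_ptr.add(i * BLOCK_SIZE), BLOCK_SIZE);
            }
            self.state[12] = self.state[12].wrapping_add(1); // advance the block counter
        }
    }
}
```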
This method could also be used similarly to a proposal for `fn rand::secret_rng`, since it is capable of skipping the internal buffer and writing the output directly to the input array, although it will only behave that way for full blocks.

Regarding @newpavlov's comment from a while ago:
It may also be possible to change `BlockRngCore` to allow for variable-sized buffers with a `generate()` method that takes a pointer. Currently, I've got a branch of `chacha20` where the size of the buffer can be determined at compile time. However, if the code was built for AVX2 targets and then run on a machine without AVX2, that machine would still use the larger buffer. The code won't break, but end users in this scenario would be stuck with the larger buffer, so this solution is not quite ideal.
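For context, the compile-time sizing boils down to something like this in spirit (simplified; not the actual code on that branch):

```rust
// The buffer size is decided when the crate is compiled ("built for AVX2
// targets" or not), so a machine without AVX2 running such a build still
// carries the larger buffer; runtime CPU detection cannot shrink it.
#[cfg(target_feature = "avx2")]
const BUFFER_BLOCKS: usize = 4; // sized for the 4-block-wide AVX2 path

#[cfg(not(target_feature = "avx2"))]
const BUFFER_BLOCKS: usize = 1;

const BUFFER_SIZE_BYTES: usize = BUFFER_BLOCKS * 64; // ChaCha blocks are 64 bytes
```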
Alternatives

Which alternatives might be considered, and why or why not? I don't know if there is an alternative with less `unsafe` code that is capable of writing to both `[u32]` and `[u8]`; slices don't seem to be fully compatible. Another alternative would be to not switch to `unsafe fn generate()` at all. It seems like the most significant performance increase is only on AVX2, ~~unless the NEON backend's setup code were to be more isolated~~, and it's also only a 5%-7% difference.

Edit: I tried to make this a little clearer. I have also restructured the NEON code, and I can confirm that there was a near 0% performance difference for NEON as a result of the `num_blocks` parameter.