rust-random / rand

A Rust library for random number generation.
https://crates.io/crates/rand

CHANGE: Adjust `BlockRngCore` to have a `generate()` method that takes a `*mut u8` #1367

Closed nstilt1 closed 10 months ago

nstilt1 commented 11 months ago

Summary

Changing BlockRngCore in this way wouldn't result in any noticeable API changes for the end user, but it would require a little unsafe code for libraries adapting to the change.

Details

BlockRngCore would likely need to change to something like

pub trait BlockRngCore {
    /// Results element type, e.g. `u32`.
    type Item;

    /// Results type. This is the 'block' an RNG implementing `BlockRngCore`
    /// generates, which will usually be an array like `[u32; 16]`.
    type Results: AsRef<[Self::Item]> + AsMut<[Self::Item]> + Default;

    /// The minimum block size for the `BlockRng`, in bytes.
    const BLOCK_SIZE: usize;

    /// Helps determine the optimal buffer size at runtime, measured in items.
    /// If there is only one possible size, it can simply return that size.
    fn get_buffer_size() -> usize;

    /// Determines how many blocks fit in the buffer.
    fn get_buf_blocks(results: &Self::Results) -> usize;

    /// Returns a mutable byte pointer to the results buffer; this is
    /// probably unnecessary.
    fn get_results_mut_ptr(&mut self, results: &mut Self::Results) -> *mut u8 {
        results.as_mut().as_mut_ptr() as *mut u8
    }

    /// Generates `num_blocks` blocks of results and blindly writes them
    /// to `dest_ptr`.
    ///
    /// Safety: `generate()` must write exactly `num_blocks * Self::BLOCK_SIZE`
    /// bytes to `dest_ptr`.
    unsafe fn generate(&mut self, dest_ptr: *mut u8, num_blocks: usize);
}
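
For illustration, a trivial core implementing the proposed trait could look like the sketch below. `ExampleCore`, its counter state and its sizes are made up for the example and are not part of the proposal.

use core::ptr;

struct ExampleCore {
    counter: u32,
}

impl BlockRngCore for ExampleCore {
    type Item = u32;
    type Results = [u32; 16];

    /// One block = 16 `u32` words = 64 bytes.
    const BLOCK_SIZE: usize = 64;

    fn get_buffer_size() -> usize {
        // A real SIMD backend could return a different size depending on
        // runtime feature detection; this toy core only has one size.
        16
    }

    fn get_buf_blocks(results: &Self::Results) -> usize {
        (results.len() * core::mem::size_of::<u32>()) / Self::BLOCK_SIZE
    }

    unsafe fn generate(&mut self, dest_ptr: *mut u8, num_blocks: usize) {
        // Must write exactly `num_blocks * Self::BLOCK_SIZE` bytes.
        let mut p = dest_ptr;
        for _ in 0..num_blocks {
            for _ in 0..(Self::BLOCK_SIZE / 4) {
                ptr::write_unaligned(p as *mut u32, self.counter.to_le());
                self.counter = self.counter.wrapping_add(1);
                p = p.add(4);
            }
        }
    }
}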

Then fill_bytes() and generate_and_set() would become something like:

/// Generate a new set of results immediately, setting the index to the
/// given value.
#[inline]
pub fn generate_and_set(&mut self, index: usize) {
    assert!(index < R::get_buffer_size());
    let num_blocks = R::get_buf_blocks(&self.results);
    let results_ptr = self.core.get_results_mut_ptr(&mut self.results);
    unsafe {
        self.core.generate(results_ptr, num_blocks);
    }
    self.index = index;
}
#[inline]
fn fill_bytes(&mut self, dest: &mut [u8]) {
    let dest_len = dest.len();
    // calculate the remaining bytes to fill from `self.buffer`,
    // when indexed by `u32`s
    let buffer_size = <R as BlockRngCore>::get_buffer_size();
    let remaining_bytes = ((buffer_size << 2) - (self.index << 2)).min(dest_len);

    let mut dest_pos = 0;
    if remaining_bytes != 0 {
        // fills with as many bytes from the currently generated buffer as necessary
        if self.index < buffer_size {
            let (consumed_u32, filled_u8) =
                fill_via_u32_chunks(&mut self.results.as_mut()[self.index..buffer_size], &mut dest[0..]);
            self.index += consumed_u32;
            dest_pos += filled_u8;

            if dest_len == dest_pos {
                return;
            }
        }
    }

    let remaining_blocks = (dest_len - dest_pos) / R::BLOCK_SIZE;

    // SAFETY: This only writes to indices that have not yet been written
    // to, and we have determined how many blocks are remaining.
    unsafe {
        let dest_ptr = dest.as_mut_ptr().add(dest_pos);
        self.core.generate(dest_ptr, remaining_blocks);
    }

    dest_pos += R::BLOCK_SIZE * remaining_blocks;

    if dest_pos == dest_len {
        // index is still at max at this point; no need to refill buffer
        return;
    }
    // refill buffer before filling the tail
    self.generate_and_set(0);

    let (consumed_u32, _filled_u8) = fill_via_u32_chunks(
        &mut self.results.as_mut()[self.index..],
        &mut dest[dest_pos..],
    );
    self.index = consumed_u32;
}
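
As a sanity check, here is a small worked example (not part of the proposal) of the arithmetic in fill_bytes(), assuming a 16-word (64-byte) buffer, BLOCK_SIZE = 64, self.index = 12 and a 200-byte dest:

fn main() {
    const BLOCK_SIZE: usize = 64;
    let (buffer_size, index, dest_len) = (16usize, 12usize, 200usize);

    // head: bytes still available in the current buffer
    let remaining_bytes = ((buffer_size << 2) - (index << 2)).min(dest_len);
    assert_eq!(remaining_bytes, 16);

    // body: whole blocks written directly to `dest` via `generate()`
    let remaining_blocks = (dest_len - remaining_bytes) / BLOCK_SIZE;
    assert_eq!(remaining_blocks, 2); // 128 bytes

    // tail: leftover bytes served from a freshly refilled buffer
    let tail = dest_len - remaining_bytes - remaining_blocks * BLOCK_SIZE;
    assert_eq!(tail, 56);
}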

With get_buffer_size(), a larger buffer could be allocated at compile time while its return value serves as the effective maximum at runtime.
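
For example (a sketch only; the feature checks and sizes are illustrative and not taken from any existing backend, and it assumes std is available for detection), get_buffer_size() could select a size at runtime while the Results buffer is sized for the largest case at compile time:

// Sketch: runtime selection of the effective buffer size, in items.
fn get_buffer_size() -> usize {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if std::arch::is_x86_feature_detected!("avx2") {
            return 64; // e.g. 4 blocks of 16 `u32`s per `generate()` call
        }
        if std::arch::is_x86_feature_detected!("sse2") {
            return 32; // 2 blocks
        }
    }
    16 // scalar fallback: one block
}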

Motivation

For BlockRngs that use SIMD backends, such as ChaCha20, there was a 5%-7% performance increase on AVX2 when using unsafe fn generate(&mut self, dest_ptr: *mut u8, num_blocks: usize). The increase seems to come mostly from the generate function running its AVX2 setup code once to write num_blocks blocks instead of once every 4 blocks. Using pointers gives a smaller gain on its own, but pointers seem to be a prerequisite for making use of the num_blocks parameter.
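
To illustrate the amortization described above (a stand-in sketch, not the real chacha20 backend; expensive_setup and the per-block computation are made up):

const BLOCK_SIZE: usize = 64;

fn expensive_setup() -> [u8; BLOCK_SIZE] {
    // stand-in for loading the key/constants into SIMD registers
    [0x42; BLOCK_SIZE]
}

unsafe fn generate(dest_ptr: *mut u8, num_blocks: usize) {
    let state = expensive_setup(); // hoisted out of the per-block loop
    for i in 0..num_blocks {
        // stand-in for computing one block from `state`
        core::ptr::copy_nonoverlapping(state.as_ptr(), dest_ptr.add(i * BLOCK_SIZE), BLOCK_SIZE);
    }
}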

This could also serve a similar purpose to the proposal for fn rand::secret_rng, since it can skip the internal buffer and write output directly to the destination slice, although it only behaves that way for full blocks.

Regarding @newpavlov's comment from a while ago:

The current design of BlockRngCore (and the proposed generic_const_exprs-based version) is not ideal. One issue for which I don't have a good solution is variability of block size dependent on runtime target feature detection. For example, ChaCha has different optimal block sizes dependent on whether hardware has SIMD extensions. Right now we simply use the largest supported block size, which is a bit suboptimal.

It may also be possible to change BlockRngCore to allow variable-sized buffers with a generate() method that takes a pointer. I currently have a branch of chacha20 where the buffer size is determined at compile time. However, if the code is built for AVX2 targets and then run on a machine without AVX2, that machine would still use the larger buffer. The code won't break, but end users in this scenario would carry a larger buffer than necessary, so this solution is not quite ideal.

Alternatives

Which alternatives might be considered, and why or why not? I don't know of an alternative with less unsafe code that is capable of writing to both [u32] and [u8]; slices of the two element types don't seem to be directly compatible. Another alternative would be not to switch to unsafe fn generate(): the most significant performance increase is only on AVX2 ~unless the NEON backend's setup code were to be more isolated~, and it's only a 5%-7% difference.

Edit: I have tried to make this a little clearer. I have also restructured the NEON code, and I can confirm that the num_blocks parameter made essentially no performance difference for NEON.

nstilt1 commented 10 months ago

CPU-specific optimizations can largely eliminate the need to use pointers for increased performance, so I would say this isn't necessary. Plus, a "safe" generate method can still be used to achieve the compile-time variable-sized buffer.

If anybody would still be interested in a fill_bytes() method that fills arrays using pointers for generic targets, it might be feasible without breaking changes via an additional trait, but I don't know exactly how that would look yet. It seems that should belong to a different issue though, as the name and original post aren't quite aligned with that.

dhardy commented 10 months ago

Personally I'm not really convinced by arguments of 5% extra performance in benchmarks anyway — in many cases the real-world benefit is less significant than the costs of an unsafe API. (Similarly we've had a couple of requests to support MaybeUninit in fill_bytes / generate, which we never implemented.)

So while the above proposal might still have merit, substantial argument/benefit would be required to justify the added complexity and usage of unsafe.