Looking at the `fill` method, I noticed there was some room for optimisation, including removing an unnecessary `unwrap()` from the hot loop. The remainder portion was also generating a new entropy block for every remaining `u8`, even though any remaining slice length is always less than 8 bytes, i.e. one `u64` block (which is what WyRand generates per new state). Therefore, up to seven `u8` calls can be reduced to a single `u64` block.
Afterwards, I simplified the copying of entropy to make use of `copy_from_slice`, since we know that either the slice lengths match exactly (so no `try_into` conversions are needed), or the remaining input is always shorter than the target block.
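As an illustration of the approach described above, here is a minimal sketch; the generator constants and method names are assumptions modeled on WyRand, not the crate's actual source:

```rust
// Illustrative sketch only: the constants and method names below are
// assumptions modeled on WyRand, not the crate's actual implementation.
struct WyRand {
    seed: u64,
}

impl WyRand {
    fn new(seed: u64) -> Self {
        Self { seed }
    }

    /// One state advance yields one u64 (8 bytes) of entropy.
    fn rand(&mut self) -> [u8; 8] {
        self.seed = self.seed.wrapping_add(0xa076_1d64_78bd_642f);
        let t = u128::from(self.seed)
            .wrapping_mul(u128::from(self.seed ^ 0xe703_7ed1_a0b4_28db));
        (((t >> 64) as u64) ^ (t as u64)).to_le_bytes()
    }

    /// Fill `buffer` in 8-byte chunks; the remainder (at most 7 bytes)
    /// needs only one extra u64 block instead of one call per byte.
    #[inline]
    fn fill(&mut self, buffer: &mut [u8]) {
        let mut chunks = buffer.chunks_exact_mut(8);
        for chunk in &mut chunks {
            // Lengths match exactly here, so copy_from_slice needs no try_into.
            chunk.copy_from_slice(&self.rand());
        }
        let remainder = chunks.into_remainder();
        if !remainder.is_empty() {
            // A single entropy block covers the 1..=7 leftover bytes.
            let block = self.rand();
            remainder.copy_from_slice(&block[..remainder.len()]);
        }
    }
}

fn main() {
    let mut rng = WyRand::new(42);
    let mut buf = [0u8; 13]; // one full chunk plus a 5-byte remainder
    rng.fill(&mut buf);
    println!("{:?}", buf);
}
```

`copy_from_slice` panics on a length mismatch, which is why the remainder path slices the block down to `remainder.len()` first.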
I then benchmarked the resulting code on my AMD Ryzen 5 Pro 2500U (4C/8T, 2.0GHz base, 3.6GHz boost) with 16GB RAM.
Before:
```
running 2 tests
test fill       ... bench:          68 ns/iter (+/- 1)
test fill_naive ... bench:         471 ns/iter (+/- 9)
```
After:
```
running 2 tests
test fill       ... bench:          64 ns/iter (+/- 1)
test fill_naive ... bench:         485 ns/iter (+/- 15)
```
I was getting 68-70ns before, and 64-65ns afterwards. Then, out of curiosity, I added the `#[inline]` annotation and tested again, and got even more perf as a result:
After, with `#[inline]`:
```
running 2 tests
test fill       ... bench:          56 ns/iter (+/- 1)
test fill_naive ... bench:         472 ns/iter (+/- 14)
```
So it might be advantageous to include it. The end result with all changes is that we've gone from 68-70ns, to 64-65ns, and finally to 56ns.