Replacing rand_chacha with chacha20

dhardy commented 4 years ago

@tarcieri has recently implemented ChaCha*Rng via the chacha20 crate. docs are here. Breaking out into a new issue this isn't really the topic of #872...

This brings the question: should we prefer this over the current (@kazcw's) implementation?

First things first, the rand_chacha crate has 44 reverse dependencies, is clearly a rand family crate and is recommended in our book. There is no plan to retire this crate.

Second, lets look at a few stats.

Unsafe

Here are the results of running cargo geiger on the current rand_core (after #931):

$ cargo geiger
<snip>
Metric output format: x/y
    x = unsafe code used by the build
    y = total unsafe code found in the crate

Symbols: 
    :) = No `unsafe` usage found, declares #![forbid(unsafe_code)]
    ?  = No `unsafe` usage found, missing #![forbid(unsafe_code)]
    !  = `unsafe` usage found

Functions  Expressions  Impls  Traits  Methods  Dependency

0/0        0/0          0/0    0/0     0/0      ?  rand_chacha 0.2.2
2/2        557/626      0/0    0/0     14/22    !  ├── ppv-lite86 0.2.6
0/0        22/22        0/0    0/0     0/0      !  └── rand_core 0.5.1

2/2        579/648      0/0    0/0     14/22

And on chacha20:

$ cargo geiger --no-default-features --features rng
<snip>
Functions  Expressions  Impls  Traits  Methods  Dependency

7/14       199/427      0/0    0/0     1/2      !  chacha20 0.3.1
0/0        22/22        0/0    0/0     0/0      !  ├── rand_core 0.5.1
2/4        50/150       1/1    0/0     3/3      !  │   └── getrandom 0.1.14
0/0        0/0          0/0    0/0     0/0      ?  │       ├── cfg-if 0.1.10
2/2        73/95        0/0    0/0     5/11     !  │       └── libc 0.2.66
0/0        0/0          0/0    0/0     0/0      ?  └── stream-cipher 0.3.2
0/0        0/0          0/0    0/0     0/0      ?      ├── blobby 0.1.2
0/1        0/231        0/0    0/0     0/0      ?      │   └── byteorder 1.3.2
0/1        0/223        0/18   0/8     0/5      ?      └── generic-array 0.12.3
0/0        0/51         0/0    0/0     0/0      ?          └── typenum 1.11.2

11/22      344/1199     1/19   0/8     9/21

Something's wrong here: stream-chiper is only a dev-dependency and rand_core is depended on in exactly the same way as in rand_chacha. So only the first line is relevant.

Lines of code

Tokei output for rand_chacha:

$ tokei rand/rand_chacha/ cryptocorrosion/utils-simd/ppv-lite86/
-------------------------------------------------------------------------------
 Language            Files        Lines         Code     Comments       Blanks
-------------------------------------------------------------------------------
 Markdown                2           70           70            0            0
 Rust                    9         4285         3864          164          257
 TOML                    2           48           42            0            6
-------------------------------------------------------------------------------
 Total                  13         4403         3976          164          263
-------------------------------------------------------------------------------

(83% of this is from ppv-lite86).

For chacha20:

$ tokei
-------------------------------------------------------------------------------
 Language            Files        Lines         Code     Comments       Blanks
-------------------------------------------------------------------------------
 Markdown                2          170          170            0            0
 Rust                   12         1617         1163          203          251
 TOML                    1           47           40            0            7
-------------------------------------------------------------------------------
 Total                  15         1834         1373          203          258
-------------------------------------------------------------------------------

There are no dependencies we need, so that's it. A nice improvement.

Benchmarks

(64-bit Haswell)

Here's rand_chacha:

$ cargo bench --bench generators chacha
   Compiling ...

running 16 tests
test gen_bytes_chacha12      ... bench:     356,133 ns/iter (+/- 5,261) = 2875 MB/s
test gen_bytes_chacha20      ... bench:     539,102 ns/iter (+/- 8,237) = 1899 MB/s
test gen_bytes_chacha8       ... bench:     263,023 ns/iter (+/- 15,029) = 3893 MB/s
test gen_u32_chacha12        ... bench:       1,689 ns/iter (+/- 168) = 2368 MB/s
test gen_u32_chacha20        ... bench:       2,475 ns/iter (+/- 60) = 1616 MB/s
test gen_u32_chacha8         ... bench:       1,300 ns/iter (+/- 56) = 3076 MB/s
test gen_u64_chacha12        ... bench:       4,058 ns/iter (+/- 324) = 1971 MB/s
test gen_u64_chacha20        ... bench:       4,472 ns/iter (+/- 647) = 1788 MB/s
test gen_u64_chacha8         ... bench:       3,252 ns/iter (+/- 421) = 2460 MB/s
test init_chacha             ... bench:          30 ns/iter (+/- 7)

and here's chacha20:

running 16 tests
test gen_bytes_chacha12      ... bench:   1,184,161 ns/iter (+/- 75,680) = 864 MB/s
test gen_bytes_chacha20      ... bench:   1,894,244 ns/iter (+/- 16,309) = 540 MB/s
test gen_bytes_chacha8       ... bench:     824,644 ns/iter (+/- 52,944) = 1241 MB/s
test gen_u32_chacha12        ... bench:       4,607 ns/iter (+/- 97) = 868 MB/s
test gen_u32_chacha20        ... bench:       7,351 ns/iter (+/- 121) = 544 MB/s
test gen_u32_chacha8         ... bench:       3,257 ns/iter (+/- 407) = 1228 MB/s
test gen_u64_chacha12        ... bench:       9,396 ns/iter (+/- 1,104) = 851 MB/s
test gen_u64_chacha20        ... bench:      14,856 ns/iter (+/- 750) = 538 MB/s
test gen_u64_chacha8         ... bench:       6,431 ns/iter (+/- 625) = 1243 MB/s
test init_chacha             ... bench:          14 ns/iter (+/- 0)

Hmm, looks like chacha20 needs some help:

$ export RUSTFLAGS="-C target-cpu=native"
$ cargo bench --bench generators chacha
...
test gen_bytes_chacha12      ... bench:     500,699 ns/iter (+/- 28,157) = 2045 MB/s
test gen_bytes_chacha20      ... bench:     789,107 ns/iter (+/- 52,585) = 1297 MB/s
test gen_bytes_chacha8       ... bench:     357,024 ns/iter (+/- 23,501) = 2868 MB/s
test gen_u32_chacha12        ... bench:       2,139 ns/iter (+/- 56) = 1870 MB/s
test gen_u32_chacha20        ... bench:       3,260 ns/iter (+/- 116) = 1226 MB/s
test gen_u32_chacha8         ... bench:       1,571 ns/iter (+/- 32) = 2546 MB/s
test gen_u64_chacha12        ... bench:       4,472 ns/iter (+/- 55) = 1788 MB/s
test gen_u64_chacha20        ... bench:       6,714 ns/iter (+/- 421) = 1191 MB/s
test gen_u64_chacha8         ... bench:       3,338 ns/iter (+/- 46) = 2396 MB/s

Closer, but still behind rand_chacha (which gets negligible boost from target-cpu=native thanks to auto-detection).

Running chacha20's built-in benchmarks, I get around 6-6.2 cycles/byte without target-cpu=native, and 2.5-2.7 with; this is significantly short of the 1.4 cycles/byte @tarcieri claims so something gives here (perhaps just CPU-specific optimisations).

Of course, there's more to this than a few stats, and number-of-unsafe-usages is not a particularly useful comparison (since it says nothing about the size of the unsafe blocks). This is all I have time for right now. Thanks to all authors (also significantly @newpavlov).

tarcieri commented 4 years ago

You'll need -Ctarget-feature=+avx2 to enable the AVX2 backend.

tarcieri commented 4 years ago

Regarding unsafe, most of those are core::arch invocations of CPU intrinsics, which are presently all unsafe but aside from unaligned loads and stores otherwise have safe semantics (and ideally, in future versions of Rust, could be safe).

Probably the most questionable use of unsafe is here:

https://github.com/RustCrypto/stream-ciphers/blob/master/chacha20/src/rng.rs#L74

It'd be good to evaluate various safe ways of replacing that. Really it's just a same-size type conversion which ideally could be done with a type cast or using [u8; BUFFER_SIZE] in the first place, which isn't possible given the current trait bounds (since BUFFER_SIZE is 64 or 128-bytes, and many traits are only impl'd for arrays up to 32 elements)

dhardy commented 4 years ago

I get the same results with RUSTFLAGS="-C target-cpu=native -C target-feature=+avx2".

tarcieri commented 4 years ago

I just benchmarked the chacha20 crate w\ AVX2 against c2-chacha and got proportionally similar numbers.

These are on a 2.3 GHz Intel Core i9, benchmarked with -Ctarget-cpu=native (which I guess activates +avx2)

`chacha20` crate

stream-cipher/apply_keystream/1024
                        time:   [1310.7990 cycles 1317.8652 cycles 1325.9773 cycles]
                        thrpt:  [1.2949 cpb 1.2870 cpb 1.2801 cpb]
stream-cipher/apply_keystream/2048
                        time:   [2622.9998 cycles 2632.6188 cycles 2644.3250 cycles]
                        thrpt:  [1.2912 cpb 1.2855 cpb 1.2808 cpb]
stream-cipher/apply_keystream/4096
                        time:   [5221.1966 cycles 5244.4537 cycles 5274.3516 cycles]
                        thrpt:  [1.2877 cpb 1.2804 cpb 1.2747 cpb]
stream-cipher/apply_keystream/8192
                        time:   [10414.4934 cycles 10429.8827 cycles 10447.8984 cycles]
                        thrpt:  [1.2754 cpb 1.2732 cpb 1.2713 cpb]\
stream-cipher/apply_keystream/16384
                        time:   [20781.5026 cycles 20813.0523 cycles 20848.4600 cycles]
                        thrpt:  [1.2725 cpb 1.2703 cpb 1.2684 cpb]

`c2-chacha`

c2-chacha/apply_keystream/1024
                        time:   [901.0832 cycles 903.5886 cycles 907.2423 cycles]
                        thrpt:  [0.8860 cpb 0.8824 cpb 0.8800 cpb]
c2-chacha/apply_keystream/2048
                        time:   [1806.3139 cycles 1809.9486 cycles 1814.3681 cycles]
                        thrpt:  [0.8859 cpb 0.8838 cpb 0.8820 cpb]
c2-chacha/apply_keystream/4096
                        time:   [3621.2403 cycles 3634.5047 cycles 3649.5313 cycles]
                        thrpt:  [0.8910 cpb 0.8873 cpb 0.8841 cpb]
c2-chacha/apply_keystream/8192
                        time:   [7154.2516 cycles 7176.5530 cycles 7204.1146 cycles]
                        thrpt:  [0.8794 cpb 0.8760 cpb 0.8733 cpb]
c2-chacha/apply_keystream/16384
                        time:   [14277.4103 cycles 14351.7591 cycles 14445.5426 cycles]
                        thrpt:  [0.8817 cpb 0.8760 cpb 0.8714 cpb]

So ~1.3 cpb for the chacha20 crate, and ~0.9 cpb for the c2-chacha crate (i.e. c2-chacha is ~45% faster)

It looks like ppv-lite86 is capable of using AVX2 as well.

str4d commented 2 years ago

I've made some significant improvements to chacha20's performance in https://github.com/RustCrypto/stream-ciphers/pull/261. I believe the remaining gap between it and c2-chacha would be addressed by https://github.com/RustCrypto/stream-ciphers/issues/262.

dhardy commented 1 year ago

This issue is now 32 months old! Is it worth revisiting?

What are the core motivations?
How has the crate changed since this issue was opened? I see that cipher is now a required dependency.

If we don't do this, possibly we should instead support zeroize for rand_chacha.

tarcieri commented 1 year ago

@dhardy the v0.9 release made cipher a hard dependency. It also removed the RNG implementations with the goal of being able to generically use any stream cipher as an RNG, but that support never shipped, so removing the RNG implementations was probably premature.

I'd be happy to add the RNG types back and make the cipher dependency optional if there's interest in using chacha20 in lieu of rand_chacha.

dhardy commented 1 year ago

The reason that never happened so far is because (1) there is not (to me) a clear motivation for switching and (2) this crate hasn't seen a lot of maintenance activity recently. I'm not aware of any problem with rand_chacha, though perhaps it should support zeroize.

A motivation against is the larger number of crates in the dependency tree; something many would not consider an issue but a few people have brought it up in the past.

❯ cargo tree
chacha20 v0.9.0 (/home/dhardy/projects/rand/stream-ciphers/chacha20)
├── cfg-if v1.0.0
├── cipher v0.4.3
│   ├── blobby v0.3.1
│   ├── crypto-common v0.1.6
│   │   ├── generic-array v0.14.6
│   │   │   └── typenum v1.15.0
│   │   │   [build-dependencies]
│   │   │   └── version_check v0.9.4
│   │   └── typenum v1.15.0
│   └── inout v0.1.3
│       └── generic-array v0.14.6 (*)
└── cpufeatures v0.2.5

❯ cargo tree
rand v0.8.5 (/home/dhardy/projects/rand/rand)
├── libc v0.2.133
├── rand_chacha v0.3.1 (/home/dhardy/projects/rand/rand/rand_chacha)
│   ├── ppv-lite86 v0.2.16
│   └── rand_core v0.6.4 (/home/dhardy/projects/rand/rand/rand_core)
│       └── getrandom v0.2.7
│           ├── cfg-if v1.0.0
│           └── libc v0.2.133
└── rand_core v0.6.4 (/home/dhardy/projects/rand/rand/rand_core) (*)

9 for chacha20 vs 7 for the whole of rand (with default features).

tarcieri commented 1 year ago

Yeah, we could make the cipher dependency optional again like it was in v0.8 of chacha20

newpavlov commented 1 year ago

cipher became a required dependency because this issue has not moved anywhere. It allowed to simplify code a bit, but it's relatively easy to make it optional again. We can do it, if the number of dependencies is the only roadblock.

The main motivation is to unify code bases, so improvements in chacha20 would be immediately shared with the rand ecosystem and potential rand contributors would help us as well. As a minor improvement it would also reduce size of dependency tree and resulting binaries for projects which use both ThreadRng and chacha20.

dhardy commented 1 year ago

We should ensure @kazcw has the chance to comment here too.

so improvements in chacha20 would be immediately shared with the rand ecosystem and potential rand contributors would help us as well

Since the libraries don't have a huge scope and are basically done already, is this actually motivation?

As a minor improvement it would also reduce size of dependency tree and resulting binaries for projects which use both ThreadRng and chacha20.

Is this common?

My POV is still: there is little real motivation either way. But if enough other people feel strongly about this, it will likely persuade me.

tarcieri commented 1 year ago

@dhardy I wouldn’t consider any of these libraries “done” per se.

AFAIK none of them have e.g. ARM NEON support and only provide SIMD backends on x86.

Once std::simd is stable (or even before) it would probably be good to have an implementation of ChaCha which leverages it to provide portable SIMD to other platforms as well (which could potentially also provide NEON support as well)

tarcieri commented 9 months ago

@nstilt1 I'm not sure that's relevant to this issue, which is about retiring rand_chacha in favor of chacha20

nstilt1 commented 9 months ago

I saw this issue when I was looking for issues regarding the lack of support for zeroize in rand_chacha. Then, I was looking into the open issues of ppv-lite86 to see if I could handle some issues and possibly pack them into my open pull request. There are several significant open issues in that library, including: 1) It doesn't yet have implementations for NEON 2) @kazcw was looking to do a Monotrait Refactor, and it probably wouldn't be wise to add support for NEON prior to the refactor. https://github.com/cryptocorrosion/cryptocorrosion/issues/28

I was going to attempt the refactor process with help from AutoGPT, but then I realized that chacha20 already supports NEON and zeroize... so... I'm thinking of switching to chacha20 in my personal project and adding the CryptoRng impl for it in my fork of rand because I'm not quite a macro magician :/

dhardy commented 9 months ago

@nstilt1 thanks for providing a rationale for this change.

Would someone like to open a PR?

nstilt1 commented 9 months ago

Would someone like to open a PR?

I am working on it, but I've found a slight difference in implementation within chacha20.

To maintain full backwards compatibility of functions/function signatures of ChaChaXRng, the new implementation of chacha20 will need a 64-bit counter instead of 32 bits. It seems like the SIMD backend would need some relatively minor adjustments to make that possible. For most use cases of an RNG, people probably won't need to generate 1 ZiB of data from a seed + stream_id in ChaChaXRng. The output limit with a 32-bit counter is about 256 GiB. My current use case for ChaChaXRng requires significantly less than this limit.

Is it okay to keep it at a 32-bit counter, or do we want to make it 64 bits? The stream_id would then need to be 96 bits if the counter was kept at 32 bits. Optionally, there could be a feature that uses a 32-bit counter instead of 64, but... the backend would still need to be adapted to use a 64 bit counter.

tarcieri commented 9 months ago

The 32-bit vs 64-bit counters are one of the main differences between the IETF flavor of ChaCha20 (which the chacha20 crate implements) versus the legacy djb construction (which we also implement, albeit with a 32-bit counter). The only tangible difference is what happens after generating 256 GiB of keystream.

We definitely want the IETF variant to use a 32-bit counter. If someone is really interested in changing ChaCha20Legacy to use a 64-bit counter, that would be OK although probably a decent amount of work.

I would personally suggest always using the IETF construction everywhere possible so as to phase out the legacy construction (and really you should probably rekey before you hit 256 GiB of keystream anyway)

dhardy commented 9 months ago

In my opinion, minor API breaking changes are fine. I agree that most people won't need to generate more than 256 GiB. Feature-flags are probably not a good solution here.

nstilt1 commented 9 months ago

I went ahead and benched the chacha20 implementation and c2-chacha. Here are the results. With avx2 on an i9-13900k:

chacha-SIMD-comparison/fill_bytes/1024                                                                            
                        time:   [1049.2706 cycles 1053.0742 cycles 1057.4638 cycles]
                        thrpt:  [1.0327 cpb 1.0284 cpb 1.0247 cpb]
                 change:
                        time:   [-5.6030% -4.9463% -4.1824%] (p = 0.00 < 0.05)
                        thrpt:  [+4.3649% +5.2037% +5.9355%]
chacha-SIMD-comparison/fill_bytes/2048                                                                             
                        time:   [2129.9202 cycles 2136.4151 cycles 2143.7658 cycles]
                        thrpt:  [1.0468 cpb 1.0432 cpb 1.0400 cpb]
                 change:
                        time:   [-4.7206% -4.0955% -3.4680%] (p = 0.00 < 0.05)
                        thrpt:  [+3.5926% +4.2704% +4.9545%]
chacha-SIMD-comparison/fill_bytes/4096                                                                             
                        time:   [4282.8065 cycles 4308.5203 cycles 4338.9166 cycles]
                        thrpt:  [1.0593 cpb 1.0519 cpb 1.0456 cpb]
                 change:
                        time:   [-4.4289% -3.4410% -2.3134%] (p = 0.00 < 0.05)
                        thrpt:  [+2.3682% +3.5636% +4.6341%]
chacha-SIMD-comparison/fill_bytes/8192                                                                             
                        time:   [8501.0517 cycles 8542.9225 cycles 8586.8622 cycles]
                        thrpt:  [1.0482 cpb 1.0428 cpb 1.0377 cpb]
                 change:
                        time:   [-6.3241% -5.6064% -4.8652%] (p = 0.00 < 0.05)
                        thrpt:  [+5.1140% +5.9394% +6.7511%]
chacha-SIMD-comparison/fill_bytes/16384                                                                             
                        time:   [16949.4996 cycles 17016.6386 cycles 17091.7128 cycles]
                        thrpt:  [1.0432 cpb 1.0386 cpb 1.0345 cpb]
                 change:
                        time:   [-6.8543% -6.1072% -5.4542%] (p = 0.00 < 0.05)
                        thrpt:  [+5.7688% +6.5045% +7.3586%]

With neon on an M1:

chacha-SIMD-comparison/fill_bytes/1024                        
                        time:   [794.99 ns 795.52 ns 796.28 ns]
                        thrpt:  [1.1977 GiB/s 1.1988 GiB/s 1.1996 GiB/s]
                 change:
                        time:   [-62.108% -62.046% -61.983%] (p = 0.00 < 0.05)
                        thrpt:  [+163.04% +163.47% +163.91%]
chacha-SIMD-comparison/fill_bytes/2048                        
                        time:   [1.5936 µs 1.5951 µs 1.5968 µs]
                        thrpt:  [1.1945 GiB/s 1.1958 GiB/s 1.1969 GiB/s]
                 change:
                        time:   [-62.298% -62.164% -62.048%] (p = 0.00 < 0.05)
                        thrpt:  [+163.49% +164.30% +165.24%]
chacha-SIMD-comparison/fill_bytes/4096                        
                        time:   [3.2040 µs 3.2097 µs 3.2154 µs]
                        thrpt:  [1.1864 GiB/s 1.1885 GiB/s 1.1906 GiB/s]
                 change:
                        time:   [-62.041% -61.949% -61.858%] (p = 0.00 < 0.05)
                        thrpt:  [+162.18% +162.80% +163.44%]
chacha-SIMD-comparison/fill_bytes/8192                        
                        time:   [6.4385 µs 6.4500 µs 6.4624 µs]
                        thrpt:  [1.1806 GiB/s 1.1829 GiB/s 1.1850 GiB/s]
                 change:
                        time:   [-62.059% -61.963% -61.873%] (p = 0.00 < 0.05)
                        thrpt:  [+162.28% +162.90% +163.57%]
chacha-SIMD-comparison/fill_bytes/16384                        
                        time:   [12.881 µs 12.897 µs 12.913 µs]
                        thrpt:  [1.1816 GiB/s 1.1831 GiB/s 1.1846 GiB/s]
                 change:
                        time:   [-62.002% -61.928% -61.851%] (p = 0.00 < 0.05)
                        thrpt:  [+162.13% +162.66% +163.17%]

I got these comparisons by running a bench with rand_chacha first, then chacha20 in the next run... not sure if there's a more automated way to compare them like this without manually running it twice. The source is available in https://github.com/nstilt1/stream-ciphers-rng/tree/master/benches

nstilt1 commented 9 months ago

Do we want to add Rngs for XChaCha and ChaCha Legacy as well?

dhardy commented 9 months ago

I don't think so. The objective should be to provide strong RNGs which remain value-compatible with the previously-supported versions.

tarcieri commented 9 months ago

Yeah, the other variants are mostly-the-same enough it doesn't provide much value adding them as alternatives.

Nice performance numbers there!

nstilt1 commented 8 months ago

How unsafe is this: https://github.com/RustCrypto/stream-ciphers/pull/333/commits/f5b9a4df1988caccdf618e79f67f6b7571bc8a4c

vks commented 8 months ago

Very nice performance numbers indeed, this gets us close to the throughput of the reference implementation, which is about 0.7 cycles / byte with AVX2. It's also nice to see that chacha20 is faster than c2-chacha, this was not always the case.

dhardy commented 8 months ago

How unsafe is this:

Do we need those AsMut / AsRef impls? It's not clear from the methods what the use-case is and thus whether the results are being consumed correctly; if this is for internal usage there's probably a better approach.

// Unsafe conversion, assuming continuous memory layout

The word you want is contiguous.

But do we need unsafe at all?

nstilt1 commented 8 months ago

The AsMut / AsRef impls are for BlockRngCore. We don't exactly need unsafe, but it can make the code look a little bit neater since it allows for a quick, copy-free conversion between [u8] and [u32], and between [[u8]] and [u32].

nstilt1 commented 8 months ago

if this is for internal usage there's probably a better approach.

What are you thinking of?

tarcieri commented 8 months ago

For internal use you can use a private method rather than a public trait

nstilt1 commented 8 months ago

For internal use you can use a private method rather than a public trait

I'm not quite sure what you are referring to. I saw that I used KeyIvInit in the from_seed() definition, which is possibly unnecessary. Is that what you were referring to? I went ahead and added another method to replace KeyIvInit's usage with a &[u8; 32] which could either be placed in lib.rs or it could be another method in AlteredState, but we could also just get rid of AlteredState and copy the code where its methods are called.

I also saw that from_seed(), set_stream(), set_stream_bytes(), and set_word_pos() don't take borrowed input parameters, which might not be good if a user wants all values zeroized. I went ahead and added set_word_pos_bytes() but I ran into a potential issue with from_seed(), where the Seed type requires Default and AsMut impls.

I went ahead and added a Seed wrapper that impls ZeroizeOnDrop. I could do the same for the other inputs, but the seed is probably the most important input to be zeroized. If I use a wrapper for set_word_pos()'s input, I could also impl From<[u8; 8]> and eliminate the set_word_pos_bytes() method. Will get started on that now, and go from there.

nstilt1 commented 6 months ago

I saw that issue, and I agree that it could be beneficial to have .set_word_pos() and a separate .set_block_pos().

I've got a branch of chacha20 where fill_bytes() can write blocks directly to the input array using pointers, rather than using the buffer. However, there are 2 conditions that are required for the RNG to write directly to the input array:

self.index must be at the maximum value, meaning it would need to generate blocks the next time the RNG is used
the array must have at least 64 bytes that have yet to be written to

If self.index isn't at the maximum value at the beginning, the RNG will exhaust its buffer and then check again, this time knowing that self.index is already at the maximum value.

If everybody is okay with that functionality and it were to make it to release, then it could be detrimental to call .set_word_pos() prior to filling an array. This is the way that .set_word_pos() currently works

for a u64 input, the last 4 bits of the first little-endian byte determines the RNG's index, setting it to a value between 0 and 15. This is approximately the same as rand_chacha
- this could be changed to set self.index to the maximum value instead of 0
for a [u8; 5] input, the last 4 bits of the last little-endian byte determines the RNG's index, and it sets the index the same way
- this could also be changed as above

With a set_block_pos(), it would be possible to change the block_pos without a chance of interfering with the index, and thus, without interfering with fill_bytes().

My reasoning behind the alternative inputs is for a specific use-case. Many cryptographic operations output bytes... particularly Key Derivation Functions (KDFs) and HKDFs. Using alternative byte inputs removes some conversions when trying to set some values arbitrarily.

nstilt1 commented 6 months ago

Should chacha20 use BlockRng? If so, BlockRngCore::Results would need to have Zeroize added as a bound when its feature is enabled.

dhardy commented 6 months ago

This is the type of use-case BlockRng exists for, so if changes are needed there then please do so.

nstilt1 commented 3 months ago

I think I have done most of what I can with the chacha20 PR. If anyone wants to make any adjustments, go on ahead.

There are a few things that might be unnecessary, such as the WordPosInput, StreamId, and BlockPos structs. They were originally there as a way to conveniently initialize a ChaCha Rng from an HKDF... but then I realized what an HKDF is for. The structs are mildly convenient for testing, and some people might find a use for them... but other than that, they might be unnecessary.

There's also the Seed struct, whose main purpose is to ensure zeroize on drop of the seed, but it could be mildly simpler if SeedableRng could take a borrowed slice, or if there was a SeedableCryptoRng.

gendx commented 3 weeks ago

I support switching the underlying implementation from ppv-lite86 to the chacha20 crate:

The RustCrypto ecosystem is regularly maintained (https://github.com/RustCrypto/stream-ciphers) whereas the last commit for ppv-lite86 was 1.5 years ago (November 2022) (https://github.com/cryptocorrosion/cryptocorrosion).
A soundness issue has been reported last year in ppv-lite86 and isn't addressed (https://github.com/cryptocorrosion/cryptocorrosion/pull/72).
The chacha20 crate provides optimized SIMD implementations for various platforms: SSE2, AVX2, NEON (https://github.com/RustCrypto/stream-ciphers/tree/master/chacha20/src/backends). ppv-lite86 appears to only support SSE2 (https://github.com/cryptocorrosion/cryptocorrosion/tree/master/utils-simd/ppv-lite86/src/x86_64). I haven't run any benchmarks to compare the performance though.
The types defined by ppv-lite86 will likely be superseded by the std::simd types (once these stabilize).

rust-random / rand