Open dhardy opened 4 years ago
You'll need -Ctarget-feature=+avx2
to enable the AVX2 backend.
Regarding unsafe
, most of those are core::arch
invocations of CPU intrinsics, which are presently all unsafe but aside from unaligned loads and stores otherwise have safe semantics (and ideally, in future versions of Rust, could be safe).
Probably the most questionable use of unsafe is here:
https://github.com/RustCrypto/stream-ciphers/blob/master/chacha20/src/rng.rs#L74
It'd be good to evaluate various safe ways of replacing that. Really it's just a same-size type conversion which ideally could be done with a type cast or using [u8; BUFFER_SIZE]
in the first place, which isn't possible given the current trait bounds (since BUFFER_SIZE is 64 or 128-bytes, and many traits are only impl'd for arrays up to 32 elements)
I get the same results with RUSTFLAGS="-C target-cpu=native -C target-feature=+avx2"
.
I just benchmarked the chacha20
crate w\ AVX2 against c2-chacha
and got proportionally similar numbers.
These are on a 2.3 GHz Intel Core i9, benchmarked with -Ctarget-cpu=native
(which I guess activates +avx2
)
chacha20
cratestream-cipher/apply_keystream/1024
time: [1310.7990 cycles 1317.8652 cycles 1325.9773 cycles]
thrpt: [1.2949 cpb 1.2870 cpb 1.2801 cpb]
stream-cipher/apply_keystream/2048
time: [2622.9998 cycles 2632.6188 cycles 2644.3250 cycles]
thrpt: [1.2912 cpb 1.2855 cpb 1.2808 cpb]
stream-cipher/apply_keystream/4096
time: [5221.1966 cycles 5244.4537 cycles 5274.3516 cycles]
thrpt: [1.2877 cpb 1.2804 cpb 1.2747 cpb]
stream-cipher/apply_keystream/8192
time: [10414.4934 cycles 10429.8827 cycles 10447.8984 cycles]
thrpt: [1.2754 cpb 1.2732 cpb 1.2713 cpb]\
stream-cipher/apply_keystream/16384
time: [20781.5026 cycles 20813.0523 cycles 20848.4600 cycles]
thrpt: [1.2725 cpb 1.2703 cpb 1.2684 cpb]
c2-chacha
c2-chacha/apply_keystream/1024
time: [901.0832 cycles 903.5886 cycles 907.2423 cycles]
thrpt: [0.8860 cpb 0.8824 cpb 0.8800 cpb]
c2-chacha/apply_keystream/2048
time: [1806.3139 cycles 1809.9486 cycles 1814.3681 cycles]
thrpt: [0.8859 cpb 0.8838 cpb 0.8820 cpb]
c2-chacha/apply_keystream/4096
time: [3621.2403 cycles 3634.5047 cycles 3649.5313 cycles]
thrpt: [0.8910 cpb 0.8873 cpb 0.8841 cpb]
c2-chacha/apply_keystream/8192
time: [7154.2516 cycles 7176.5530 cycles 7204.1146 cycles]
thrpt: [0.8794 cpb 0.8760 cpb 0.8733 cpb]
c2-chacha/apply_keystream/16384
time: [14277.4103 cycles 14351.7591 cycles 14445.5426 cycles]
thrpt: [0.8817 cpb 0.8760 cpb 0.8714 cpb]
So ~1.3 cpb for the chacha20
crate, and ~0.9 cpb for the c2-chacha
crate (i.e. c2-chacha
is ~45% faster)
It looks like ppv-lite86
is capable of using AVX2 as well.
I've made some significant improvements to chacha20
's performance in https://github.com/RustCrypto/stream-ciphers/pull/261. I believe the remaining gap between it and c2-chacha
would be addressed by https://github.com/RustCrypto/stream-ciphers/issues/262.
This issue is now 32 months old! Is it worth revisiting?
cipher
is now a required dependency.If we don't do this, possibly we should instead support zeroize
for rand_chacha
.
@dhardy the v0.9 release made cipher
a hard dependency. It also removed the RNG implementations with the goal of being able to generically use any stream cipher as an RNG, but that support never shipped, so removing the RNG implementations was probably premature.
I'd be happy to add the RNG types back and make the cipher
dependency optional if there's interest in using chacha20
in lieu of rand_chacha
.
The reason that never happened so far is because (1) there is not (to me) a clear motivation for switching and (2) this crate hasn't seen a lot of maintenance activity recently. I'm not aware of any problem with rand_chacha
, though perhaps it should support zeroize
.
A motivation against is the larger number of crates in the dependency tree; something many would not consider an issue but a few people have brought it up in the past.
❯ cargo tree
chacha20 v0.9.0 (/home/dhardy/projects/rand/stream-ciphers/chacha20)
├── cfg-if v1.0.0
├── cipher v0.4.3
│ ├── blobby v0.3.1
│ ├── crypto-common v0.1.6
│ │ ├── generic-array v0.14.6
│ │ │ └── typenum v1.15.0
│ │ │ [build-dependencies]
│ │ │ └── version_check v0.9.4
│ │ └── typenum v1.15.0
│ └── inout v0.1.3
│ └── generic-array v0.14.6 (*)
└── cpufeatures v0.2.5
❯ cargo tree
rand v0.8.5 (/home/dhardy/projects/rand/rand)
├── libc v0.2.133
├── rand_chacha v0.3.1 (/home/dhardy/projects/rand/rand/rand_chacha)
│ ├── ppv-lite86 v0.2.16
│ └── rand_core v0.6.4 (/home/dhardy/projects/rand/rand/rand_core)
│ └── getrandom v0.2.7
│ ├── cfg-if v1.0.0
│ └── libc v0.2.133
└── rand_core v0.6.4 (/home/dhardy/projects/rand/rand/rand_core) (*)
9 for chacha20
vs 7 for the whole of rand
(with default features).
Yeah, we could make the cipher
dependency optional again like it was in v0.8 of chacha20
cipher
became a required dependency because this issue has not moved anywhere. It allowed to simplify code a bit, but it's relatively easy to make it optional again. We can do it, if the number of dependencies is the only roadblock.
The main motivation is to unify code bases, so improvements in chacha20
would be immediately shared with the rand
ecosystem and potential rand
contributors would help us as well. As a minor improvement it would also reduce size of dependency tree and resulting binaries for projects which use both ThreadRng
and chacha20
.
We should ensure @kazcw has the chance to comment here too.
so improvements in chacha20 would be immediately shared with the rand ecosystem and potential rand contributors would help us as well
Since the libraries don't have a huge scope and are basically done already, is this actually motivation?
As a minor improvement it would also reduce size of dependency tree and resulting binaries for projects which use both ThreadRng and chacha20.
Is this common?
My POV is still: there is little real motivation either way. But if enough other people feel strongly about this, it will likely persuade me.
@dhardy I wouldn’t consider any of these libraries “done” per se.
AFAIK none of them have e.g. ARM NEON support and only provide SIMD backends on x86.
Once std::simd
is stable (or even before) it would probably be good to have an implementation of ChaCha which leverages it to provide portable SIMD to other platforms as well (which could potentially also provide NEON support as well)
@nstilt1 I'm not sure that's relevant to this issue, which is about retiring rand_chacha
in favor of chacha20
I saw this issue when I was looking for issues regarding the lack of support for zeroize
in rand_chacha
. Then, I was looking into the open issues of ppv-lite86
to see if I could handle some issues and possibly pack them into my open pull request. There are several significant open issues in that library, including:
1) It doesn't yet have implementations for NEON
2) @kazcw was looking to do a Monotrait Refactor, and it probably wouldn't be wise to add support for NEON prior to the refactor. https://github.com/cryptocorrosion/cryptocorrosion/issues/28
I was going to attempt the refactor process with help from AutoGPT, but then I realized that chacha20
already supports NEON and zeroize
... so... I'm thinking of switching to chacha20
in my personal project and adding the CryptoRng
impl for it in my fork of rand
because I'm not quite a macro magician :/
@nstilt1 thanks for providing a rationale for this change.
Would someone like to open a PR?
Would someone like to open a PR?
I am working on it, but I've found a slight difference in implementation within chacha20.
To maintain full backwards compatibility of functions/function signatures of ChaChaXRng, the new implementation of chacha20 will need a 64-bit counter instead of 32 bits. It seems like the SIMD backend would need some relatively minor adjustments to make that possible. For most use cases of an RNG, people probably won't need to generate 1 ZiB of data from a seed + stream_id in ChaChaXRng. The output limit with a 32-bit counter is about 256 GiB. My current use case for ChaChaXRng requires significantly less than this limit.
Is it okay to keep it at a 32-bit counter, or do we want to make it 64 bits? The stream_id would then need to be 96 bits if the counter was kept at 32 bits. Optionally, there could be a feature that uses a 32-bit counter instead of 64, but... the backend would still need to be adapted to use a 64 bit counter.
The 32-bit vs 64-bit counters are one of the main differences between the IETF flavor of ChaCha20 (which the chacha20
crate implements) versus the legacy djb construction (which we also implement, albeit with a 32-bit counter). The only tangible difference is what happens after generating 256 GiB of keystream.
We definitely want the IETF variant to use a 32-bit counter. If someone is really interested in changing ChaCha20Legacy
to use a 64-bit counter, that would be OK although probably a decent amount of work.
I would personally suggest always using the IETF construction everywhere possible so as to phase out the legacy construction (and really you should probably rekey before you hit 256 GiB of keystream anyway)
In my opinion, minor API breaking changes are fine. I agree that most people won't need to generate more than 256 GiB. Feature-flags are probably not a good solution here.
I went ahead and benched the chacha20 implementation and c2-chacha. Here are the results. With avx2 on an i9-13900k:
chacha-SIMD-comparison/fill_bytes/1024
time: [1049.2706 cycles 1053.0742 cycles 1057.4638 cycles]
thrpt: [1.0327 cpb 1.0284 cpb 1.0247 cpb]
change:
time: [-5.6030% -4.9463% -4.1824%] (p = 0.00 < 0.05)
thrpt: [+4.3649% +5.2037% +5.9355%]
chacha-SIMD-comparison/fill_bytes/2048
time: [2129.9202 cycles 2136.4151 cycles 2143.7658 cycles]
thrpt: [1.0468 cpb 1.0432 cpb 1.0400 cpb]
change:
time: [-4.7206% -4.0955% -3.4680%] (p = 0.00 < 0.05)
thrpt: [+3.5926% +4.2704% +4.9545%]
chacha-SIMD-comparison/fill_bytes/4096
time: [4282.8065 cycles 4308.5203 cycles 4338.9166 cycles]
thrpt: [1.0593 cpb 1.0519 cpb 1.0456 cpb]
change:
time: [-4.4289% -3.4410% -2.3134%] (p = 0.00 < 0.05)
thrpt: [+2.3682% +3.5636% +4.6341%]
chacha-SIMD-comparison/fill_bytes/8192
time: [8501.0517 cycles 8542.9225 cycles 8586.8622 cycles]
thrpt: [1.0482 cpb 1.0428 cpb 1.0377 cpb]
change:
time: [-6.3241% -5.6064% -4.8652%] (p = 0.00 < 0.05)
thrpt: [+5.1140% +5.9394% +6.7511%]
chacha-SIMD-comparison/fill_bytes/16384
time: [16949.4996 cycles 17016.6386 cycles 17091.7128 cycles]
thrpt: [1.0432 cpb 1.0386 cpb 1.0345 cpb]
change:
time: [-6.8543% -6.1072% -5.4542%] (p = 0.00 < 0.05)
thrpt: [+5.7688% +6.5045% +7.3586%]
With neon on an M1:
chacha-SIMD-comparison/fill_bytes/1024
time: [794.99 ns 795.52 ns 796.28 ns]
thrpt: [1.1977 GiB/s 1.1988 GiB/s 1.1996 GiB/s]
change:
time: [-62.108% -62.046% -61.983%] (p = 0.00 < 0.05)
thrpt: [+163.04% +163.47% +163.91%]
chacha-SIMD-comparison/fill_bytes/2048
time: [1.5936 µs 1.5951 µs 1.5968 µs]
thrpt: [1.1945 GiB/s 1.1958 GiB/s 1.1969 GiB/s]
change:
time: [-62.298% -62.164% -62.048%] (p = 0.00 < 0.05)
thrpt: [+163.49% +164.30% +165.24%]
chacha-SIMD-comparison/fill_bytes/4096
time: [3.2040 µs 3.2097 µs 3.2154 µs]
thrpt: [1.1864 GiB/s 1.1885 GiB/s 1.1906 GiB/s]
change:
time: [-62.041% -61.949% -61.858%] (p = 0.00 < 0.05)
thrpt: [+162.18% +162.80% +163.44%]
chacha-SIMD-comparison/fill_bytes/8192
time: [6.4385 µs 6.4500 µs 6.4624 µs]
thrpt: [1.1806 GiB/s 1.1829 GiB/s 1.1850 GiB/s]
change:
time: [-62.059% -61.963% -61.873%] (p = 0.00 < 0.05)
thrpt: [+162.28% +162.90% +163.57%]
chacha-SIMD-comparison/fill_bytes/16384
time: [12.881 µs 12.897 µs 12.913 µs]
thrpt: [1.1816 GiB/s 1.1831 GiB/s 1.1846 GiB/s]
change:
time: [-62.002% -61.928% -61.851%] (p = 0.00 < 0.05)
thrpt: [+162.13% +162.66% +163.17%]
I got these comparisons by running a bench with rand_chacha first, then chacha20 in the next run... not sure if there's a more automated way to compare them like this without manually running it twice. The source is available in https://github.com/nstilt1/stream-ciphers-rng/tree/master/benches
Do we want to add Rngs for XChaCha and ChaCha Legacy as well?
I don't think so. The objective should be to provide strong RNGs which remain value-compatible with the previously-supported versions.
Yeah, the other variants are mostly-the-same enough it doesn't provide much value adding them as alternatives.
Nice performance numbers there!
Very nice performance numbers indeed, this gets us close to the throughput of the reference implementation, which is about 0.7 cycles / byte with AVX2. It's also nice to see that chacha20 is faster than c2-chacha, this was not always the case.
How unsafe is this:
Do we need those AsMut
/ AsRef
impls? It's not clear from the methods what the use-case is and thus whether the results are being consumed correctly; if this is for internal usage there's probably a better approach.
// Unsafe conversion, assuming continuous memory layout
The word you want is contiguous.
But do we need unsafe
at all?
The AsMut
/ AsRef
impls are for BlockRngCore
. We don't exactly need unsafe
, but it can make the code look a little bit neater since it allows for a quick, copy-free conversion between [u8]
and [u32]
, and between [[u8]]
and [u32]
.
if this is for internal usage there's probably a better approach.
What are you thinking of?
For internal use you can use a private method rather than a public trait
For internal use you can use a private method rather than a public trait
I'm not quite sure what you are referring to. I saw that I used KeyIvInit
in the from_seed()
definition, which is possibly unnecessary. Is that what you were referring to? I went ahead and added another method to replace KeyIvInit
's usage with a &[u8; 32]
which could either be placed in lib.rs
or it could be another method in AlteredState
, but we could also just get rid of AlteredState
and copy the code where its methods are called.
I also saw that from_seed()
, set_stream()
, set_stream_bytes()
, and set_word_pos()
don't take borrowed input parameters, which might not be good if a user wants all values zeroized. I went ahead and added set_word_pos_bytes()
but I ran into a potential issue with from_seed()
, where the Seed
type requires Default
and AsMut
impls.
I went ahead and added a Seed wrapper that impls ZeroizeOnDrop
. I could do the same for the other inputs, but the seed is probably the most important input to be zeroized. If I use a wrapper for set_word_pos()
's input, I could also impl From<[u8; 8]>
and eliminate the set_word_pos_bytes()
method. Will get started on that now, and go from there.
I saw that issue, and I agree that it could be beneficial to have .set_word_pos()
and a separate .set_block_pos()
.
I've got a branch of chacha20
where fill_bytes()
can write blocks directly to the input array using pointers, rather than using the buffer. However, there are 2 conditions that are required for the RNG to write directly to the input array:
self.index
must be at the maximum value, meaning it would need to generate blocks the next time the RNG is usedIf self.index
isn't at the maximum value at the beginning, the RNG will exhaust its buffer and then check again, this time knowing that self.index
is already at the maximum value.
If everybody is okay with that functionality and it were to make it to release, then it could be detrimental to call .set_word_pos()
prior to filling an array. This is the way that .set_word_pos()
currently works
u64
input, the last 4 bits of the first little-endian byte determines the RNG's index, setting it to a value between 0 and 15. This is approximately the same as rand_chacha
self.index
to the maximum value instead of 0[u8; 5]
input, the last 4 bits of the last little-endian byte determines the RNG's index, and it sets the index the same way
With a set_block_pos()
, it would be possible to change the block_pos
without a chance of interfering with the index, and thus, without interfering with fill_bytes()
.
My reasoning behind the alternative inputs is for a specific use-case. Many cryptographic operations output bytes... particularly Key Derivation Functions (KDFs) and HKDFs. Using alternative byte inputs removes some conversions when trying to set some values arbitrarily.
Should chacha20
use BlockRng
? If so, BlockRngCore::Results
would need to have Zeroize
added as a bound when its feature is enabled.
This is the type of use-case BlockRng
exists for, so if changes are needed there then please do so.
I think I have done most of what I can with the chacha20
PR. If anyone wants to make any adjustments, go on ahead.
There are a few things that might be unnecessary, such as the WordPosInput
, StreamId
, and BlockPos
structs. They were originally there as a way to conveniently initialize a ChaCha Rng from an HKDF... but then I realized what an HKDF is for. The structs are mildly convenient for testing, and some people might find a use for them... but other than that, they might be unnecessary.
There's also the Seed
struct, whose main purpose is to ensure zeroize on drop of the seed, but it could be mildly simpler if SeedableRng
could take a borrowed slice, or if there was a SeedableCryptoRng
.
I support switching the underlying implementation from ppv-lite86
to the chacha20
crate:
ppv-lite86
was 1.5 years ago (November 2022) (https://github.com/cryptocorrosion/cryptocorrosion).ppv-lite86
and isn't addressed (https://github.com/cryptocorrosion/cryptocorrosion/pull/72).chacha20
crate provides optimized SIMD implementations for various platforms: SSE2, AVX2, NEON (https://github.com/RustCrypto/stream-ciphers/tree/master/chacha20/src/backends). ppv-lite86
appears to only support SSE2 (https://github.com/cryptocorrosion/cryptocorrosion/tree/master/utils-simd/ppv-lite86/src/x86_64). I haven't run any benchmarks to compare the performance though.ppv-lite86
will likely be superseded by the std::simd
types (once these stabilize).
@tarcieri has recently implemented ChaCha*Rng via the
chacha20
crate. docs are here. Breaking out into a new issue this isn't really the topic of #872...This brings the question: should we prefer this over the current (@kazcw's) implementation?
First things first, the
rand_chacha
crate has 44 reverse dependencies, is clearly a rand family crate and is recommended in our book. There is no plan to retire this crate.Second, lets look at a few stats.
Unsafe
Here are the results of running
cargo geiger
on the currentrand_core
(after #931):And on
chacha20
:Something's wrong here:
stream-chiper
is only a dev-dependency andrand_core
is depended on in exactly the same way as inrand_chacha
. So only the first line is relevant.Lines of code
Tokei output for
rand_chacha
:(83% of this is from
ppv-lite86
).For
chacha20
:There are no dependencies we need, so that's it. A nice improvement.
Benchmarks
(64-bit Haswell)
Here's
rand_chacha
:and here's
chacha20
:Hmm, looks like
chacha20
needs some help:Closer, but still behind
rand_chacha
(which gets negligible boost fromtarget-cpu=native
thanks to auto-detection).Running
chacha20
's built-in benchmarks, I get around 6-6.2 cycles/byte withouttarget-cpu=native
, and 2.5-2.7 with; this is significantly short of the 1.4 cycles/byte @tarcieri claims so something gives here (perhaps just CPU-specific optimisations).Of course, there's more to this than a few stats, and number-of-
unsafe
-usages is not a particularly useful comparison (since it says nothing about the size of theunsafe
blocks). This is all I have time for right now. Thanks to all authors (also significantly @newpavlov).