Closed dhardy closed 5 years ago
Using feature flags can affect dependencies, right? This might break assumptions of dependencies, so this seems dangerous. I would prefer to use different types instead and require the dependencies to be generic in the type of thread RNG.
Instead of providing several variants of `ThreadRng`, maybe it makes more sense to provide one conservative default and make it easy to construct a custom `ThreadRng` (i.e. by making it generic in the RNG). Users who know the trade-offs could pick the RNG best suited for their use case, maybe guided by a few recommendations in Rand.
I think `HashMap` should surely use a CSPRNG; otherwise it will likely be open to DoS attacks by an adversary who can observe random numbers (which would be a very plausible scenario, for instance, in a multiplayer game).
Yes, feature flags can affect dependencies, which is why I suggested what I did above.
Using a custom version of `ThreadRng` does not help reduce memory usage or per-thread initialisation time when dependencies are already using `thread_rng`.
> DoS attacks
This is why I would not recommend using such a feature flag on servers. Interesting idea trying to use such an attack in multiplayer games, but there is no benefit for synchronous-step models (common in RTS) or attacking the server in server-centric models (common for FPS), and even if it were achievable vs other players the consequences are not high.
Going back to my proposal, I prefer `fast_rng` over `weak_rng`:

- `strong_rng` for anything crypto
- `fast_rng` for stuff like `HashMap`, which by default is still a CSPRNG, but can be changed using a feature flag

> Using a custom version of `ThreadRng` does not help reduce memory usage or per-thread initialisation time when dependencies are already using `thread_rng`.
I tried to argue that dependencies should make it possible for the user to choose the RNG, which wouldn't have this problem. If the thread RNG is not exposed via the API, it is an implementation detail and I think it should not be affected by feature flags, because it can have unintended consequences.
Related to this, there are security trade-offs to decide:

Given independent `strong_rng` and `fast_rng` these protections can be applied only where necessary, although if the two use the same back-end that is not the case.
@vks I don't get what you mean; are you saying use `rand = { version = "0.5", features = ... }`? That is feature flags. Or are you saying don't include a `thread_rng` function / state but only a macro or something allowing one to set up one's own version easily? But then each dependency would add its own thread-local generator, increasing memory usage even further. All dependencies have to use the same back-end(s) to avoid high memory usage (where several libs use Rand within an app).
I agree that having a `fast_rng()` and a `secure_rng()` (or `strong_rng`) seems like a good way to go.
Based on that, I'd be tempted to provide very few security guarantees for `fast_rng()`. It's generally hard to get security from features that aren't designed for security, and I think it's great if we can optimize `fast_rng()` for performance rather than security.
I don't know enough about CSPRNGs to have opinions about what guarantees `secure_rng()` should provide.
On the topic of feature flags: having feature flags which change what guarantees `fast_rng()` and `secure_rng()` provide sounds very risky to me. That could result in someone wanting to relax security for library/feature X accidentally breaking security for feature Y as well.
A better approach would be to encourage libraries which use Rand to include feature flags which make them use different `Rng` implementations to provide different performance/security trade-offs.
While I partly agree with you @sicking, it is important to consider uses like in `HashMap`, which need unpredictable generators in some cases, but also fast ones. I think this is the original motivation for `thread_rng`.
I only see two solutions for this:

- `thread_rng`
A few comments on the first post, but I haven't thought everything through yet...
> Our current implementation of `thread_rng` tries to satisfy the above with a fast, secure PRNG, but at the cost of high memory usage and initialisation time per thread. For applications with a low number of long-running threads this is reasonable, but for many worker threads may not be ideal.
I am really not that concerned with the memory usage. While 4 KiB is a lot, it is also really not that much for any system that the standard library is available on.
The init time is something to worry about a bit more. On my PC it takes 5265 ns to initialize `ThreadRng`, of which only 400 ns are spent by `OsRng`. That is about 200,000 initializations per second, which I agree seems quite reasonable for long-running threads.
For the situation with many worker threads, isn't it better to use the scheme I ended up with of splitting an RNG using a wrapper such as `SplittableRng` (https://github.com/rust-lang-nursery/rand/pull/399#issuecomment-382434831)? Also, one lesson to take away from that thread was that the overhead of TLS really becomes a big part of the performance of a fast PRNG such as Xorshift.
Performance with `thread_rng`, with caching:

```
test thread_rng_u32 ... bench: 2,887 ns/iter (+/- 28) = 1385 MB/s
test thread_rng_u64 ... bench: 4,320 ns/iter (+/- 26) = 1851 MB/s
```

Performance with `thread_rng`, without caching `ThreadRng`:

```
test thread_rng_u32 ... bench: 7,916 ns/iter (+/- 34) = 505 MB/s
test thread_rng_u64 ... bench: 8,593 ns/iter (+/- 58) = 930 MB/s
```

Performance with `ReseedingRng<Hc128Core, ...>` replaced by `XorShiftRng`:

```
test thread_rng_u32   ... bench: 6,323 ns/iter (+/- 13) = 632 MB/s
test thread_rng_u64   ... bench: 6,698 ns/iter (+/- 62) = 1194 MB/s
test gen_u32_xorshift ... bench:   952 ns/iter (+/- 9)  = 4201 MB/s
test gen_u64_xorshift ... bench: 1,991 ns/iter (+/- 15) = 4018 MB/s
```
Retrieving the RNG from TLS greatly dominates the cost here.
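To illustrate why caching matters, here is a minimal sketch contrasting a TLS lookup on every draw with borrowing the thread-local state once for a whole batch. It uses a toy xorshift generator and `std`'s `thread_local!`; the function names and the generator are illustrative, not rand's actual implementation:

```rust
use std::cell::Cell;

// Toy xorshift64 step; a stand-in for the real generator (not secure).
fn xorshift64(mut x: u64) -> u64 {
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    x
}

thread_local! {
    // Per-thread state, analogous to thread_rng keeping its state in TLS.
    static STATE: Cell<u64> = Cell::new(0x9E37_79B9_7F4A_7C15);
}

// Uncached style: pays the TLS access on every single draw.
fn next_uncached() -> u64 {
    STATE.with(|s| {
        let x = xorshift64(s.get());
        s.set(x);
        x
    })
}

// "Cached" style: enter the TLS slot once and draw a whole batch inside it.
fn fill_batch(buf: &mut [u64]) {
    STATE.with(|s| {
        let mut x = s.get();
        for v in buf.iter_mut() {
            x = xorshift64(x);
            *v = x;
        }
        s.set(x);
    });
}

fn main() {
    let a = next_uncached();
    let mut batch = [0u64; 3];
    fill_batch(&mut batch);
    // xorshift64 is a bijection on non-zero values, so the stream stays non-zero.
    assert_ne!(a, 0);
    assert_ne!(batch[0], 0);
    println!("ok");
}
```

The real `ThreadRng` avoids the per-call lookup by handing out a cached handle; the sketch shows the same idea at the level of a single `with` call.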
> The last two options sound very risky to me — should we ask distributors and end-users to reason about the security of whole applications? It is quite possible that the people building applications — even developers — will not know about all uses of `thread_rng` requiring secure randomness.
Yes, that is a big argument. It is the call site that determines whether `ThreadRng` is good enough to use. A scheme where the application is able to weaken the guarantees seems unacceptable to me.
On the other hand, we already reserve the right to change the algorithm of `StdRng` to another secure one. So making it partly configurable should not be a problem.
> This brings me to ask, is having only a single user-facing function ideal? What if instead:
>
> - `strong_rng` replaces `thread_rng` as a source of cryptographically secure randomness
> - `weak_rng` is added as a fast source of randomness; depending on feature flags this could just wrap `strong_rng` or could be independent
Renaming `thread_rng` will be a very big breaking change; better to keep the name as it is in my opinion (or split that out into another discussion someday). And I still really dislike the name `weak_rng`. But I understand this is not about names, but about offering two kinds of functionality.
Still, having a thread-local variant using a fast RNG does not offer much advantage in the common case unless the RNG is cached.
> But first, why bother when generators like Randen and RDRAND claim to satisfy all requirements anyway? This is a good question to which I only have vague answers: Randen is still new and unproven and may have portability issues; RDRAND is not fully trusted and may not be the fastest option.
The performance of RDRAND, for comparison (1 value per iteration, instead of 1000 in the previous benchmarks):

```
test bench_u32 ... bench: 27 ns/iter (+/- 0) = 148 MB/s
test bench_u64 ... bench: 27 ns/iter (+/- 0) = 296 MB/s
```

RDRAND is 5-10× slower than the current `ThreadRng`.
> Another, very different, option is that we keep `thread_rng` looking like it is but remove `CryptoRng` support and recommend it not be used for crypto keys.
Would you really recommend `thread_rng` for generating keys? At least for now we should be very careful before recommending that.
But what is the `CryptoRng` trait for, if not to say the generator is acceptable for cryptography?
As I said in the first post, ultimately such usage is about trust and risk, but the `CryptoRng` trait signals an intention to be secure. I agree, we should definitely not make a generator marked by this configurable to be insecure, but is it worth going the other way? Possibly not; just leaving `thread_rng` as-is is another option.
I feel like there are two very common use cases which I think would be great to have very easy-to-access APIs for:
I was thinking 1 was `secure_rng()` and 2 was `fast_rng()`, but maybe that was wrong. In particular it might be better to encourage another API for 2, since something like `fast_rng()` requires a TLS access.

Maybe what I'm asking for for 2 is more dhardy/rand#60 plus an empty `::new()` function which seeds the returned object from something like `secure_rng()` or `thread_rng()`?
And to be clear, I think there are many uses of RNGs which do not fall into either of these categories. For these I think the way to go is having APIs which are explicit about which RNG algorithm is used and what the source of the seed is. That way developers can choose whatever algorithm provides the trade-offs that match their requirements.
I don't know if `HashSet` falls into category 2 or not. The fact that it currently uses a simple counter (though starting at a more strongly generated random number) makes me think that 2 is probably fine. But I haven't done enough research to be sure.
However, the point is that if 2 is not fine for `HashSet`, then it can use the solution from the last paragraph in the previous comment.
Though in reality I suspect that specifically `HashSet` won't use anything we're doing here, since it's part of std.
Category 2 (simply fast) is actually pretty easy. Our current `thread_rng` is very fast (as long as the handle is cached), though sometimes `SmallRng` will be faster (we say specifically in the docs that benchmarking is important since performance is context dependent).
But when it comes to seeding a hash algorithm for `HashMap` and `HashSet`, `std` currently uses an internal copy of the old `thread_rng` using the ISAAC generator, which is fast and possibly (hopefully) secure. This is because if the seed is known, an attacker can cause denial of service by causing a server to store many items in a single hash bucket (which thus get stored in a simple list). This could instead be solved with better algorithms (e.g. a tree-based map), but these are ~~usually~~ sometimes slower (actually, I think hash maps get over-used relative to their performance).
@sicking

> A fast rng. Primary feature is that this generates values fast.
I don't think this is well-defined. You can usually make RNGs faster by increasing the size of the state, because it allows for more instruction parallelism. Also, in which regard should it be fast? Initialization? Generating `u64`? Filling byte buffers?
@dhardy

> I don't get what you mean; you're saying use `rand = { version = "0.5", features = ... }`? That is feature flags. Or you're saying don't include a `thread_rng` function / state but only a macro or something allowing one to set up one's own version easily? But then each dependency would add its own thread-local generator, increasing memory usage even further. All dependencies have to use the same back-end(s) to avoid high memory usage (where several libs use Rand within an app).
I think the dependencies should use `struct A<R: Rng>`, `from_rng(&mut Rng)` and `rng_constructor: Fn() -> impl Rng` so the end user in the dependency chain can choose the RNG.
`struct A<R: Rng>` and `-> impl Rng` both require compile-time type definition. The current `thread_rng() -> ThreadRng` uses a wrapper type, so changing the implementation within this type is equivalent. We would use similar wrapper types for any new thread-local generators (and I think it only makes sense to use thread-local).
> You can usually make RNGs faster by increasing the size of the state, because it allows for more instruction parallelism.
This works for generating random bytes faster, but not necessarily for getting random data into other code faster. As noted in the docs, it's necessary to benchmark your code to find the fastest RNG, so it's pointless trying to find "the fastest RNG" for `thread_rng`.
A question:
In the latest release (rand 0.5.0), `thread_rng` doesn't seem to be included when I specify `no-default-features` (it worked before). Since compile time is still very much an issue, it would be nice to document which feature I have to enable to use `thread_rng` (to generate a random string).
It requires `std`, which is the only thing disabled when you use `no-default-features`. There is no way around that for now, and even if there were, it wouldn't help much with compile time. It looks to me like this is documented in the readme.
> You can usually make RNGs faster by increasing the size of the state, because it allows for more instruction parallelism.

There are not many RNGs for which this is true, but it does help a bit for Xorshift and Xoroshiro. It comes at the cost of using more registers, so it may work better in benchmarks than in real code.
I was also thinking of vectorization.
> For the situation with many worker threads, isn't it better to use the scheme I ended up with of splitting an RNG using a wrapper such as `SplittableRng` (#399 (comment))? Also one lesson to take away from that thread was that the overhead of TLS really becomes a big part of the performance of a fast PRNG such as Xorshift.
Is this really about the overhead of TLS, or about checking that the RNG was initialised? Because any call to `thread_rng()` must check it was initialised.
@pitdicker your `ParallelIterator` code uses `rng: R::from_rng(&mut self.rng).unwrap()`, which does a full initialisation of the new RNG — when using a PRNG like `Hc128Rng` this is slow. Of course switching stream/jumping would be faster but potentially less secure (one must be careful not to re-use streams and to avoid overlap).
More important, though, is that any approach using splitting won't work with the current API, since new threads would need a "master" to initialise from. I don't know if `std::thread::spawn` could be adapted to initialise the new thread's RNG or something, though even if it could it might not be desirable (since then threads not using an RNG would still pay the cost of initialising one). Hence this technique could be used locally in a `ParallelIterator` or similar, but I'm not sure how it could be used to initialise a thread-local RNG.
> Renaming thread_rng will be a very big breaking change
This is partially intentional, to force users to choose between `strong_rng` and `fast_rng`. On the other hand, I'm not really happy with the names `strong_rng`, `secure_rng` or `crypto_rng`, because they imply too much (see next point).
> Would you really recommend thread_rng for generating keys?
Not in its current form, no. With extra protections (also using RDRAND, or with forward secrecy), perhaps. But I don't think either of these can be added without a big performance decrease, so if we did add something like this it would be distinct from `thread_rng` (at least, it would be accessed in a different way).
> I was also thinking of vectorization.
I don't think this has any extra requirements here. @pitdicker has been experimenting with using `fill_bytes` for vectorization; alternatively we could add methods like `next_u128` etc., though that's off topic here. Suffice it to say I don't see any reason we can't optimise `thread_rng` for SIMD if we choose to.
After reviewing this again I only really see these options:

- Keep the `thread_rng` API roughly as it is now (but still allowing #404 if we want that)
- Above, but make `thread_rng` configurable, but only with CSPRNG implementations (e.g. allow ChaCha for less memory but slower).
- Above, but also add `fast_rng`. `fast_rng` may be predictable. It might be optimised for low-memory requirements and serial usage or might be optimised more for SIMD, or it might simply use `thread_rng` internally; this is up to us (and might be configurable).

Any further thoughts? I like the idea of the latter but it does make Rand more complex.
> Is this really about the overhead of TLS or about checking that the RNG was initialised? Because any call to `thread_rng()` must check it was initialised.
The details of how thread-local storage works are still a bit fuzzy to me, and the implementation in the standard library is spread over quite a few files. Checking if it is initialized is one part. There may also be some indirection, because it has to cross a crate boundary. And at some point (on Unix) it uses `libc::pthread_getspecific(key)` to read from TLS; I suppose that is responsible for looking up the pointer that belongs to a key.
> @pitdicker your `ParallelIterator` code uses `rng: R::from_rng(&mut self.rng).unwrap()` which does a full initialisation of the new RNG — when using a PRNG like `Hc128Rng` this is slow.
I ended up with a better scheme using splitting. But I only mentioned it in the context of many worker threads using a fast RNG.
> - Above, but also add `fast_rng`. `fast_rng` may be predictable. It might be optimised for low-memory requirements and serial usage or might be optimised more for SIMD, or it might simply use `thread_rng` internally; this is up to us (and might be configurable).
The main problem is that some sort of `fast_rng` that works with TLS will only be fast if you can cache it. I suppose there can be a use for it, but it is not easy to use it in a way that is really faster than `thread_rng()`.
I am afraid it can be confusing/disappointing for users. If you don't know how things work, the current `thread_rng()` can seem sort of magical: you can get a random number seemingly out of nowhere, or use the other `Rng` methods.
But to use `fast_rng()` right, you have to realize you are getting a PRNG. In a way it is the same as cheaply initializing a PRNG. From that point on you have to handle the state yourself and pass it around, if you really want to have the benefits of a faster RNG.
When you get to that point, isn't it just as easy to seed a small PRNG of your own choosing, possibly from `thread_rng()`?
Not sure what I want to say really :smile:. I'm just not sure if we can do something meaningful to offer a fast PRNG in combination with TLS.
> - Above, but make `thread_rng` configurable, but only with CSPRNG implementations (e.g. allow ChaCha for less memory but slower).
I am not against this, but I also don't really see the problem of using more memory. Do you really think there are situations where we have TLS but don't have abundant memory?
> - Keep the `thread_rng` API roughly as it is now (but still allowing #404 if we want that)
We are really not in such a bad state, in my opinion (although the `Rc` can go...). #404 does no harm, but also offers no advantage.
Having a single, strong, fast `thread_rng` seems preferable to having a faster, weaker `fast_rng`.
Is there any reason not to make the choice of RNG backing `ThreadRng` platform-dependent? Then `ThreadRng` could use AES on CPUs with AES-NI (basically every x86 produced this decade) and ChaCha20 when hardware-accelerated AES is unavailable.
I would definitely recommend having a hardware-accelerated AES-based RNG on platforms with hardware AES, gated on runtime feature detection. These can be implemented using a simple abstraction: the AES encryption round function alone, e.g. `aesenc` on the Intel family (no need for `aeskeygenassist`), or `AESE` on ARMv8+.
There are any number of options for a cipher-based CSPRNG to choose from, all of which are fully parallelizable/vectorizable. A general theme among these is AES-CTR with a periodic rekeying mechanism (see also RDRAND). Here's a specific, recent example of such an RNG: https://blog.cr.yp.to/20170723-random.html
Regarding RDRAND specifically, I think it's perfectly fine for the case of e.g. seeding SipHash to mitigate hashDoS, but probably not the best option for some sort of general-purpose `CryptoRng`.
I would definitely recommend ChaCha as a pure-software option which should work everywhere, with ChaCha20 as a paranoid default; it could arguably be reduced to ChaCha12 (the best known attack on ChaCha only works up to 7 rounds, so 20 rounds is a very large safety margin). ChaCha can be trivially accelerated using e.g. AVX2-style instructions or other ubiquitously available vector-unit primitives (add, rotate, xor) and has a small code size.
I'm certainly not wild about things like HC-128 or ISAAC. The former saw some analysis via eSTREAM, but ChaCha20 (and its predecessor Salsa20) have seen considerably more, as ChaCha20 is an MTI cipher for TLS. It definitely sounds like there's a bit of a cipher zoo going on, and I'd strongly suggest reducing the number of ciphers unless there are very strong technical arguments for doing otherwise.
FWIW, I ported a C implementation of DJB's suggestion to Rust: https://github.com/vks/aesrng
ChaCha20 (and even ChaCha8) is a lot slower than HC-128, which is why we went with that option. We considered HC-128 acceptable due to the eSTREAM recommendation, and decided to remove ISAAC from Rand proper (this hasn't happened yet but will).
Making use of hardware AES in `thread_rng` where available sounds reasonable, though it is only really useful if either it's significantly faster than HC-128 or there are security concerns regarding HC-128. Preferring better-reviewed algorithms where they are fast is also reasonable.

I don't plan to work on this myself, but PRs are welcome.
> ChaCha20 (and even ChaCha8) is a lot slower than HC-128
That's a bit surprising to hear, considering ChaCha8 and ChaCha12 are consistently faster than HC-128 across multiple architectures on SUPERCOP (with ChaCha20 often beating it as well): https://bench.cr.yp.to/results-stream.html

(That said, ChaCha8 has essentially no safety margin, and I wouldn't recommend it. ChaCha12 is a happy medium between that and ChaCha20's paranoia.)
> That's a bit surprising to hear, considering chacha8 and chacha12 are consistently faster than hc128 across multiple architectures on SUPERCOP (with chacha20 often beating it out as well)
Maybe our implementation needs to be optimized/vectorized.
If ChaCha can be optimised to compete, that would be great. It uses a lot less memory and initialisation time (I guess this is why those benches show HC-128 as terrible on short sequences).
> That's a bit surprising to hear, considering chacha8 and chacha12 are consistently faster than hc128 across multiple architectures on SUPERCOP (with chacha20 often beating it out as well)
I have wondered about those benchmarks before. Some of those HC-128 benchmarks are even off by two orders of magnitude! Maybe they also count initialization time, combined with a terrible implementation? And use an implementation of ChaCha that is much faster than anything available in Rust at the moment?
I also tried the `chacha` crate, and it seems to be faster than our implementation, but still 3 times slower than HC-128.
There isn't a particularly good implementation of ChaCha in Rust right now that I know of (which is why I plan on writing one soon).
I wrote an explicitly vectorized ChaCha4 implementation for my `simd_prngs` repo, but it's still very slow.
I guess I missed the previous discussion of Randen, but it looks like a very nice option for platforms where hardware AES is available:
Yes [you did]: #462
Edit to clarify: Randen looks like a good RNG, but I'd want to see third-party cryptographic review before promoting it here.
@dhardy I think this can be closed, now that we use a SIMD-optimized implementation of ChaCha?
I guess. We still haven't examined Randen or forward secrecy, but these are separate topics.
Status: proposal to allow hardware-dependent generators and replace HC-128 with a faster variant of ChaCha (start reading here).
This topic comes up quite a bit from various angles; I think it's time to get some ideas down about the future of `thread_rng` (regarding the 0.6 release or later).

I see the following potential uses for `thread_rng`:

Also, I think it's worth mentioning where we do not expect `thread_rng` to be used:

- `thread_rng` may not be the fastest option

And an important note on security: we should aim to provide a secure source of random data, but ultimately it is up to users to decide how much they trust our implementation and what their risks are.
`thread_rng` does not have the simplest code to review and is currently young and subject to further change. Also, we may or may not implement forward secrecy (backtracking resistance), and for ultimate security, solutions using no local state may be preferred.

Our current implementation of `thread_rng` tries to satisfy the above with a fast, secure PRNG, but at the cost of high memory usage and initialisation time per thread. For applications with a low number of long-running threads this is reasonable, but for many worker threads may not be ideal.

There are two ways we can let users influence the implementation:

- via the API (configure `thread_rng` or call a different function)
- via feature flags

Feature flags allow configuration on a per-application basis, e.g.
The last two options sound very risky to me — should we ask distributors and end-users to reason about the security of whole applications? It is quite possible that the people building applications — even developers — will not know about all uses of `thread_rng` requiring secure randomness.

This brings me to ask: is having only a single user-facing function ideal? What if instead:

- `strong_rng` replaces `thread_rng` as a source of cryptographically secure randomness
- `weak_rng` is added as a fast source of randomness; depending on feature flags this could just wrap `strong_rng` or could be independent

An advantage of the above is that feature flags could allow replacing the current implementation (HC-128; 4176 bytes) with two smaller back ends (e.g. ChaCha + PCG; 136 + 16 bytes), while only compromising the speed of the secure generator.
Another advantage is that we could add forward secrecy to `strong_rng` with less concern for performance implications.

But first, why bother when generators like Randen and RDRAND claim to satisfy all requirements anyway? This is a good question to which I only have vague answers: Randen is still new and unproven and may have portability issues; RDRAND is not fully trusted and may not be the fastest option.

Second, what about users like `HashMap`, where weaknesses are often not exploitable (depending on application design and usage) and in the worst case only allow DoS attacks (slow algorithms)? Another good question. One possible answer is that these use-cases should use `weak_rng`, but by default this would be secure anyway; we provide feature flags to change that but discourage usage on servers. It might seem tempting to add a third function, but, frankly, this kind of thing is probably the main use case for `weak_rng` anyway.
anyway.Another, very different, option is that we keep
thread_rng
looking like it is but removeCryptoRng
support and recommend it not be used for crypto keys. Then we can add a feature flag changing its implementation to an insecure generator with less concern. This may be a good option, but goes against our recent changes (switching to HC-128 and implementingCryptoRng
).BTW, lets not bikeshed
thread_rng
vsThreadRng::new
or other syntax here.