Consider Fast Loaded Dice Roller for `WeightedIndex`

vks commented 4 years ago

It seems to outperform the alias method (and others), being close to the theoretical optimum of entropy use. According to the paper, it is faster than the alias method (if the latter is modified to produce exact samples), while using several orders of magnitudes less preprocessing time. Maybe it can replace our current algorithms in Rand?

There is a reference implementation in C and Python.

vks commented 4 years ago

I played a bit with the algorithm, but I could not make it perform as well as our alias-method implementation.

The algorithm consumes the random numbers bit-by-bit. On the one hand, this avoids throwing away random data, on the other hand, the approach does not buy much, because generating random bits is cheap with modern generators. I suspect the results in the paper are with a slow RNG like the Mersenne Twister.

There is also a similar Fast Dice Roller algorithm for uniform sampling, which would also save a lot of random bits for small ranges. However, in practice it seems to be much slower with a fast RNG.

peteroupc commented 4 years ago

Vks wrote:

I played a bit with the algorithm, but I could not make it perform as well as our alias-method implementation.

The algorithm consumes the random numbers bit-by-bit. On the one hand, this avoids throwing away random data, on the other hand, the approach does not buy much, because generating random bits is cheap with modern generators. I suspect the results in the paper are with a slow RNG like the Mersenne Twister.

There is also a similar Fast Dice Roller algorithm for uniform sampling, which would also save a lot of random bits for small ranges. However, in practice it seems to be much slower with a fast RNG.

@fsaad, can you please comment?

fsaad commented 4 years ago

Thanks for trying out FLDR.

I played a bit with the algorithm, but I could not make it perform as well as our alias-method implementation.

@vks Does your alias implementation use exact sampling or floating-point approximations of real numbers? The runtime comparison, both theoretical and practical, between FLDR and alias sampling is with respect to error-free sampling. If the alias implementation uses floating-point approximations of real numbers then there is not much that can analyzed in terms of performance with FLDR, since an inexact alias implementation typically trade-offs error with runtime, whereas FLDR is designed to always be error free.

If you have an error-free alias sampler, I would be really interested to explore its performance as compared to FLDR, and I would be quite surprised if FLDR is slower (especially on low to medium entropy distributions), both for preprocessing and sampling.

The algorithm consumes the random numbers bit-by-bit. On the one hand, this avoids throwing away random data, on the other hand, the approach does not buy much, because generating random bits is cheap with modern generators.

While the FLDR paper uses a simple (naive) serial implementation of the sampler (Alg 5), there are many other ways to implement that leverage alternate data structures, which include linear encodings of the DDG sampling tree (Algs 4--7 Saad et al. 2020), bit-slicing for constant time generation (Sec 3, Karmakar et al. 2018), and batch generating with bit recycling (Alg 3, Devroye & Gravel 2018), among many others.

The choice of implementation depends on the constraints of the application. For example, lazy DDG sampling is typically far more efficient as compared to floating-point arithmetic when implementing the sampler directly on hardware (e.g., an FPGA). While FLDR is characterized by its theoretical near-optimal entropy use, many of these techniques can be used in practice to push down the constant factors of runtime (or memory).

There is also a similar Fast Dice Roller algorithm for uniform sampling, which would also save a lot of random bits for small ranges. However, in practice it seems to be much slower with a fast RNG.

The FDR of Lumbroso 2013 is an exact uniform sampler. Unlike FLDR, the FDR is theoretically optimal (for uniform sampling), so it should be superior to any other exact uniform sampler, though I agree that non-exact samplers can be faster in practice.

Overall, I think it is very valuable to compare the performance of exact and approximate sampling, as your ticket may point toward. The status quo of most existing software libraries is inexact sampling with floating-point. FLDR is an initial step toward showing that exact sampling can be practical, both theoretically and for applications, for general discrete distributions. There is of course a lot more work to do to explore the trade-offs with inexact samplers.

dhardy commented 4 years ago

The code for our weighted alias implementation is here. IIRC it uses exact sampling, with the requirement that the sum of weights can be represented by the weight type. Weight type may be an integer or floating-point or a user-defined type; the latter implies that rounding can occur, depending on the user's choice of types.

In many benchmarks however I found the alias method a worse choice than the binary search method due to the set-up time, hence the latter is our default (and also exact if used with integer types).

vks commented 4 years ago

Does your alias implementation use exact sampling or floating-point approximations of real numbers?

I don't think we use floating-point when the weights are integers. I did not look into the details, but I doubt we use the alias method documented in the paper:

(vi) exact alias sampler (Walker, 1977), using entropy-optimal uniform and Bernoulli sampling (Lumbroso, 2013) and the one-table implementation (Vose, 1991).

In particular, we are not using the FDR for uniform sampling, which would be twice as slow.

Here are the benchmark results:

weighted_u32              :      12,472 ns/iter (+/- 308)
weighted_large_set        :      72,129 ns/iter (+/- 2,319)
weighted_new              :      39,283 ns/iter (+/- 876)

weighted_alias_u32        :      11,029 ns/iter (+/- 327)
weighted_alias_large_set  :      12,318 ns/iter (+/- 366)
weighted_alias_new        :     261,343 ns/iter (+/- 7,985)

weighted_fldr_i32         :      17,420 ns/iter (+/- 365)
weighted_fldr_large_set   :      29,980 ns/iter (+/- 671)
weighted_fldr_new         :     935,315 ns/iter (+/- 26,667)

I also implemented the FDR uniform sampling. It's a bit awkward to implement with our API, so I'm not sure it's optimized as well as it could be.

uniform     :       2,032 ns/iter (+/- 119) = 492 MB/s
uniform_fdr :       4,373 ns/iter (+/- 212) = 228 MB/s

The code is here.

peteroupc commented 4 years ago

For your information there are numerous data structures and algorithms for weighted sampling with replacement, besides the Fast Loaded Dice Roller, binary search, and the alias method, and I list a great number of them. Perhaps it would be a good idea to compare their performance.

fsaad commented 4 years ago

@vks Thanks a lot for the code. I'm not an expert in Rust, but it seems that there are many ways that this FLDR implementation can be optimized. One immediate thought is that, in your current implementation, the PRNG is always called once per sample, i.e., https://github.com/rust-random/rand/blob/e2b4b1259b269a80298508b7300f8ba95aca33da/rand_distr/src/weighted_fldr.rs#L85

In our reference implementation, we separately store a "global" buffer of 31 random bits (in C, rand() produces a positive 32 bit signed integer, so only 31 of the bits are random), which is the function "flip()" in the following file:

https://github.com/probcomp/fast-loaded-dice-roller/blob/master/src/c/flip.c

The main FLDR sampler then calls "flip()" to obtain a fresh bit:

https://github.com/probcomp/fast-loaded-dice-roller/blob/2fb32e94a04c744c86aa4cb54b3ee0259894e777/src/c/fldr.c#L93

which may or not may not involve a call to the PRNG, depending on whether the buffer is empty. This approach ensures that we do not spuriously generate random bits. If FLDR is implemented to use a separate buffer that is shared across calls to "sample", instead of calling the PRNG once per sample, then I do expect the runtime should improve in many regimes.

vks commented 4 years ago

@fsaad Thanks for having a look!

I'm not an expert in Rust, but it seems that there are many ways that this FLDR implementation can be optimized. One immediate thought is that, in your current implementation, the PRNG is always called once per sample, i.e.,

Fair enough, that is a very good point! Unfortunately, implementing this is a bit tricky, because our API requires sample to not modify the distribution object. There are the following solutions:

Add some API for sampling with modifying the distribution object.
Use Cell. This means that the distribution object can no longer be shared by threads.

Going with 2. seems to have an adverse effects on the performance for large sets, while improving it for small sets:

weighted_fldr_i32         :      12,733 ns/iter (+/- 53)
weighted_fldr_large_set   :      36,303 ns/iter (+/- 536)
weighted_fldr_new         :     911,047 ns/iter (+/- 17,077)

newpavlov commented 4 years ago

Personally I don't like option 2. Can we implement Distribution<T> only for &mut reference of a type?

dhardy commented 4 years ago

Distribution::sample is immutable with respect to the distribution for good reasons: (1) API simplicity, (2) basically all distributions have fast (enough) immutable implementations and (3) because it places fewer restrictions on usage.

It's going to take quite a good reason to convince me that we should change that. Especially since the benchmarks so far have the alias-method winning everywhere (especially in set-up, where I already thought the alias method was slow).

fsaad commented 4 years ago

Distribution::sample is immutable with respect to the distribution for good reasons

I agree. Does your API For Rng support efficient and lazy production of random bits one at a time, for example by storing an internal buffer (using a similar method to the "global" buffer approach)? That way, sample can leverage mutability with respect to rng.

the benchmarks so far have the alias-method winning everywhere (especially in set-up, where I already thought the alias method was slow).

Do the benchmarks use test distributions with varying entropy, i.e., for a distribution with n outcomes, entropies from 0 to log_2(n)?

The average runtime of sampling in FLDR is a direct function of the entropy. The runtime of preprocessing also varies with the values of probabilities in the test distribution. The point about set-up is surprising. Do you have a pointer to the benchmarks and the measurements of setup runtime?

dhardy commented 4 years ago

Does your API For Rng support efficient and lazy production of random bits one at a time

Good question. No. Theoretically it could, even using different approaches depending on the RNG, but we chose not to do that. For simpler PRNGs we use no buffer at all and for block-based PRNGs we use a block which we read off one word (32-bit) at a time.

Another factor to consider (for both mutable distribution and buffered RNGs) is reproducible behaviour. If we allowed out-of-order usage of random bits/bytes it becomes much harder to reason about and especially document the behaviour, thus making reproducibility hard. If we don't allow this, then every read has to either be compatible with misaligned buffers (slow) or check and clear up to the next alignment point (also slow). At this point we're no longer optimising one RNG+distribution pair but expected usage of the RNG across a number of distributions.

I suppose we could side-step that problem by having a special RNG optimised for bit- or byte-sized reads, though that's another step in complexity.

@vks which RNG did you use for these benchmarks?

vks commented 4 years ago

Another factor to consider (for both mutable distribution and buffered RNGs) is reproducible behaviour. If we allowed out-of-order usage of random bits/bytes it becomes much harder to reason about and especially document the behaviour, thus making reproducibility hard.

Aren't we having this problem already when mixing fill_bytes and next_u64?

I suppose we could side-step that problem by having a special RNG optimised for bit- or byte-sized reads, though that's another step in complexity.

In principle, we could have a wrapper type implementing the buffering. It can then be used as needed.

For example, we could just add a bit: u8 field to BlockRng and a gen_bool method that uses it.

@vks which RNG did you use for these benchmarks?

It's just our default benchmarks, they use Pcg64Mcg.

Do you have a pointer to the benchmarks and the measurements of setup runtime?

See here.

newpavlov commented 4 years ago

For example, we could just add a bit: u8 field to BlockRng and a gen_bool method that uses it.

I think we can use a single counter for used bits, which would be incremented by 8 for read byte, 32 for u32, etc. We would need to perform additional computations to correctly align read point (e.g. if index is equal to 9, we would have to align it to 16 for reading byte and to 64 for reading u32). It would degrade performance a bit, but I am not sure if it will be noticeable.

As an advantage this approach will be less wasteful (and thus more performant) for users using primarily fill_bytes. But it would change behavior of BlockRng between releases, which may create problems for users who rely on reproducibility.

riking commented 3 years ago

Re compromising performance on pure u64: I think you can reduce that cost to a single branch.

The algorithm consumes the random numbers bit-by-bit. [...] reproducible behavior

First, we discard this as a design constraint when the application is making RNG dependent range requests. I do not believe any application should expect seeded RNG stability under a reordering of range request sizes.

Deterministic behavior for a seeded RNG under a deterministic request stream should be preserved.

(Note: it is theoretically possible for this to weaken the stream somehow, when considered as a complete system with the application. Probably a fun research paper or game speedrun investigation for someone.)

The "one branch" algorithm is then simple: for any "large enough" requests, bypass or discard the buffer.

Any requests for over half the source RNG bit length (>32.0 bits) should bypass the buffer, as they are unlikely to be followed by small enough requests. Note that 63 and 53 bits are common request lengths; there is little to no profit to be had in attempting to merge leftover bits there.

Any requests for more bits than are contained in the buffer (e.g. buffer has 2 bits remaining, request wants 6) should discard the remaining buffer instead of attempting to merge the leftovers.

elidupree commented 3 years ago

Hang on, why would we need to compromise performance at all for existing use cases? Requests for a full u64 would usually know at compile time that they want a full u64, so they could just make a method call that gets a random u64 from the backing RNG rather than asking for bits from the "spare bits buffer" at all.

I'm currently fiddling with some code that keeps just such a "spare bits buffer", and is built entirely as a wrapper around a source Rng, without having to change the Rng interface at all. For generating small uniform ranges, my wrapper around ChaChaRng is able to outperform using ChaChaRng directly. (On the other hand, it's hard to beat Pcg64Mcg. On the third hand, I haven't put a lot of work into optimizing it.)

dhardy commented 3 years ago

@elidupree see notes about reproducibility in this comment.

@riking I'll reply in #1147.

elidupree commented 3 years ago

I saw that comment, but I don't understand it. Let me ask some questions to clarify my understanding: If an Rng implementor is deterministic and platform-independent, and my code that keeps a bit buffer is a deterministic, platform-independent wrapper around the Rng, then how could the combination possibly be less reproducible than the Rng we started with? From the wording "out-of-order usage of random bits/bytes", my only guess about what you mean is that you're talking about getting different results if you switch the order between a request for bits and a request for a full u64, but that can't be right, because an Rng always gives different results if you switch the order between any two requests for numbers...

dhardy commented 3 years ago

Besides non-determinicity there are a few possible sources of non-portable results:

LE vs BE architectures when casting between [u8] and u32 etc.
library versions: an update may change results (we disallow this in patch releases)
std trig & exp functions use libm or equivalent, which is not identical across platforms (though this only affects some distributions, not RNGs)

But this is also about documenting behaviour...

because an Rng always gives different results if you switch the order between any two requests for numbers

Is this actually the case (for two different requests) given an RNG using a buffer for partially-consumed generated values? There is nothing in the RngCore doc requiring this, or whether inserting an extra request will influence future output.

elidupree commented 3 years ago

Besides non-determinicity there are a few possible sources of non-portable results:

Those are possible sources of non-portable results in general, but I still don't see how any of them would cause difficulties for implementing a gen_bits API. Endianness could be an issue if you implement it using casts, but my quick-and-dirty implementation isn't endianness-dependent (it just uses bit ops to extract the desired number of bits); I don't see how this is different from any other situation where we accept a small performance cost to guarantee platform-independence.

Is this actually the case (for two different requests) given an RNG using a buffer for partially-consumed generated values? There is nothing in the RngCore doc requiring this, or whether inserting an extra request will influence future output.

Sorry, my wording was imprecise. Of course it's not guaranteed that the values will be different (for one thing, they could be the same due to luck!) What I'm saying is that the Rng traits already do not provide any guarantee that results will be identical in any situation other than an identical sequence of requests (with identical order). I suppose implementors are permitted to provide additional guarantees (I've never heard of a trait that forbids implementors from providing additional guarantees), but is there any specific implementor that does?

dhardy commented 3 years ago

Sorry, what I wrote was besides the point. We have discussed in the past whether RNGs may discard unused bits/bytes (to which the conclusion was yes, though some attempts have been made not to do this, most recently #1031). I believe we also discussed whether an RNG may return bits out-of-order from when they are generated (using a buffer for some outputs but not others), but I don't recall and can't find the exact conclusion here. I thought we had decided against this (i.e. forcing clearing of buffered bits whenever generating fresh output), but do not see that documentation. (It's also not exactly "portability" and more "how to reproduce output given an RNG algorithm and sequence of requests".)

newpavlov commented 3 years ago

We have discussed in the past whether RNGs may discard unused bits/bytes (to which the conclusion was yes, though some attempts have been made not to do this, most recently #1031

Note that code in that PR also discards bits for performance reasons (we do not want to copy u64s with bit-level granularity). It will not discard bits only if you continuously use the gen_bool method. But if user calls next_u64 with bit-level counter equal to 65, it will get rounded to 128 (to be exact, to 2 with u64-level counter).

dhardy commented 2 months ago

We still do not support single-bit random value generation (see #1261 for related proposals), and I don't think it's very likely that we will. We also do not really have an interest in supporting mutable "distribution" objects. Therefore I think it's unlikely we will pursue the suggested FLDR algorithm.

rust-random / rand

Consider Fast Loaded Dice Roller for `WeightedIndex` #1014