idontgetoutmuch / random

Random number library

Clusivity, coverage and performance #113

Open curiousleo opened 4 years ago

curiousleo commented 4 years ago

Recent discussions about pseudo-random floating point numbers have revealed potential shortcomings of the current API in two aspects.

Clusivity describes whether a range is inclusive or exclusive in its bounds.

Coverage describes how many representable values in a range can be reached.

Integral types

For integral types, coverage is not a problem: all Uniform and UniformRange instances for integral types we currently have generate all representable values in the relevant ranges.

Clusivity is also simpler for integral types: it is straightforward to express exclusive ranges in terms of inclusive ranges and vice versa. Concretely, (a, b) == [a+1, b-1], for example.
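To make the conversion concrete, here is a sketch; exclusiveToInclusive is a hypothetical helper, not part of the random API:

```haskell
-- Sketch of the conversion above; 'exclusiveToInclusive' is a hypothetical
-- helper, not part of the library. Integer sidesteps fixed-width
-- over/underflow; for Int or Word the a+1 / b-1 steps need a bounds check.
exclusiveToInclusive :: (Integer, Integer) -> Maybe (Integer, Integer)
exclusiveToInclusive (a, b)
  | a + 1 <= b - 1 = Just (a + 1, b - 1)  -- (a, b) == [a+1, b-1]
  | otherwise      = Nothing              -- exclusive range is empty, e.g. (3, 4)
```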

Non-integral types

For non-integral types, full coverage for arbitrary ranges is difficult. The only project I know of that achieves this is rademacher-fpl.

We can improve coverage by generating a floating point number in the unit interval with full coverage and then translating it into the target range. This is the approach taken in #102, more details in https://github.com/idontgetoutmuch/random/issues/105.

The performance overhead to improve coverage in this way is small but non-zero, see #102.

Clusivity is also more complicated for non-integral types.

We have [a, b) == (a, b] + x and (a, b] == [a, b) - x for some x. To go from an inclusive-inclusive range to any other range, I believe that rejection sampling is necessary.

In addition, I know of no simple method (without full coverage) that generates an inclusive-inclusive range.
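One way to picture the rejection-sampling route: given a full-coverage generator for a closed interval, other clusivities can be derived by re-drawing whenever an excluded endpoint comes up. A sketch, where uniformClosed is an assumed closed-interval generator rather than an existing function in random:

```haskell
import Data.Functor.Identity (Identity (..))

-- Derive [lo, hi) from a closed-interval generator by rejection.
-- 'uniformClosed' stands for any monadic draw from [lo, hi]; it is an
-- assumption here, not library API.
uniformHalfOpen :: Monad m
                => (Double -> Double -> m Double)  -- draws from [lo, hi]
                -> Double -> Double -> m Double    -- draws from [lo, hi)
uniformHalfOpen uniformClosed lo hi = go
  where
    go = do
      x <- uniformClosed lo hi
      if x == hi then go else pure x  -- re-draw on the excluded endpoint
```

With a degenerate "generator" that always returns lo, this terminates immediately; with a real generator it loops only when the endpoint is actually drawn.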

To summarise which methods generate which clusivity:

As a result, clusivity and coverage appear to be linked for non-integral types.

Proposals and questions

Q: Do users benefit from being able to control coverage? Reduced coverage in itself is never useful, but some users may want the performance boost. See #102 for benchmark results.

https://github.com/idontgetoutmuch/random/pull/104 as it currently stands exposes clusivity in the API.

https://github.com/idontgetoutmuch/random/pull/104#discussion_r412742939 (snippet no. 2) suggests a more general way to parameterise ranges. AFAICT this could be used to expose different coverages too, as well as more general "intervals" like open / closed complex disk.

Shimuuar commented 4 years ago

Clusivity

With non-inclusive intervals, the problem of empty intervals arises immediately: (a,a), [a,a) and (a,a] all contain no values. So all a generator could do is throw an exception when it encounters such a range. Care should also be taken to avoid over/underflows.

With floating point numbers, the same approach could be used: map (a,b) to [next_representable a, prev_representable b], with maybe a special case for zero and/or denormalized numbers.

Coverage

Is full coverage necessary? Even for floats, the smallest normal number is 1.17549435e-38. If generating a number takes 1ns, it will be generated with probability ~1e-21 per CPU·year. For doubles, we're talking about the heat death of the Universe a googol times over. I think it's quite sensible to truncate coverage at some point.
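For reference, a back-of-envelope restatement of this estimate:

```haskell
-- Drawing a uniform value from [0,1) once per nanosecond, how often do we
-- expect a result at or below the smallest normal Float (1.17549435e-38)?
drawsPerCpuYear :: Double
drawsPerCpuYear = 365.25 * 24 * 3600 * 1e9   -- ~3.16e16 draws

hitsPerCpuYear :: Double
hitsPerCpuYear = 1.17549435e-38 * drawsPerCpuYear  -- ~3.7e-22, i.e. ~1e-21
```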

curiousleo commented 4 years ago

With non-inclusive intervals, the problem of empty intervals arises immediately: (a,a), [a,a) and (a,a] all contain no values. So all a generator could do is throw an exception when it encounters such a range. Care should also be taken to avoid over/underflows.

Yep, those seem like important points to consider.

With floating point numbers, the same approach could be used: map (a,b) to [next_representable a, prev_representable b], with maybe a special case for zero and/or denormalized numbers.

I am not sure this holds in the strange world of floating point numbers. Here's my thinking:

The current approach is to take the unit interval and translate it into the target interval, i.e. y = a + x * (b - a) where x ∈ [0,1]. The assumption is that y ∈ [a, b] as a result (the more I think about floating point numbers, the less sure I am of basic things like this!)

Now letting b' = prev_representable b, is it the case that y' = a + x * (b' - a) makes y' ∈ [a, b)?

It is possible that b' - a == b - a if the exponents of a and b differ. This would make y' == y (given the same x). As a result, we don't get the intended range.

I really hope I'm wrong, but to the best of my knowledge, that's how weird floating point numbers are. If this is correct, it makes changing clusivity of floating point ranges a pain.
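A sketch of one such case (my construction, using the bit-cast functions from GHC.Float; prevFloat is a hypothetical helper): pick a negative with much larger magnitude than b, so that b - a and b' - a round to the same Float.

```haskell
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- prev_representable for positive, finite, non-zero Floats
prevFloat :: Float -> Float
prevFloat x = castWord32ToFloat (castFloatToWord32 x - 1)

a, b, b' :: Float
a  = -(2 ^ (30 :: Int))  -- exponent far from b's
b  = 1.0
b' = prevFloat b         -- 0.99999994

-- Both differences round to the same Float (2^30), so y' == y for every x.
sameDifference :: Bool
sameDifference = (b - a) == (b' - a)
```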

Is full coverage necessary? Even for floats, the smallest normal number is 1.17549435e-38. If generating a number takes 1ns, it will be generated with probability ~1e-21 per CPU·year. For doubles, we're talking about the heat death of the Universe a googol times over. I think it's quite sensible to truncate coverage at some point.

In #105, I've run a few experiments to demonstrate that the coverage of simple algorithms is really poor, e.g. the one we currently use only hits 8e6 out of 1e9 representable floats in the unit interval. Depending on how the randomness-consuming code is set up, it may not take all that long to notice that it's always getting the same 8e6 numbers.

Shimuuar commented 4 years ago

I am not sure this holds in the strange world of floating point numbers. Here's my thinking:

The current approach is to take the unit interval and translate it into the target interval, i.e. y = a + x * (b - a) where x ∈ [0,1]. The assumption is that y ∈ [a, b] as a result (the more I think about floating point numbers, the less sure I am of basic things like this!)

Excellent question! I think it should work as long as r <- uniformR [a,b] ⇒ a <= r <= b (square brackets signifying an inclusive interval), which in turn should hold as long as a + (b - a) = b, which I think should hold. But I can't be sure off the cuff.

In #105, I've run a few experiments to demonstrate that the coverage of simple algorithms is really poor

The operative word is full coverage. The question is how much coverage we need. The number of reachable values is a poor metric, I think. Most representable numbers lie in a small neighborhood of 0 and won't be sampled in any reasonable time frame anyway. One possible metric is how likely it is that we're able to notice sampling bias.

If we take the constant-exponent method, then the whole [0,1] interval has incomplete coverage!

If we take mwc-random's method, we have 2^32 uniformly distributed values. We have 9 extra bits, so only 2^{-9} ≈ 2e-3 of the unit interval is undersampled.

peteroupc commented 4 years ago

@christoph-conrads, please give your thoughts on this whole discussion.


As for this topic, I am also aware of the following:

lehins commented 4 years ago

So it seems that uniformRM is a function that can fail unless both bounds of the interval [a,b] are inclusive, and that is only because we agreed that range = if a > b then [b, a] else [a, b]

What I suggest is to add proper input validation and error handling for the UniformRange class by adding MonadThrow and throwing an appropriate exception when a requested range is not valid.

I expect this will add slight overhead to performance, but at least it will be correct.

Shimuuar commented 4 years ago

Yes, throwing an exception is probably inevitable where IEEE 754 is concerned. If the numbers mentioned are not enough, here is another gem: a + (b - a) * 0.3 = Infinity for a = -m_huge, b = m_huge (m_huge being the biggest representable number). So it should be fine to throw for invalid integral ranges as well.
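The gem reproduces readily (a sketch; m_huge is written out as a literal, since base has no named constant for the largest finite Double):

```haskell
-- a + (b - a) * 0.3 overflows: b - a is already Infinity in Double.
mHuge :: Double
mHuge = 1.7976931348623157e308  -- largest finite Double

gem :: Double
gem = a + (b - a) * 0.3
  where
    a = -mHuge
    b = mHuge
```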

Another question is whether intervals (a,b) with a > b should throw or be interpreted as (b,a). Such flipping was introduced in order to make uniformRM defined for all inputs, but since it can throw anyway, it becomes less important. I don't have a definite opinion on this matter.

Shimuuar commented 4 years ago

as long as a + (b - a) = b which I think should hold. But I can't be sure off the cuff

One should, of course, think less and throw such problems at QuickCheck more. Even the easiest test (0 < a < b) fails:

{-# LANGUAGE ViewPatterns #-}

import Test.QuickCheck

prop1 :: Float -> Float -> Property
prop1 (abs -> a0) (abs -> b0)
  = id
  $ counterexample ("a  = " ++ show a)
  $ counterexample ("b  = " ++ show b)
  $ counterexample ("b' = " ++ show (a + (b - a)))
  $ a + (b - a) == b
  where
    a = min a0 b0
    b = max a0 b0

And here is failure where result may go out of the interval:

*** Failed! Falsified (after 41 tests and 5 shrinks):     
a  = 4.2
b  = 20.6
b' = 20.600002

I think it's time to read something about linear interpolation and

lehins commented 4 years ago

We get a bit of a chicken-and-egg problem with this approach, since QuickCheck uses random-1.1. Also, QuickCheck's Gen Float doesn't generate all the useful floating point values: extremes like NaN, +/-Infinity and -0 never come up. So we do need to think more.

One should, of course, think less and throw such problems at QuickCheck more

In general, though, randomized testing is a great tool and with all the work we are doing here we'll make it even better ;)

curiousleo commented 4 years ago

@Shimuuar wrote:

If we take the constant-exponent method, then the whole [0,1] interval has incomplete coverage!

That's exactly right - hence the proposal to do away with the constant exponent method (word32ToFloatInUnitInterval in #105).

If we take mwc-random's method, we have 2^32 uniformly distributed values. We have 9 extra bits, so only 2^{-9} ≈ 2e-3 of the unit interval is undersampled.

I'm not sure what you mean by "we have 2^32 uniformly distributed values". Out of the 2^30.0 (1065353218) representable Floats in the unit interval, mwc-random's wordToFloat reaches 2^24.6 (25165825). See #105.

curiousleo commented 4 years ago

@lehins wrote:

What I suggest is to add proper input validation and error handling for the UniformRange class by adding MonadThrow and throwing an appropriate exception when a requested range is not valid.

Do you have a suggestion for how failure could be handled in the pure API for UniformRange, the current uniformR?

uniformR :: (RandomGen g, UniformRange a) => g -> (a, a) -> (a, g)
Shimuuar commented 4 years ago

I'm not sure what you mean by "we have 2^32 uniformly distributed values". Out of the 2^30.0 (1065353218) representable Floats in the unit interval, mwc-random's wordToFloat reaches 2^24.6 (25165825). See #105.

mwc-random's method generates numbers of the form n/2^32 with 0 ≤ n < 2^32. Those are distributed uniformly. The catch is that many of them map to a single Float.
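The collapse is easy to observe directly (a sketch; toUnit is a stand-in for the n/2^32 mapping, not mwc-random's actual code):

```haskell
import Data.Word (Word32)

-- n / 2^32 as a Float. Already the conversion of n to Float rounds to 24
-- significant bits, so runs of consecutive n collapse onto one result.
toUnit :: Word32 -> Float
toUnit n = fromIntegral n / 4294967296

collision :: Bool
collision = toUnit maxBound == toUnit (maxBound - 100)  -- both give 1.0
```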

We get a bit of a chicken and egg problem with this approach,

Why? When it comes to testing hypotheses, QC could use specially trained coin-flipping hamsters and it would work. The point is that QC is very good at demolishing unfounded assumptions.

christoph-conrads commented 4 years ago

The current approach is to take the unit interval and translate it into the target interval, i.e. y = a + x * (b - a) where x ∈ [0,1]. The assumption is that y ∈ [a, b] as a result (the more I think about floating point numbers, the less sure I am of basic things like this!)

a + x * (b-a) does not work in floating-point (FP) arithmetic. Period.

If you insist on this approach, you do not need to discuss intricacies of interval boundaries because the rounding error makes the discussion moot.

I really hope I'm wrong, but to the best of my knowledge, that's how weird floating point numbers are.

Floating-point numbers were designed to

This behavior is in no way helpful when computing random numbers. Instead, for random number generation, it's best to think of FP numbers as integer pairs on the real line with special spacing.

An FP number is composed of a sign bit, an exponent e and a significand (or mantissa) s. Because of the sign bit, FP numbers are distributed symmetrically around zero, so we can ignore the sign bit for this introduction. There is also an exponent bias, which you can ignore when computing random floats.

The figure below shows the ranges that are covered by FP numbers with different exponents; covered means that a real value from this range is best approximated by a float with the exponent shown. Zero is on the left-hand side, positive infinity is on the right-hand side, and the vertical bar denotes the start of an interval where the best FP approximation of a real value has exponent e.

All real numbers in [0, 2^0) are best approximated by FP numbers with exponent e = 0. This interval starts at the left-most vertical bar and ends before the next vertical bar. The real numbers in [2^0, 2^1) are best approximated by floats with exponent e = 1; this interval starts at the second vertical bar and ends just before the next one. The best FP approximation for real values in [2^1, 2^2) possesses exponent e = 2. In general, the best FP approximation for a real value in [2^(e-1), 2^e) has exponent e (if e ≥ 0).

  e=0 e=1 e=2     e=3             e=4
0 |---|---|-------|---------------|-------------------------------| (toward ∞)

Let's say you picked an exponent e and you know the sign. Now you have to draw a significand. Within the intervals covered by floats with exponent e, the FP numbers are equally spaced:

s=0 s=1 s=2 s=3 ... s=2^{s_max}-1
|---|---|---|---....|---|

What this means for random number generation:

Let's write FP numbers as the combination of sign, exponent, and significand they are. Let's draw random numbers in the half-open interval a = (+, 0, 0) (a represents zero) and b = (+, 0, 10). The calculated random number x must then have positive sign, exponent e=0, and a random significand 0 ≤ s < 10.

Let's say a = (+,0,0) (real value zero), b = (+, e, 0) (real-value 2^e ignoring the bias), half-open interval. We can now generate a float by

[Note that floor(log2(k)) can be computed quickly for integers using CPU instructions counting the leading zeros of integers.]

Based on the triple composed of sign, exponent, significand you can start deriving the rules for random number generation in arbitrary intervals.
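In Haskell, this triple view is available via base's decodeFloat, though with a different normalisation than the raw IEEE fields:

```haskell
-- Decompose a Float into the (sign, exponent, significand) triple discussed
-- above, using base's decodeFloat. Note: this e is not the raw biased IEEE
-- exponent; decodeFloat normalises so that x = m * 2^e with a 24-bit m.
floatTriple :: Float -> (Bool, Int, Integer)
floatTriple x = (x < 0 || isNegativeZero x, e, abs m)
  where
    (m, e) = decodeFloat x
```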

lehins commented 4 years ago

Do you have a suggestion for how failure could be handled in the pure API for UniformRange, the current uniformR?

@curiousleo Very simple, it also becomes monadic with MonadThrow constraint and the user then can choose to fail in Maybe, Either, IO etc.

uniformR :: (RandomGen g, UniformRange a, MonadThrow m) => g -> (a, a) -> m (a, g)
uniformR g r = runGenStateT g (uniformRM r)

In the case of randomR it would translate into a partial function of course, but that is OK in my opinion, since the Random class does not adhere to the full correctness that we set for the uniform functions anyway.

I created a draft PR that shows how MonadThrow can be utilized for range failure: #115

curiousleo commented 4 years ago

In the interest of keeping the discussion in one place, I'm responding to https://github.com/idontgetoutmuch/random/pull/102#issuecomment-620515471 here.


Your explanations are very much appreciated @christoph-conrads!

uses round-toward-zero in its mathematical model

I'm curious about this one. What's the motivation for / what are the consequences of this choice?

Intuitively it felt right for me if all significands have equal probability if the exponent is known. Now you have rounding toward zero.

Consequences:

  • Given the exponent, all significands are equally likely.

  • Signed and unsigned floats behave perfectly symmetrically (they would not with, e.g., rounding toward positive infinity)

  • You must treat +0 and -0 as different numbers (Yes, RFPL can draw from [-0,+0]).

  • Given real numbers a < b that can be represented exactly in floating-point arithmetic, the generator returns FP values as if it had randomly drawn and rounded real values from

    • [a,b) if a ≥ 0,
    • (a',b] if b ≤ 0, where a' = nextafter(a, -∞) is the next smaller float after a,
    • (a', b) if a < 0, b > 0.

Just to be really clear, does "as if it had randomly drawn and rounded real values" here mean "as if it had drawn real values and rounded towards zero"?

With this interpretation, I have a few follow-up questions.

My understanding is that if 0 ≤ a < b then rfpl(a, b) draws real values from [a, b) and then rounds towards 0. This means that a is a possible output of rfpl(a, b) but b is not. Makes sense.

However, if a < b ≤ 0 then rfpl(a, b) draws real values from (a',b] where a' = nextafter(a, -∞) < a. This means that, conceptually, a real value x with a' < x < a may be drawn. Since x is negative, rounding it towards zero would result in a, so in this case a is a possible return value of rfpl(a, b), and so is b.

In other words, if 0 ≤ a < b then rfpl(a,b) may return a but not b; if a < b ≤ 0 then rfpl(a,b) may return either a or b. At first sight, that seems odd. Did I understand the behaviour correctly? If so, could you explain the reasoning for this sign-dependent open/closedness of the range?

I also wonder if this relates to your point about symmetry:

  • Signed and unsigned floats behave perfectly symmetrical
curiousleo commented 4 years ago

@Shimuuar wrote:

mwc-random's method generates numbers of the form n/2^32 with 0 ≤ n < 2^32. Those are distributed uniformly. The catch is that many of them map to a single Float.

mwc-random uses multiplication, not division. In real-valued arithmetic, this would give the same result. In floating-point arithmetic, it does not. In https://github.com/idontgetoutmuch/random/issues/105 there is a comparison of the images of mwc-random's wordToFloat and wordToFloatByDivision, which actually uses division.

Excerpt:

mwc-random: wordToFloat
full  = 1065353218
count = 25165825
percentage hit = 2.362%
includes 0 = 0
includes 1 = 1

downey baseline: wordToFloatByDivision
full  = 1065353218
count = 83886081
percentage hit = 7.874%
includes 0 = 1
includes 1 = 1

See #105 for the details.

When it comes to testing hypotheses, QC could use specially trained coin-flipping hamsters and it would work.

lol

The point is that QC is very good at demolishing unfounded assumptions

:+1:

christoph-conrads commented 4 years ago

Just to be really clear, does "as if it had randomly drawn and rounded real values" here mean "as if it had drawn real values and rounded towards zero"?

Yes.

With this interpretation, I have a few follow-up questions.

My understanding is that if 0 ≤ a < b then rfpl(a, b) draws real values from [a, b) and then rounds towards 0. This means that a is a possible output of rfpl(a, b) but b is not. Makes sense.

However, if a < b ≤ 0 then rfpl(a, b) draws real values from (a',b] where a' = nextafter(a, -∞) < a. This means that, conceptually, a real value x with a' < x < a may be drawn. Since x is negative, rounding it towards zero would result in a, so in this case a is a possible return value of rfpl(a, b), and so is b.

I remembered incorrectly. RFPL returns values in [nextafter(a, +∞), nextafter(b, -∞)]. The rationale can be found below.

In other words, if 0 ≤ a < b then rfpl(a,b) may return a but not b; if a < b ≤ 0 then rfpl(a,b) may return either a or b. At first sight, that seems odd. Did I understand the behaviour correctly? If so, could you explain the reasoning for this sign-dependent open/closedness of the range?

Most random number generation APIs promise to return random numbers in the half-open interval [a, b) for all a < b. This is true, for example, for std::uniform_real_distribution in C++11 or java.util.random.doubles. Some developers seem to have understood the challenges posed by round-off and missing symmetry already. The Python3 standard library function random.uniform() returns values in the closed interval [a, b] because of the round-off (the documentation says this explicitly) and NumPy asks you to map the random values in [0,1) to [a,b) yourself, see numpy.random.random_sample().

The Rademacher FPL is stuck between a rock and a hard place because it draws random numbers from (a, b] but must return variates for [a, b). It either needs to change the bounds or it must somehow forcibly compute numbers in [a, b). RFPL does the latter because its uniform_real_distribution implementation is supposed to be usable as a drop-in replacement for std::uniform_real_distribution. There are multiple strategies to deal with this problem, e.g.,

NB: Users should pass closed intervals [a,b] to the software generating random floats.

curiousleo commented 4 years ago

@christoph-conrads wrote:

NB: Users should pass closed intervals [a,b] to the software generating random floats.

Just to be clear: is this meant as advice to someone in the position of designing an API for generating random floating point numbers in a range, i.e. "if you have the freedom to decide what to return, make it a closed interval"? If so, I would be interested in your thoughts on why that makes for a better API than the alternatives.

christoph-conrads commented 4 years ago

@christoph-conrads wrote:

NB: Users should pass closed intervals [a,b] to the software generating random floats.

Just to be clear: is this meant as advice to someone in the position of designing an API for generating random floating point numbers in a range, i.e. "if you have the freedom to decide what to return, make it a closed interval"? If so, I would be interested in your thoughts on why that makes for a better API than the alternatives.

Yes, this is meant as an API advice. Downey's algorithm computes random values in closed intervals, my algorithm computes random values in half-open intervals with the closed boundary always having the smaller modulus ([a, b) for a ≥ 0, (a, b] for signbit(b) == 1). For this reason, the API promise should be something like the following:

Given floating-point values a, b, the generator returns uniformly distributed values x such that a <= x <= b.

There are situations where you really want half-open intervals though, even on a computer. This weekend I needed random complex numbers in polar representation. The angle must be in [0, 2π) then or the value r e^{i0} will occur too often, where r is the magnitude of a complex number. I think you can come up with more examples where you must work with half-open intervals. (Pretending to work with real numbers on a computer is not a reason to offer half-open intervals.)

curiousleo commented 4 years ago

Hey all, this is my attempt at bringing the discussions about clusivity and full-coverage floating point generation together.

Definitions

A few definitions to make sure we're on the same page:

Options

What are our options regarding floating point generation? The main ones are:

One alternative I'm leaving out here is the numpy way, namely to expose only a way to draw a floating point number from the unit interval. This leaves it up to the user to turn it into a floating point number in the desired interval.

Discussion

The guarantees we can give connect the floating point generation discussion with the clusivity discussion.

The translation method gives no strong guarantees (concrete examples here):

Depending on the use case, this is not necessarily a show-stopper. For most values, the translation method generates a value within the given bounds. But the honest semantics of the translation method is not "inclusive" or "exclusive", it is "whatever the result of lo + x * (hi - lo) is" (see Python's random.uniform).
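The QuickCheck counterexample from earlier in the thread makes this concrete (translate is just the generic translation step written out, not library code):

```haskell
-- The translation step, written out. With the counterexample from earlier
-- in the thread (a = 4.2, b = 20.6 at Float precision), even x = 1.0
-- overshoots the upper bound: the result is 20.600002 > 20.6.
translate :: Float -> Float -> Float -> Float
translate lo hi x = lo + x * (hi - lo)

overshoot :: Bool
overshoot = translate 4.2 20.6 1.0 > 20.6
```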

The clusivity API is a way to communicate specific guarantees, i.e. which numbers can be generated and which cannot. As shown above, with the translation method, we don't really know. In the vast majority of cases, the method will work as expected, but not always. As such, I think that a clusivity API only makes sense if we can give guarantees, i.e. if we use a full-coverage method.

This leaves the question whether we should implement a full-coverage method.

An assumption here is that we expose either the translation method or a full-coverage method, but cannot easily do both.

Proposal

As a result of these thoughts, here is my proposal:

In addition, I intend to turn my prototype into a library that allows users to generate floating point numbers with full coverage.

While this is the less interesting path for random to take, I'd like to point out that the improvements we've made so far are already huge. I want to get those improvements out there sooner rather than later.

peteroupc commented 4 years ago

I want to point out that I am specifying pseudocode for a method (RNDRANGE(a,b)) to generate random floating-point numbers in an arbitrary range with full coverage. It uses the closed interval [a, b] because it then becomes trivial to use any other "clusivity" by rejection. In fact, seven other methods could be defined this way:

lehins commented 4 years ago

@curiousleo I support your suggestion. Just to reiterate:

I believe, it should be a fine approach as long as we document this behavior, especially for the first iteration at getting new random up and running. As you said, we already have an enormous improvement ready to get out of the door.

With regards to:

I intend to turn my prototype into a library that allows users to generate floating point numbers with full coverage.

Instead of a separate library, it might be better to add this functionality to the next version, random-1.3, whenever you do get it implemented. I can totally see this as future work of our proposal, together with customizable clusivity. As discussed with @Shimuuar, the addition of a class UniformRegion r a would allow us to specify exactly what the range will be for all sorts of types. This way we don't have to worry about it right now, but we have a plan for how to improve the library even further without any breakage.

curiousleo commented 4 years ago

@lehins yes, thanks for the clarifications. The bounds for integral and floating point numbers should be as you said.

Instead of a separate library, it might be better to add this functionality to the next version of random-1.3, whenever you do get this functionality implemented that is. I can totally see this being as future work of our proposal together with customizable clusivity.

My main goal was to clarify the scope for v1.2. My prototype implementation of the full coverage algorithm is in fact in a branch of this repo - the idea of putting it into a separate library was really just about separating that work from our goal of getting v1.2 out there.

A future release of random would of course be a great place for a full coverage floating point generator, once it has been tested and optimised and we have a good clusivity API to expose it to users.

curiousleo commented 4 years ago

I want to point out that I am specifying pseudocode for a method (RNDRANGE(a,b)) to generate random floating-point numbers in an arbitrary range with full coverage. It uses the closed interval [a, b] because it then becomes trivial to use any other "clusivity" by rejection.

Great resource, thanks for pointing it out @peteroupc!