Open curiousleo opened 4 years ago
With noninclusive intervals the problem of empty intervals immediately arises: `(a,a)`, `[a,a)` and `(a,a]` all contain no values. So all the generator can do is throw an exception when it encounters such a range. Care should also be taken to avoid over/underflows.
With floating point numbers the same approach could be used: `(a,b)` → `[next_representable a, prev_representable b]`, with maybe a special case for zero and/or denormalized numbers.
Is full coverage necessary? Even for floats the minimal normal number is 1.17549435e-38, and if generating a number takes 1 ns, such a value will be generated with probability ~1e-21 per CPU·year. For doubles we're talking about the heat death of the Universe a googol times over. I think it's quite sensible to truncate coverage at some point.
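As a rough sanity check of that figure, here is a back-of-the-envelope calculation (the 1 ns per draw is the rate assumed above, not a measurement):

```haskell
-- Probability mass of [0, smallest normal Float) times the number of draws
-- in one CPU-year at an assumed rate of one draw per nanosecond.
hitsPerCpuYear :: Double
hitsPerCpuYear = 1.17549435e-38 * (1.0e9 * 365 * 24 * 3600)  -- ≈ 4e-22, same order as ~1e-21
```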
> With noninclusive intervals the problem of empty intervals immediately arises: `(a,a)`, `[a,a)` and `(a,a]` all contain no values. So all the generator can do is throw an exception when it encounters such a range. Care should also be taken to avoid over/underflows.
Yep, those seem like important points to consider.
> With floating point numbers the same approach could be used: `(a,b)` → `[next_representable a, prev_representable b]`, with maybe a special case for zero and/or denormalized numbers.
I am not sure this holds in the strange world of floating point numbers. Here's my thinking:
The current approach is to take the unit interval and translate it into the target interval, i.e. `y = a + x * (b - a)` where `x ∈ [0,1]`. The assumption is that `y ∈ [a, b]` as a result (the more I think about floating point numbers, the less sure I am of basic things like this!).
Now letting `b' = prev_representable b`, is it the case that `y' = a + x * (b' - a)` makes `y' ∈ [a, b)`?

It is possible that `b' - a == b - a` if the exponents of `a` and `b` differ. This would make `y' == y` (given the same `x`). As a result, we don't get the intended range.
I really hope I'm wrong, but to the best of my knowledge, that's how weird floating point numbers are. If this is correct, it makes changing clusivity of floating point ranges a pain.
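To make the worry concrete, here is a small self-contained check (the bounds and the `prevFloat` helper are my own illustration, not code from `random`):

```haskell
import GHC.Float (castFloatToWord32, castWord32ToFloat)

-- Previous representable Float (only valid for positive, finite, non-zero inputs).
prevFloat :: Float -> Float
prevFloat x = castWord32ToFloat (castFloatToWord32 x - 1)

main :: IO ()
main = do
  let a  = -16777216 :: Float   -- -2^24
      b  = 2         :: Float
      b' = prevFloat b          -- 2 - 2^-23, one ulp below b
  print (b' - a == b - a)       -- True: the one-ulp difference is rounded away
  print (a + 1 * (b' - a) == b) -- True: with x = 1 we still land exactly on b
```

So at least for this pair of bounds, shrinking the upper bound by one ulp does not change the set of values the translation can produce.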
> Is full coverage necessary? Even for floats the minimal normal number is 1.17549435e-38, and if generating a number takes 1 ns, such a value will be generated with probability ~1e-21 per CPU·year. For doubles we're talking about the heat death of the Universe a googol times over. I think it's quite sensible to truncate coverage at some point.
In #105, I've run a few experiments to demonstrate that the coverage of simple algorithms is really poor, e.g. the one we currently use only hits 8e6 out of 1e9 representable floats in the unit interval. Depending on how the randomness-consuming code is set up, it may not take all that long to see that it's always getting the same 8e6 numbers.
> I am not sure this holds in the strange world of floating point numbers. Here's my thinking:
>
> The current approach is to take the unit interval and translate it into the target interval, i.e. `y = a + x * (b - a)` where `x ∈ [0,1]`. The assumption is that `y ∈ [a, b]` as a result (the more I think about floating point numbers, the less sure I am of basic things like this!).
Excellent question! I think it should work as long as `r <- uniformR [a,b] ⇒ a <= r <= b` (square brackets to signify an inclusive interval), which in turn should hold as long as `a + (b - a) = b`, which I think should hold. But I can't be sure off the cuff.
> In #105, I've run a few experiments to demonstrate that the coverage of simple algorithms is really poor
The operative word is full coverage. The question is how much coverage we need. The number of reachable numbers is a poor metric, I think: most representable numbers lie in a small neighborhood of 0 and won't be sampled in any reasonable time frame anyway. One possible metric is how likely it is that we're able to notice a sampling bias.
If we take the constant exponent method then the whole [0,1] interval has incomplete coverage!

If we take mwc-random's method, we have 2^32 uniformly distributed values. That is 9 bits more than a Float significand, so only about 2^-9 ≈ 2e-3 of the unit interval (the part closest to zero) is undersampled.
@christoph-conrads, please give your thoughts on this whole discussion.
As for this topic, I am also aware of the following:
So it seems that `uniformRM` is a function that can fail unless both of the bounds of the interval `[a,b]` are inclusive, and that is only because we agreed on the fact that `range = if a > b then [b, a] else [a, b]`.

What I suggest is to add proper input validation and error handling for the `UniformRange` class by adding `MonadThrow` and throwing an appropriate exception when a requested range is not valid:
- `EmptyRange` (e.g. `[a, a)` or `a > b` in `[a, b]`)
- `Overflow`/`Underflow` for integral ranges
- `UnsupportedValue` for things like `+/-Infinity` and `NaN`
I expect this will add a slight performance overhead, but at least it will be correct.
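To make the shape of the proposal concrete, here is a rough sketch of what such an exception type could look like (the constructor names follow the list above, but the type itself is my own illustration, not code from the draft):

```haskell
import Control.Exception (Exception)

-- Illustrative exception type for invalid ranges, mirroring the cases above.
data RangeError
  = EmptyRange        -- e.g. [a, a), or a > b in [a, b]
  | Overflow          -- integral range exceeds the type's upper bound
  | Underflow         -- integral range exceeds the type's lower bound
  | UnsupportedValue  -- e.g. a bound is +/-Infinity or NaN
  deriving (Show)

instance Exception RangeError
```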
Yes, throwing an exception is probably inevitable where IEEE 754 is concerned. If the numbers mentioned so far are not enough, here is another gem: `a + (b - a) * 0.3 = Infinity` for `a = -m_huge`, `b = m_huge` (`m_huge` being the biggest representable number). So it should be fine to throw for invalid integral ranges as well.
Another question is whether intervals `(a,b)` with `a > b` should throw or be interpreted as `(b,a)`. Such flipping was introduced in order to make `uniformRM` defined for all inputs, but since it can throw anyway, that becomes less important. I don't have a definite opinion on this matter.
> as long as `a + (b - a) = b`, which I think should hold. But I can't be sure off the cuff
One should of course think less and throw such problems at QuickCheck more. Even the easiest test (`0 < a < b`) fails:
```haskell
{-# LANGUAGE ViewPatterns #-}
import Test.QuickCheck

prop1 :: Float -> Float -> Property
prop1 (abs -> a0) (abs -> b0)
  = id
  $ counterexample ("a = " ++ show a)
  $ counterexample ("b = " ++ show b)
  $ counterexample ("b' = " ++ show (a + (b - a)))
  $ a + (b - a) == b
  where
    a = min a0 b0
    b = max a0 b0
```
And here is a failure where the result may go out of the interval:
```
*** Failed! Falsified (after 41 tests and 5 shrinks):
a = 4.2
b = 20.6
b' = 20.600002
```
I think it's time to read something about linear interpolation and
We get a bit of a chicken and egg problem with this approach, since QuickCheck uses random-1.1. Also, QuickCheck's `Gen Float` isn't generating all the useful floating point values, so extreme values like `NaN`, `+/-Infinity` and `-0` are never generated. So we do need to think more.
> One should of course think less and throw such problems at QuickCheck more
In general, though, randomized testing is a great tool and with all the work we are doing here we'll make it even better ;)
@Shimuuar wrote:
> If we take the constant exponent method then the whole [0,1] interval has incomplete coverage!
That's exactly right - hence the proposal to do away with the constant exponent method (`word32ToFloatInUnitInterval` in #105).
> If we take mwc-random's method, we have 2^32 uniformly distributed values. That is 9 bits more than a Float significand, so only about 2^-9 ≈ 2e-3 of the unit interval (the part closest to zero) is undersampled.
I'm not sure what you mean by "we have 2^32 uniformly distributed values". Out of the 2^30.0 (1065353218) representable Floats in the unit interval, mwc-random's `wordToFloat` reaches 2^24.6 (25165825). See #105.
@lehins wrote:
> What I suggest is to add proper input validation and error handling for the `UniformRange` class by adding `MonadThrow` and throwing an appropriate exception when a requested range is not valid.
Do you have a suggestion for how failure could be handled in the pure API for `UniformRange`, the current `uniformR`?

```haskell
uniformR :: (RandomGen g, UniformRange a) => g -> (a, a) -> (a, g)
```
> I'm not sure what you mean by "we have 2^32 uniformly distributed values". Out of the 2^30.0 (1065353218) representable Floats in the unit interval, mwc-random's `wordToFloat` reaches 2^24.6 (25165825). See #105.
mwc-random's method generates numbers of the form n/2^32 with 0 <= n < 2^32. Those are distributed uniformly. The catch is that many of them map to a single Float.
> We get a bit of a chicken and egg problem with this approach,
Why? When it comes to testing hypotheses, QC could use specially trained coin-flipping hamsters and it would work. The point is that QC is very good at demolishing unfounded assumptions.
> The current approach is to take the unit interval and translate it into the target interval, i.e. `y = a + x * (b - a)` where `x ∈ [0,1]`. The assumption is that `y ∈ [a, b]` as a result (the more I think about floating point numbers, the less sure I am of basic things like this!).
`a + x * (b-a)` does not work in floating-point (FP) arithmetic. Period.

- `b-a` may not be representable by a floating-point number (Example: `a=-1`, `b=1+ε`)
- `x * (b-a)` may be larger than `b-a` (Ex: `x=1`, `a=-1`, `b=1-ε/2`)
- `b=∞`
If you insist on this approach, you do not need to discuss intricacies of interval boundaries because the rounding error makes the discussion moot.
> I really hope I'm wrong, but to the best of my knowledge, that's how weird floating point numbers are.
Floating-point numbers were designed to
This behavior is in no way helpful when computing random numbers. Instead, for random number generation, it's best to think of FP numbers as integer pairs on the real line with special spacing.
An FP number is composed of a sign bit, an exponent `e`, and a significand (or mantissa) `s`. Because of the sign bit, FP numbers are distributed symmetrically around zero, so we can ignore the sign bit for this introduction. There is also an exponent bias, which you can ignore when computing random floats.
The figure below shows the ranges that are covered by FP numbers with different exponents; covered means that a real value from this range is best approximated by a float with the exponent shown. Zero is on the left-hand side, positive infinity is on the right-hand side, and each vertical bar denotes the start of an interval where the best FP approximation of a real value has exponent `e`.
All real numbers in `[0, 2^0)` are best approximated by FP numbers with exponent `e = 0`. This interval starts at the left-most vertical bar and ends before the next vertical bar. The real numbers in `[2^0, 2^1)` are best approximated by floats with exponent `e = 1`; this interval starts at the second vertical bar and ends just before the next vertical bar. The best FP approximation for real values in `[2^1, 2^2)` possesses exponent `e = 2`. Naturally, the best FP approximation for a real value in `[2^(e-1), 2^e)` has exponent `e` (if `e ≥ 0`).
```
   e=0 e=1   e=2         e=3                     e=4
0 |---|---|-------|---------------|-------------------------------| (toward ∞)
```
Let's say you picked an exponent `e` and you know the sign. Now you have to draw a significand. Within the intervals covered by floats with exponent `e`, the FP numbers are equally spaced:
```
s=0 s=1 s=2 s=3 ... s=2^{s_max}-1
|---|---|---|---....|---|
```
What this means for random number generation:

- drawing `e=0` should be as likely as drawing `e=1`,
- drawing `e' > e > 0` should be `2^(e'-e)` times more likely than drawing `e`.

Let's write FP numbers as the combination of sign, exponent, and significand they are. Let's draw random numbers in the half-open interval between `a = (+, 0, 0)` (`a` represents zero) and `b = (+, 0, 10)`. The calculated random number `x` must then have positive sign, exponent `e=0`, and a random significand `0 ≤ s < 10`.
Let's say `a = (+,0,0)` (real value zero) and `b = (+, e, 0)` (real value `2^e`, ignoring the bias), half-open interval. We can now generate a float by:

1. drawing a random integer `k` in `[0, 2^e-1]`;
2. mapping `k` to get the exponent `e`: `e = 0` if `k = 0`, `e = floor(log2(k)) + 1` otherwise;
3. drawing a random significand `s`.

[Note that `floor(log2(k))` can be computed quickly for integers using CPU instructions counting the leading zeros of integers.]
Based on the triple composed of sign, exponent, significand you can start deriving the rules for random number generation in arbitrary intervals.
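As a small illustration of the exponent-mapping step above, here is a sketch in Haskell (the names `floorLog2` and `exponentOf` are mine, not from any library, and this covers only the mapping step, not the full generator):

```haskell
import Data.Bits (countLeadingZeros, finiteBitSize)
import Data.Word (Word64)

-- floor(log2 k) for k > 0, via the CPU's count-leading-zeros instruction.
floorLog2 :: Word64 -> Int
floorLog2 k = finiteBitSize k - 1 - countLeadingZeros k

-- Map a uniformly drawn integer k in [0, 2^e - 1] to an exponent: 0 maps to
-- exponent 0, otherwise floor(log2 k) + 1.  Each exponent then comes up twice
-- as often as the next smaller one, matching the interval widths above.
exponentOf :: Word64 -> Int
exponentOf 0 = 0
exponentOf k = floorLog2 k + 1
```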
> Do you have a suggestion for how failure could be handled in the pure API for `UniformRange`, the current `uniformR`?
@curiousleo Very simple: it also becomes monadic with a `MonadThrow` constraint, and the user can then choose to fail in `Maybe`, `Either`, `IO` etc.

```haskell
uniformR :: (RandomGen g, UniformRange a, MonadThrow m) => g -> (a, a) -> m (a, g)
uniformR g r = runGenStateT g (uniformRM r)
```
In case of `randomR` it would translate into a partial function of course, but that is OK in my opinion, since the `Random` class does not adhere to the full correctness that we set for the `uniform` functions anyway.
I created a draft PR that shows how `MonadThrow` can be utilized for range failure: #115
In the interest of keeping the discussion in one place, I'm responding to https://github.com/idontgetoutmuch/random/pull/102#issuecomment-620515471 here.
Your explanations are very much appreciated @christoph-conrads!
> uses round-toward-zero in its mathematical model
I'm curious about this one. What's the motivation for / what are the consequences of this choice?
Intuitively, it felt right to me for all significands to have equal probability when the exponent is known; then you have rounding toward zero.
Consequences:
- Given the exponent, all significands are equally likely.
- Signed and unsigned floats behave perfectly symmetrically (they would not with, e.g., rounding toward positive infinity).
- You must treat +0 and -0 as different numbers (yes, RFPL can draw from `[-0,+0]`).
- Given real numbers `a < b` that can be represented exactly in floating-point arithmetic, the generator returns FP values as if it had randomly drawn and rounded real values from
  - `[a,b)` if `a ≥ 0`,
  - `(a',b]` if `b ≤ 0`, where `a' = nextafter(a, -∞)` is the next smaller float after `a`,
  - `(a', b)` if `a < 0`, `b > 0`.
Just to be really clear, does "as if it had randomly drawn and rounded real values" here mean "as if it had drawn real values and rounded towards zero"?
With this interpretation, I have a few follow-up questions.
My understanding is that if `0 ≤ a < b` then `rpfl(a, b)` draws real values from `[a, b)` and then rounds towards 0. This means that `a` is a possible output of `rpfl(a, b)` but `b` is not. Makes sense.
However, if `a < b ≤ 0` then `rpfl(a, b)` draws real values from `(a',b]` where `a' = nextafter(a, -∞) < a`. This means that conceptually, a real value `x` with `a' < x < a` may be drawn. Since `x` is negative, rounding it towards zero would result in `a`, so in this case, `a` is a possible return value of `rpfl(a, b)`, and so is `b`.
In other words, if `0 ≤ a < b` then `rpfl(a,b)` may return `a` but not `b`; if `a < b ≤ 0` then `rpfl(a,b)` may return either `a` or `b`. At first sight, that seems odd. Did I understand the behaviour correctly? If so, could you explain the reasoning for this sign-dependent open/closedness of the range?
I also wonder if this relates to your point about symmetry:
> Signed and unsigned floats behave perfectly symmetrically
@Shimuuar wrote:
> mwc-random's method generates numbers of the form n/2^32 with 0 <= n < 2^32. Those are distributed uniformly. The catch is that many of them map to a single Float.
mwc-random uses multiplication, not division. In real-valued arithmetic, this would give the same result. In floating-point arithmetic, it does not. In https://github.com/idontgetoutmuch/random/issues/105 there is a comparison of the images of mwc-random's `wordToFloat` and `wordToFloatByDivision`, which actually uses division.
Excerpt:
```
mwc-random: wordToFloat
full = 1065353218
count = 25165825
percentage hit = 2.362%
includes 0 = 0
includes 1 = 1

downey baseline: wordToFloatByDivision
full = 1065353218
count = 83886081
percentage hit = 7.874%
includes 0 = 1
includes 1 = 1
```
See #105 for the details.
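For readers wondering what "multiplication vs. division" means concretely, here is a rough sketch of the two kinds of mapping (simplified and written from memory; see #105 and mwc-random for the actual definitions used in the experiments):

```haskell
import Data.Int  (Int32)
import Data.Word (Word32)

-- Division-style mapping: converting to Float already rounds to 24 significand
-- bits, and the subsequent scaling by an exact power of two is exact.
wordToFloatByDivision :: Word32 -> Float
wordToFloatByDivision w = fromIntegral w / 4294967296  -- 2^32

-- Multiplication-style mapping roughly in the spirit of mwc-random: reinterpret
-- the word as signed, scale into [-0.5, 0.5) and re-centre.  The final additions
-- happen where Float has a spacing of only 2^-24, which collapses many distinct
-- inputs onto the same output.
wordToFloatByMultiplication :: Word32 -> Float
wordToFloatByMultiplication w =
  fromIntegral (fromIntegral w :: Int32) * 2.3283064365386963e-10  -- 2^-32
    + 0.5
    + 1.1641532182693481e-10                                       -- 2^-33
```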
> When it comes to testing hypotheses, QC could use specially trained coin-flipping hamsters and it would work.
lol
> The point is that QC is very good at demolishing unfounded assumptions
:+1:
> Just to be really clear, does "as if it had randomly drawn and rounded real values" here mean "as if it had drawn real values and rounded towards zero"?
Yes.
> With this interpretation, I have a few follow-up questions.
>
> My understanding is that if `0 ≤ a < b` then `rpfl(a, b)` draws real values from `[a, b)` and then rounds towards 0. This means that `a` is a possible output of `rpfl(a, b)` but `b` is not. Makes sense.
>
> However, if `a < b ≤ 0` then `rpfl(a, b)` draws real values from `(a',b]` where `a' = nextafter(a, -∞) < a`. This means that conceptually, a real value `x` with `a' < x < a` may be drawn. Since `x` is negative, rounding it towards zero would result in `a`, so in this case, `a` is a possible return value of `rpfl(a, b)`, and so is `b`.
I remembered incorrectly. RFPL returns values in `[nextafter(a, +∞), nextafter(b, -∞)]`. The rationale can be found below.
> In other words, if `0 ≤ a < b` then `rpfl(a,b)` may return `a` but not `b`; if `a < b ≤ 0` then `rpfl(a,b)` may return either `a` or `b`. At first sight, that seems odd. Did I understand the behaviour correctly? If so, could you explain the reasoning for this sign-dependent open/closedness of the range?
Most random number generation APIs promise to return random numbers in the half-open interval `[a, b)` for all `a < b`. This is true, for example, for `std::uniform_real_distribution` in C++11 or `java.util.random.doubles`. Some developers seem to have understood the challenges posed by round-off and missing symmetry already. The Python 3 standard library function `random.uniform()` returns values in the closed interval `[a, b]` because of the round-off (the documentation says this explicitly), and NumPy asks you to map the random values in `[0,1)` to `[a,b)` yourself, see `numpy.random.random_sample()`.
The Rademacher FPL is stuck between a rock and a hard place because it draws random numbers from `(a, b]` but it must return variates for `[a, b)`. It either needs to change the bounds or it must somehow forcibly compute numbers in `[a, b)`. RFPL does the latter because its `uniform_real_distribution` implementation is supposed to be usable as a drop-in replacement for `std::uniform_real_distribution`. There are multiple strategies to deal with this problem, e.g.,

- `a` will never be drawn because of round-toward-zero and `b` will never be drawn because it is outside of the interval (currently done by RFPL, see case 1 in `include/rademacher-fpl/impl/uniform-real-distribution.hpp`);
- draw from `(a, b]` and return `a` whenever `b` is drawn.
NB: Users should pass closed intervals `[a,b]` to the software generating random floats.
@christoph-conrads wrote:
> NB: Users should pass closed intervals [a,b] to the software generating random floats.
Just to be clear: is this meant as advice to someone in the position of designing an API for generating random floating point numbers in a range, i.e. "if you have the freedom to decide what to return, make it a closed interval"? If so, I would be interested in your thoughts on why that makes for a better API than the alternatives.
> @christoph-conrads wrote:
>
> > NB: Users should pass closed intervals [a,b] to the software generating random floats.
>
> Just to be clear: is this meant as advice to someone in the position of designing an API for generating random floating point numbers in a range, i.e. "if you have the freedom to decide what to return, make it a closed interval"? If so, I would be interested in your thoughts on why that makes for a better API than the alternatives.
Yes, this is meant as API advice. Downey's algorithm computes random values in closed intervals; my algorithm computes random values in half-open intervals with the closed boundary always having the smaller modulus (`[a, b)` for `a ≥ 0`, `(a, b]` for `signbit(b) == 1`). For this reason, the API promise should be something like the following:

> Given floating-point values `a`, `b`, the generator returns uniformly distributed values `x` such that `a <= x <= b`.
There are situations where you really want half-open intervals though, even on a computer. This weekend I needed random complex numbers in polar representation. The angle must then be in `[0, 2π)`, or the value `r e^{i0}` will occur too often, where `r` is the magnitude of a complex number. I think you can come up with more examples where you must work with half-open intervals. (Pretending to work with real numbers on a computer is not a reason to offer half-open intervals.)
Hey all, this is my attempt at bringing the discussions about clusivity and full-coverage floating point generation together.
A few definitions to make sure we're on the same page:

- Clusivity describes whether a range is inclusive or exclusive in its bounds.
- Coverage describes how many representable values in a range can be reached.
What are our options regarding floating point generation? The main ones are:

- Generate `x` in the unit interval and translate it into the target interval via `lo + x * (hi - lo)`. Pro: simple code, fast. Con: low coverage, no strong guarantees.

One alternative I'm leaving out here is the numpy way, namely to expose only a way to draw a floating point number from the unit interval. This leaves it up to the user to turn it into a floating point number in the desired interval.
The guarantees we can give connect the floating point generation discussion with the clusivity discussion.
The translation method gives no strong guarantees (concrete examples here):

- if `lo` is very small and `hi` is very large, it returns NaN since then `hi - lo` is not representable;
- for `x = 1` and `lo` and `hi` sufficiently small, the number it returns is larger than `hi`.

Depending on the use case, this is not necessarily a show-stopper. For most values, the translation method generates a value within the given bounds. But the honest semantics of the translation method is not "inclusive" or "exclusive", it is "whatever the result of `lo + x * (hi - lo)` is" (see Python's `random.uniform`).
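As a small, self-contained illustration of these two failure modes (the first pair of bounds is my own pick; the second is the counterexample QuickCheck found earlier in this thread):

```haskell
main :: IO ()
main = do
  let translate lo hi x = lo + x * (hi - lo)
  -- hi - lo overflows to Infinity, so x = 0 yields lo + 0 * Infinity = NaN
  print (translate (-3.0e38) 3.0e38 (0 :: Float))   -- NaN
  -- rounding pushes the result past hi for x = 1
  print (translate 4.2 20.6 (1 :: Float) > 20.6)    -- True
```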
The clusivity API is a way to communicate specific guarantees, i.e. which numbers can be generated and which cannot. As shown above, with the translation method, we don't really know. In the vast majority of cases, the method will work as expected, but not always. As such, I think that a clusivity API only makes sense if we can give guarantees, i.e. if we use a full-coverage method.
This leaves the question whether we should implement a full-coverage method.
An assumption here is that we expose either the translation method or a full-coverage method, but cannot easily do both.
As a result of these thoughts, here is my proposal:

- for the floating point `UniformRange` instances, the semantics of a range is whatever the result of `lo + x * (hi - lo)` is;
- generate `x` (unit interval) using a simple method; using division as we are right now works just fine.

In addition, I intend to turn my prototype into a library that allows users to generate floating point numbers with full coverage.
While this is the less interesting path for `random` to take, I'd like to point out that the improvements we've made so far are already huge. I want to get those improvements out there sooner rather than later.
I want to point out that I am specifying pseudocode for a method (`RNDRANGE(a,b)`) to generate random floating-point numbers in an arbitrary range with full coverage. It uses the closed interval [a, b] because it then becomes trivial to use any other "clusivity" by rejection. In fact, seven other methods could be defined this way:

- three methods between `a` and `b`, but with the other "clusivities", derived using rejection;
- four methods between `0` and `1`, with all four "clusivities".
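As a rough sketch of the rejection idea (in Haskell rather than the pseudocode; `closedGen` stands in for a closed-interval generator like `RNDRANGE(a,b)` and is an assumed argument, not an existing function):

```haskell
-- Derive an exclusive-exclusive (a, b) generator from a closed-interval [a, b]
-- generator by rejecting the endpoints; the other "clusivities" work the same
-- way, rejecting only the endpoint that should be excluded.
openFromClosed :: Monad m => (Double -> Double -> m Double) -> Double -> Double -> m Double
openFromClosed closedGen a b = go
  where
    go = do
      x <- closedGen a b
      if x == a || x == b then go else pure x
```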
@curiousleo I support your suggestion. Just to reiterate:
- integral ranges are inclusive `[a, b]`;
- floating point values are generated in the inclusive unit range `[0, 1]`, but when this value in the unit range is scaled to some other range, it will be a best effort at producing an inclusive range, and there can't be any hard guarantees that this will be the case (i.e. your counterexamples in https://github.com/idontgetoutmuch/random/issues/105#issuecomment-621335855).

I believe it should be a fine approach, as long as we document this behavior, especially for the first iteration at getting the new `random` up and running. As you said, we already have an enormous improvement ready to get out of the door.
With regards to:
> I intend to turn my prototype into a library that allows users to generate floating point numbers with full coverage.
Instead of a separate library, it might be better to add this functionality to the next version of `random`, `random-1.3`, whenever you do get this functionality implemented, that is. I can totally see this being future work for our proposal, together with customizable clusivity. As we discussed with @Shimuuar, the addition of a class `UniformRegion r a` would allow us to specify exactly what the range will be for all sorts of types. This way we don't have to worry about it right now, but have a plan for how we can improve the library even further without any breakage.
@lehins yes, thanks for the clarifications. The bounds for integral and floating point numbers should be as you said.
> Instead of a separate library, it might be better to add this functionality to the next version of `random`, `random-1.3`, whenever you do get this functionality implemented, that is. I can totally see this being future work for our proposal, together with customizable clusivity.
My main goal was to clarify the scope for v1.2. My prototype implementation of the full coverage algorithm is in fact in a branch of this repo - the idea of putting it into a separate library was really just about separating that work from our goal of getting v1.2 out there.
A future release of `random` would of course be a great place for a full coverage floating point generator, once it has been tested and optimised and we have a good clusivity API to expose it to users.
> I want to point out that I am specifying pseudocode for a method (`RNDRANGE(a,b)`) to generate random floating-point numbers in an arbitrary range with full coverage. It uses the closed interval [a, b] because it then becomes trivial to use any other "clusivity" by rejection.
Great resource, thanks for pointing it out @peteroupc!
Recent discussions about pseudo-random floating point numbers have revealed potential shortcomings of the current API in two aspects.
Clusivity describes whether a range is inclusive or exclusive in its bounds.
Coverage describes how many representable values in a range can be reached.
**Integral types**

For integral types, coverage is not a problem: all `Uniform` and `UniformRange` instances for integral types we currently have generate all representable values in the relevant ranges.

Clusivity is also simpler for integral types. It is straightforward to express exclusive ranges in terms of inclusive ranges and vice versa. Concretely, `(a, b) == [a+1, b-1]`, for example.

**Non-integral types**
For non-integral types, full coverage for arbitrary ranges is difficult. The only project I know of that achieves this is rademacher-fpl.
We can improve coverage by generating a floating point number in the unit interval with full coverage and then translating it into the target range. This is the approach taken in #102, more details in https://github.com/idontgetoutmuch/random/issues/105.
The performance overhead to improve coverage in this way is small but non-zero, see #102.
Clusivity is also more complicated for non-integral types.
We have `[a, b) == (a, b] + x` and `(a, b] == [a, b) - x` for some `x`. To go from an inclusive-inclusive range to any other range, I believe that rejection sampling is necessary.

In addition, I know of no simple method (without full coverage) that generates an inclusive-inclusive range.
To summarise which methods generate which clusivity:
As a result, clusivity and coverage appear to be linked for non-integral types.
**Proposals and questions**
Q: Do users benefit from being able to control coverage? Reduced coverage in itself is never useful, but some users may want the performance boost. See #102 for benchmark results.
https://github.com/idontgetoutmuch/random/pull/104 as it currently stands exposes clusivity in the API.
https://github.com/idontgetoutmuch/random/pull/104#discussion_r412742939 (snippet no. 2) suggests a more general way to parameterise ranges. AFAICT this could be used to expose different coverages too, as well as more general "intervals" like an open / closed complex disk.