idontgetoutmuch / random

Random number library

Floating point numbers #105

Open curiousleo opened 4 years ago

curiousleo commented 4 years ago

This issue motivates and gives context for https://github.com/idontgetoutmuch/random/pull/102, where I have implemented a method that lets us generate every representable number in the inclusive unit interval with the correct probability based on Downey's method [1].


Float representation

There are broadly two classes of methods for creating a random floating point number in the unit interval: non-injective ones and injective ones.

As a quick refresher, an IEEE binary32 floating point number ("float") is laid out as follows (the | symbols are just for illustration):

b31 | b30 ... b23 | b22 ... b0

which is interpreted as the number

(-1)^(b31) * 2^([b30 ... b23] - 127) * 1.[b22 .. b0]

where [bx .. by] is the number we get if we interpret the bits bx to by written consecutively as a binary number.
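This decomposition is easy to check concretely. A small Python sketch (struct stands in for a bit cast; the field names follow the layout above):

```python
import struct

def decode_binary32(x):
    # Reinterpret the float's IEEE binary32 encoding as a 32-bit word.
    (bits,) = struct.unpack('<I', struct.pack('<f', x))
    sign = bits >> 31                  # b31
    biased_exp = (bits >> 23) & 0xFF   # [b30 ... b23]
    mantissa = bits & 0x7FFFFF         # [b22 ... b0]
    return sign, biased_exp, mantissa
```

For example, decode_binary32(1.0) gives (0, 127, 0): sign 0, biased exponent 127 (unbiased 0), mantissa 0, i.e. (-1)^0 * 2^0 * 1.0.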

Float is the set of all representable IEEE binary32 floating point numbers. Let's refer to the subset of Float that represents the range [0,1] as UnitFloat.

What form do the numbers in UnitFloat take? They are 0, the subnormal numbers, the normal numbers 2^e * 1.[b22 ... b0] for -126 ≤ e ≤ -1, and 1 itself.

Non-injective (Word32 -> UnitFloat)

Constant exponent

The method used right now is this:

(castWord32ToFloat $ (w32 `unsafeShiftR` 9) .|. 0x3f800000) - 1.0

where w32 is a random Word32.

What is happening here? The shift by 9 moves the top 23 bits of the Word32 into the lowest 23 bits, i.e. the mantissa, filling it with random bits. The .|. 0x3f800000 sets the upper nine bits to 0|01111111: sign bit 0 and biased exponent 127, i.e. unbiased exponent 0.

In other words, this method generates numbers of the form

(-1)^0 * 2^0 * 1.[b22 ... b0] = 1.[b22 ... b0]

These numbers all lie in [1, 2). Subtracting 1.0 then yields numbers in the range [0, 1).
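For illustration, here is the same constant-exponent trick in Python (struct plays the role of castWord32ToFloat; the final subtraction happens in double precision here, but it is exact, so the set of results is the same):

```python
import struct

def word32_to_unit_float(w32):
    # Top 23 bits of w32 become the mantissa; the exponent field is
    # forced to 127 (unbiased exponent 0), giving a float in [1, 2).
    bits = (w32 >> 9) | 0x3F800000
    (f,) = struct.unpack('<f', struct.pack('<I', bits))
    return f - 1.0  # shift [1, 2) down to [0, 1)
```

Only 2^23 of the 2^32 inputs produce distinct outputs, which is the coverage limitation discussed below.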

AFAICT the v1.1 method does something very similar.

Multiplication

mwc-random uses multiplication and addition to scale a Word32 into the (0,1] range.
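As a sketch of that approach (modelled on mwc-random's wordToFloat as I remember it; the constants 2^-32 and 2^-33 are my assumption, and Python's double arithmetic stands in for Float, so the rounding at the top of the range differs from the real implementation):

```python
def word_to_float(w32):
    # Reinterpret the Word32 as a signed Int32, then scale and shift
    # into the unit interval: i * 2^-32 lies in [-1/2, 1/2), and adding
    # 1/2 + 2^-33 recentres it just above 0.
    i = w32 - 2**32 if w32 >= 2**31 else w32
    return i * 2.0**-32 + 0.5 + 2.0**-33
```

In 32-bit Float arithmetic the largest result rounds up to 1.0, which is why the comparison below reports that this method includes 1.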

Comparison

Fortunately, it is possible to simply enumerate all Word32 values. This is what I've done in the following.

random: word32ToFloatInUnitInterval
full  = 1065353218
count = 8388608
percentage hit = 0.787%
includes 0 = 1
includes 1 = 0

mwc-random: wordToFloat
full  = 1065353218
count = 25165825
percentage hit = 2.362%
includes 0 = 0
includes 1 = 1

downey baseline: wordToFloatByDivision
full  = 1065353218
count = 83886081
percentage hit = 7.874%
includes 0 = 1
includes 1 = 1

downey baseline: wordToFloatByDivisionDouble
full  = 1065353218
count = 83886081
percentage hit = 7.874%
includes 0 = 1
includes 1 = 1

java: nextFloat/wordToFloatByDivision24
full  = 1065353218
count = 16777216
percentage hit = 1.5748031466499028%
includes 0 = 1
includes 1 = 0

Generated by https://gist.github.com/curiousleo/d8222f63be859f8207e805e71d13095a.

Injective (Word150 -> UnitFloat)

What would we need to do to actually hit every possible UnitFloat with the correct probability? Downey [1] shows one method.

Roughly, it amounts to the following: draw the unbiased exponent from a geometric distribution by tossing a fair coin, starting at -1 and decrementing by one for each tails until the first heads (stopping at the minimum exponent); then fill the 23 mantissa bits with uniform random bits.

This maps to every possible float in the range [0, 2^-1*1.[11 ... 11]]. To de-bias the method and include 1, the following correction is made: if the mantissa turns out to be all zeros, the exponent is incremented by one with probability 1/2.

I won't go into the probabilistic justification for this here, but it should be clear that this makes it possible to hit 1: exactly when the mantissa was all zeros, the unbiased exponent before the correction was -1 and the coin toss results in increasing the exponent by one. Then we get 2^0 = 1.

This process consumes 150 bits of randomness overall (in most cases, a lot less: we can stop the Bernoulli trials once we've had a single success). It is injective: for every UnitFloat, we can find at least one conceptual Word150 that maps to it.
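To make this concrete, here is a rough Python sketch of the procedure (simplified: it stops the coin tosses at the minimum normal exponent and omits the subnormal handling, so 0 itself is not produced, and double arithmetic stands in for Float):

```python
import random

def downey_float(rng):
    # Geometric exponent: start at -1 and decrement once per tails
    # until the first heads, stopping at -126.
    exp = -1
    while exp > -126 and rng.getrandbits(1) == 0:
        exp -= 1
    # 23 uniform mantissa bits.
    mant = rng.getrandbits(23)
    # De-biasing correction: if the mantissa is all zeros, increment
    # the exponent with probability 1/2 -- this is what makes 1 reachable.
    if mant == 0 and rng.getrandbits(1) == 1:
        exp += 1
    return (1.0 + mant * 2.0**-23) * 2.0**exp
```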

[1] http://allendowney.com/research/rand/


Double

The same considerations apply to generating uniformly distributed Double values. Here, Downey's method consumes up to 1022 bits for Bernoulli trials + 52 bits for the mantissa + 1 bit for the exponent correction = 1075 bits.


Unit interval vs. general range

In order to generate a floating point number in an arbitrary interval [lo, hi] with lo < hi, we can generate x in the unit interval and then calculate lo + x * (hi - lo).
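As a one-line reference point for the discussion below (exact at the endpoints only for benign values of lo and hi):

```python
def scale_to_range(lo, hi, x):
    # x is assumed to lie in the unit interval [0, 1].
    return lo + x * (hi - lo)
```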

Alternatively, it is conceivable to use a variant of Downey's method to directly generate a floating point number uniformly distributed over an arbitrary interval. That is what rademacher-fpl [2] does. Pro: it's "perfect": hitting every representable number in the range with the right probability. Con: it's pretty involved. Generating a floating point number in the unit interval is a very simple special case of the more general procedure.

In order to weigh these options, I wonder how we can analyse the quality of lo + x * (hi - lo): how many representable numbers in the resulting range can we hit in this way? Also, is there a better way to do this calculation, perhaps x * hi - lo * (x - 1), that improves the "hit rate"?

[2] https://gitlab.com/christoph-conrads/rademacher-fpl/-/blob/master/include/rademacher-fpl/impl/uniform-real-distribution.hpp

curiousleo commented 4 years ago

Relevant: https://github.com/idontgetoutmuch/random/issues/113#issuecomment-620579146, which suggests thinking about FP numbers as triples (sign, exponent, significand) explicitly.

curiousleo commented 4 years ago

Benchmarks for the different methods, as implemented here: https://gist.github.com/curiousleo/d8222f63be859f8207e805e71d13095a

word32ToFloatInUnitInterval:
pure/random/Float                        mean 334.2 μs  ( +- 10.47 μs  )
pure/uniformR/unbounded/Float            mean 335.0 μs  ( +- 7.438 μs  )

wordToFloat:
pure/random/Float                        mean 28.78 μs  ( +- 2.373 μs  )
pure/uniformR/unbounded/Float            mean 28.63 μs  ( +- 2.400 μs  )

wordToFloatByDivision:
pure/random/Float                        mean 28.20 μs  ( +- 2.218 μs  )
pure/uniformR/unbounded/Float            mean 27.62 μs  ( +- 950.8 ns  )

It appears that of these "simple" methods (non-Downey-inspired), wordToFloatByDivision has both the largest coverage (see issue description) and is no slower than any other method.

idontgetoutmuch commented 4 years ago

I think this says we should go with the method used in MWC i.e. select 32 bits convert to double, multiply by 2^-23 - this is a compromise between performance and coverage and much better than is in 1.1.

Also simplifies the code base as we can delete CastFloatWord.cmm

curiousleo commented 4 years ago

> I think this says we should go with the method used in MWC i.e. select 32 bits convert to double, multiply by 2^-23 - this is a compromise between performance and coverage and much better than is in 1.1.
>
> Also simplifies the code base as we can delete CastFloatWord.cmm

What is the advantage of mwc-random's wordToFloat over wordToFloatByDivision? They have the same performance (within each other's stddev), but wordToFloatByDivision has a larger coverage.

idontgetoutmuch commented 4 years ago

Oh sorry, I thought they were the same, but let's use the one with better coverage.

curiousleo commented 4 years ago

Based on @christoph-conrads' comments on https://github.com/idontgetoutmuch/random/issues/113, I've started looking into shortcomings of the translation approach, i.e. "given a ≤ b, generate x in the unit interval and return a + x * (b - a)", using the marvelous sbv package. Here's what I've found so far:

Translating Float

∀ a b x. a ≤ b & 0 ≤ x ≤ 1 ⇒ a ≤ a + x × (b - a)
Falsifiable. Counter-example:
  a               = -1.7159568e38 :: Float
  b               =  1.7159568e38 :: Float
  x               =           0.0 :: Float
  a + x * (b - a) =           NaN :: Float

∀ a b x. a ≤ b & 0 < x < 1 ⇒ a ≤ a + x × (b - a)
Falsifiable. Counter-example:
  a               =    -Infinity :: Float
  b               = -4.2949673e9 :: Float
  x               = 4.656614e-10 :: Float
  a + x * (b - a) =          NaN :: Float

∀ a b x. a ≤ b & 0 ≤ x ≤ 1 ⇒ a + x × (b - a) ≤ b
Falsifiable. Counter-example:
  a               = -4.7021254e-38 :: Float
  b               =     -1.481e-42 :: Float
  x               =            1.0 :: Float
  a + x * (b - a) =      -1.48e-42 :: Float

∀ a b x. a ≤ b & 0 < x < 1 ⇒ a + x × (b - a) ≤ b
[still running ...]

Translating Double

∀ a b x. a ≤ b & 0 ≤ x ≤ 1 ⇒ a ≤ a + x × (b - a)
Falsifiable. Counter-example:
  a               = -3.511119404031257e305 :: Double
  b               = 1.7941820154582846e308 :: Double
  x               =                    0.0 :: Double
  a + x * (b - a) =                    NaN :: Double

∀ a b x. a ≤ b & 0 < x < 1 ⇒ a ≤ a + x × (b - a)
Falsifiable. Counter-example:
  a               =              -Infinity :: Double
  b               = -1.39009830559643e-309 :: Double
  x               = 2.786775579063254e-309 :: Double
  a + x * (b - a) =                    NaN :: Double

∀ a b x. a ≤ b & 0 ≤ x ≤ 1 ⇒ a + x × (b - a) ≤ b
Falsifiable. Counter-example:
  a               =   -1.84467440737107e19 :: Double
  b               = -2.3364633538570005e12 :: Double
  x               =                    1.0 :: Double
  a + x * (b - a) =     -2.336463353856e12 :: Double

∀ a b x. a ≤ b & 0 < x < 1 ⇒ a + x × (b - a) ≤ b
[still running ...]
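These counter-examples can be reproduced directly, since Python floats are IEEE binary64. A quick sanity check of two of them:

```python
import math

# -Infinity case: b - a overflows to +Infinity, x * Infinity is
# +Infinity, and -Infinity + Infinity is NaN.
a = float('-inf')
b = -1.39009830559643e-309
x = 2.786775579063254e-309
assert math.isnan(a + x * (b - a))

# Overshoot case: rounding in b - a makes the result exceed b.
a2 = -1.84467440737107e19
b2 = -2.3364633538570005e12
u = a2 + 1.0 * (b2 - a2)
assert u > b2  # u is -2.336463353856e12, slightly above b2
```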
christoph-conrads commented 4 years ago

> Based on @christoph-conrads' comments on #113, I've started looking into shortcomings of the translation approach, i.e. "given a ≤ b, generate x in the unit interval and return a + x * (b - a)", using the marvelous sbv package. Here's what I've found so far:

My comments are misleading in that they obscure the fundamental reason why the affine transformation a + x * (b-a) cannot work.

When computing random integers, it is well known that it may be necessary to reject certain random number generator (RNG) variates. For example, given an RNG computing values 0, 1, 2 with equal probability, the only way to select between an apple and a pear with 50% probability is by rejection: map 0 to the apple, 1 to the pear, and redraw whenever the RNG outputs 2.

Floating-point numbers are a finite set, and so we must reject values, too. As an example, you are given an RNG with output in [0,1) and you want uniformly distributed floating-point values in [0, 0.75). To simplify the problem, consider an FP format with two significand bits, such that you have the FP values 0, 1/8, 2/8, 3/8, 4/8, 6/8 in [0, 1). Because of the interval definition, the value 0.5 should be drawn with probability 1/3.

Here is the key question: How do you map the available set of numbers to [0, 0.75) to achieve this?

To help you, here is a table with the probabilities.

FP value | Probability | Exponent
---------|-------------|---------
0        | 1/8         | 0
1/8      | 1/8         | 0
2/8      | 1/8         | 1
3/8      | 1/8         | 1
4/8      | 1/4         | 2
6/8      | 1/4         | 2
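The probability column can be checked mechanically: under a uniform draw from [0, 1), each grid value's probability is the width of the interval it rounds down from. A small Python check using exact rationals:

```python
from fractions import Fraction as F

# The toy floating-point values in [0, 1) from the table above.
grid = [F(0), F(1, 8), F(2, 8), F(3, 8), F(4, 8), F(6, 8)]

def probabilities(grid):
    # Each grid value "owns" the half-open interval up to its
    # successor (the last one extends to 1); its probability under
    # a uniform draw from [0, 1) is that interval's width.
    return {v: nxt - v for v, nxt in zip(grid, grid[1:] + [F(1)])}

p = probabilities(grid)
```

This reproduces the table: 1/8 for the first four values and 1/4 for 4/8 and 6/8, all dyadic widths that can never sum to exactly 1/3.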

There is no way to sum up a finite number of values 2^-k to yield exactly 1/3. So the correct answer is: Draw from the RNG and redraw if the output is >=0.75.

This leads me to a nice conjecture:

Given floating-point numbers a, b and x with a ≥ 0 and x uniformly distributed over 0 ≤ x < 1, computing x * (b - a) + a yields a uniform distribution if and only if rounding toward zero is used and b - a is a power of two (in real arithmetic).

Algorithm:

def uniform(a, b):
    assert a <= b

    if b <= 0:
        return -uniform(-b, -a)

    if a < 0 and b > 0:
        x = next_larger_power_of_two(abs(a))
        y = next_larger_power_of_two(abs(b))
        u = uniform(0, x) with probability x/(x+y), uniform(0, y) otherwise

        check if u is in [a,b], try again if not

        return u

    while True:
        x = random01()
        d = b-a with upward rounding
        e = next_larger_power_of_two(d)

        u = x * e + a with rounding toward zero

        if a <= u and u <= b:
            return u
curiousleo commented 4 years ago

@christoph-conrads thank you so much for this clear explanation! When I was trying to come up with an algorithm, I was thinking along similar lines but with much less clarity :)

I believe that the code has 1.5 bugs:

- if b = 0 then it will loop forever, unless I'm missing some subtlety related to the different zeros (in this case, was the intention that -0 < +0? According to Wikipedia, that is not the case ...)
- in the a < 0 && 0 < b case, I believe there is a sign mistake: u = uniform(0, x) should read u = -uniform(0, x).

christoph-conrads commented 4 years ago

> if b = 0 then it will loop forever, unless I'm missing some subtlety related to the different zeros (in this case, was the intention that -0 < +0? According to Wikipedia, that is not the case ...)

Yes, this should read as signbit(b) == 1.

> in the a < 0 && 0 < b case, I believe there is a sign mistake:

Yes.

def uniform(a, b):
    assert a <= b

    if signbit(b) == 1:
        return -uniform(-b, -a)

    if a < 0 and b > 0:
        x = next_larger_or_equal_power_of_two(abs(a))
        y = next_larger_or_equal_power_of_two(abs(b))
        u = -uniform(0, x) with probability x/(x+y), uniform(0, y) otherwise

        check if u is in [a,b], try again if not

        return u

    while True:
        x = random01()
        d = b-a with upward rounding
        e = next_larger_or_equal_power_of_two(d)

        u = x * e + a with rounding toward zero

        if a <= u and u <= b:
            return u
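For completeness, a rough Python rendering of the core rejection loop above, for the 0 ≤ a ≤ b case only. Python offers no control over IEEE rounding modes, so the upward-rounding and round-toward-zero requirements are elided and this is only a sketch of the structure, not an exact implementation; next_pow2 is my stand-in for next_larger_or_equal_power_of_two:

```python
import math
import random

def next_pow2(d):
    # Smallest power of two >= d, for positive finite d.
    m, e = math.frexp(d)  # d == m * 2**e with 0.5 <= m < 1
    return math.ldexp(1.0, e) if m > 0.5 else d

def uniform_pos(a, b, rng):
    # Rejection loop: draw from [a, a + e) with e a power of two
    # >= b - a, and retry until the draw lands in [a, b].
    assert 0 <= a <= b
    while True:
        x = rng.random()        # x in [0, 1)
        e = next_pow2(b - a)
        u = x * e + a           # rounding-mode requirements elided
        if a <= u <= b:
            return u
```

With a = 0 and b = 0.75 this redraws whenever x lands in [0.75, 1), exactly the apple-and-pear rejection described earlier in the thread.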