Betas with small pseudocounts are useful for forcing boolean models of various stripes into regimes with very peaky posteriors. Very peaky posteriors are useful for debugging the models and one's inference schemes.
Isn't there already a probability type?
The second and third options don't appear to be exclusive.
There is a ProbabilityType, which (currently) expects its elements to be represented as a VentureNumber between 0 and 1.
It turns out that sampling from logit(Beta(a, b)) is actually quite easy for all values of a and b, because sampling from Beta(a, b) entails computing x/(x + y) for some x and y, which may not even be directly representable but only logarithmically representable themselves, and logit(x/(x + y)) = log x - log y. So we can easily adapt the algorithms
x ~ Gamma(a)
y ~ Gamma(b)
return x/(x + y),
for a > 1 or b > 1, and
u ~ U[0, 1]
v ~ U[0, 1]
x = u^(1/alpha)
y = v^(1/beta)
reject unless x + y <= 1
return x/(x + y)
for a, b <= 1 (Johnk's algorithm), to log-odds space. Full Python sketch:
import math

# Assumed primitives: uniform_01(prng) ~ U[0, 1], uniform64(prng) -> 64 random
# bits, standard_loggamma(prng, shape) ~ log of a Gamma(shape, 1) variate, and
# logsumexp (e.g. scipy.special.logsumexp).

def logit_beta(prng, alpha, beta):
    """Sample logit(p) for p ~ Beta(alpha, beta)."""
    inf = float('inf')
    # Limiting cases: Beta(alpha, beta) tends to a point mass at 0 as alpha -> 0
    # (logit -inf), to a point mass at 1 as beta -> 0 (logit +inf), and to a
    # Bernoulli(alpha/(alpha + beta)) on {0, 1} when both are negligible.
    if alpha == 0 and beta == 0:
        return -inf if uniform64(prng) & 1 else +inf
    elif alpha == 0:
        return -inf
    elif beta == 0:
        return +inf
    elif min(alpha, beta) < 1e-300:
        return +inf if uniform_01(prng) < alpha/(alpha + beta) else -inf
    if 1 < alpha or 1 < beta:
        # logit(G/(G + H)) = log G - log H for G ~ Gamma(alpha), H ~ Gamma(beta).
        log_G = standard_loggamma(prng, alpha)
        log_H = standard_loggamma(prng, beta)
        return log_G - log_H
    # Johnk's algorithm, carried out in log space.
    while True:
        u = uniform_01(prng)
        v = uniform_01(prng)
        log_x = math.log(u)/alpha
        log_y = math.log(v)/beta
        if logsumexp([log_x, log_y]) <= 0:  # i.e., x + y <= 1
            return log_x - log_y
Finally, a Bernoulli sampler with a log-odds space weight is also easy -- if p is in direct space and x its corresponding value in log-odds space, then instead of uniform_01() < p, just compute logit(uniform_01()) < x, or uniform_01() < logistic(x). I believe both logit and logistic can be easily computed efficiently with high precision -- I expect the definitions we have in backend/lite/utils.py are good enough.
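For concreteness, here's a minimal sketch of such a log-odds-weight flip and its log density. The helper names (log_logistic, log_odds_flip_log_density) and the use of Python's stdlib random are illustrative placeholders, not the utilities in backend/lite/utils.py:

import math
import random

def log_odds_flip(x):
    # Return True with probability logistic(x), where x = logit(p).
    # Comparing logit(u) < x never rounds the weight into direct space.
    u = random.random()
    if u == 0.0:               # random() is half-open at 0
        return x > -math.inf   # logit(0) = -inf, so any larger weight wins
    return math.log(u) - math.log1p(-u) < x

def log_logistic(t):
    # log(1/(1 + e^{-t})), computed stably for large |t|.
    return -math.log1p(math.exp(-t)) if t > 0 else t - math.log1p(math.exp(t))

def log_odds_flip_log_density(value, x):
    # log P(True) = log logistic(x); log P(False) = log logistic(-x).
    return log_logistic(x) if value else log_logistic(-x)

With a weight of 1 - 1e-17, x = logit(1 - 1e-17) is about 39, so log_odds_flip_log_density(False, x) comes out near -39 rather than the -inf that a direct-space weight rounded to 1 would give.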
Is Johnk's algorithm still necessary in logit space? I see it simplifies a great deal, which is pleasant :)
Also, do we still need the min(alpha, beta) < 1e-300 clause?
Provided we have a good log Gamma sampler, it appears that Johnk's algorithm may not be necessary in logit space.
Given a Gamma sampler with shape > 1, I think this is a good log Gamma sampler for shape < 1:
G = standard_gamma(prng, shape + 1)
U = uniform_01(prng)
return math.log(G) + math.log(U)/shape
(See Devroye, IX.3.6, 'Gamma variate generators when a < 1', p. 420, option (3), http://www.nrbook.com/devroye/Devroye_files/chapter_nine.pdf#42; then expand out the log of G*U^(1/alpha).)
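To make that concrete, here is one possible shape for the standard_loggamma primitive assumed in the sketch above, using numpy's Generator.standard_gamma as the "Gamma sampler with shape > 1"; the use of numpy and the exact interface are assumptions for illustration:

import math
import numpy as np

def standard_loggamma(prng, shape):
    # Sample log(G) for G ~ Gamma(shape, 1); prng is a numpy.random.Generator.
    if shape >= 1:
        # The variate itself is comfortably far from zero; log it directly.
        return math.log(prng.standard_gamma(shape))
    # Devroye IX.3.6, option (3): if G ~ Gamma(shape + 1) and U ~ U[0, 1], then
    # G * U^(1/shape) ~ Gamma(shape); working with its logarithm avoids underflow.
    G = prng.standard_gamma(shape + 1)
    U = prng.random()
    if U == 0.0:
        return -math.inf  # underflow; report the limiting value
    return math.log(G) + math.log(U) / shape

np.random.default_rng() supplies a suitable prng for this sketch.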
The 1e-300 clause is necessary to avoid overflow in computing log(U)/shape. I don't see a better way around it. Note that when 2^-256 <= U <= 1, |log(U)| is bounded above by 256 log 2 < 1e3, so if shape >= 1e-300, then |log(U)/shape| <= 1e3*1e300 = 1e303. Overflow happens a little past 1e308.
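A quick numeric sanity check of that bound (taking 2^-256 as the smallest value uniform_01 can return, as above):

import math

U = 2.0 ** -256                  # smallest uniform value assumed above
shape = 1e-300                   # smallest shape admitted by the non-degenerate path
print(abs(math.log(U) / shape))  # ~1.8e302, comfortably below the overflow threshold near 1.8e308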
However, we still want to consider min(alpha, beta) < 1 separately from min(alpha, beta) >= 1:
Further justification for splitting those two cases: The Gamma(alpha) distribution has mean alpha and standard deviation sqrt(alpha), so for very small alpha < 1 there is a wide spread relative to the magnitude of the mean and hence low danger of catastrophic cancellation, whereas for very large alpha > 1 there is a narrow spread relative to the magnitude of the mean and hence high danger of catastrophic cancellation.
Formalizing this 'danger of catastrophic cancellation' is left as an exercise for the reader, but this loose argument is enough to make me stop worrying about it.
To elaborate on the 1e-300 case: Although the log Gamma sampler I sketched above will gracefully handle all shapes, giving the correct -infinity on overflow, the difference of two infinities will be NaN, which is not something we want to yield. So if the log Gamma sampler would almost certainly overflow, then we need some way to pick one of the two infinite results.
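The failure mode is easy to see directly:

# With astronomically small shapes both log-Gamma samples underflow to -inf,
# and their difference is NaN rather than a usable +/- infinity:
log_G = float('-inf')
log_H = float('-inf')
print(log_G - log_H)  # nan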
Thoughts/arguments/satanic advocacy about option (2) vs option (3)?
Option (2) advantages:
Option (3) advantages:
(The two options remain non-exclusive, of course.)
Hm. Remind me why we can't just use log-space here? We lacked a log-space beta sampler as much as we lacked a log-odds-space beta sampler. The only advantage to log-odds-space, I think, is that inverting probabilities is a sign change rather than a log1p(-exp(x)), though I haven't thought through this very far yet.
Hm. Now that you mention it, I think log-space would serve the original purpose adequately, as log(1) is 0 and so log-space provides adequate floating point density for high-probability events. It would be more symmetric with dirichlet (because there is no natural log-odds there). Positive feature: fewer new SPs, since we already have log_flip and log_bernoulli.
To answer my question: Why can't we just use log-space here? Consider alpha = beta = 0.001, which is small but not really all that small (milliscopic but not microscopic, one might say!). Let p ~ Beta(alpha, beta), and x = log p. Any x in about (-10^-300, 0) is rounded to zero; hence any p in (e^{-10^-300}, 1) is rounded to zero in log space. The mass under Beta(alpha, beta) in that interval is the same as the mass under Beta(beta, alpha) of (0, 1 - e^{-10^-300}), which is F(1 - e^{-10^-300}) = F(-expm1(-10^-300)) ~= F(10^-300) ~= 1/4, where F is the Beta(alpha, beta) CDF as evaluated by scipy 0.13.3.
In brief, for alpha = beta = 0.001, there is a 1/4 chance of getting a log-beta sample rounded to zero. That's better than direct space -- with the same parameters, there's about a 48% chance of getting a beta sample rounded to one. But still quite high.
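Both figures can be reproduced with the same CDF symmetry trick, e.g. (any reasonably recent scipy should agree with the 0.13.3 numbers to the precision that matters here):

from scipy.stats import beta

a = b = 0.001
# P(log-space sample rounds to zero), i.e. p > exp(-1e-300):
print(beta.cdf(1e-300, b, a))      # roughly 0.25
# P(direct-space sample rounds to one), i.e. p > 1 - 2**-54:
print(beta.cdf(2.0 ** -54, b, a))  # roughly 0.48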
What about log-odds? Let p ~ Beta(alpha, beta), and y = logit(p). Any y in about (10^300, +\infty) overflows to +\infty; hence any p in (logistic(10^300), 1) overflows to +\infty in log-odds space. The probability of this is, again, F(1 - logistic(10^300)) = F(logistic(-10^300)) = F(1/(1 + e^{10^300})) ~= F(e^{-10^300}) < F(10^{-10^299}).
This number, 10^{-10^299}, is so tiny I can't hope to get scipy to evaluate F at it exactly, and I'm not really keen to whip out a 10^299-digit numerical integrator to do it myself. Approximating bounds on it -- in particular, ascertaining whether it is smaller than 2^-256 -- is left as an exercise for the numerically inclined reader. But at least this resolution is clearly better than log-space, and the probability of getting a sample this extreme is definitely less than 10^-6 -- of the millions or billions of points I've sampled from a log-odds-beta distribution, I have never seen a single value overflow to +\infty.
Here's an approximate summary, in direct space, of the range of probabilities that direct space, log space, and log-odds space can each represent in IEEE 754 double-precision floating-point without rounding to 0 or 1:

direct space: roughly 1e-308 < p < 1 - 1e-16
log space: roughly exp(-1.8e308) < p < 1 - 1e-308
log-odds space: roughly exp(-1.8e308) < p < 1 - exp(-1.8e308)
If you find it hard to feel intuition for numbers at these scales, you're not alone.
@marcoct, can you say anything about the distribution on the parameters alpha and beta that you expect to need to handle? If they are never smaller than 0.1, then log_beta (pull request #611) should be fine. If they are much smaller than 0.1, we'll need to implement log_odds_beta and log_odds_bernoulli (for which I have a draft, but it needs polish).
A beta(0.1, 0.1) seems to me like a typical initial choice for a very sparse bernoulli model. I discovered this issue in a somewhat extreme context, when trying to force sparsity in a beta-Bernoulli model with a small dataset (so I had to set the alpha and beta fairly low to get the desired sparsity in the posterior, i.e. have two extremely biased coins explain the data instead of one balanced coin). In typical settings, I wouldn't expect to set priors much more extreme than this, but that's mostly a guess. If we put broad hyper-priors on alpha and beta, it seems possible that they may be driven below 0.1. I'm not sure how frequently this will occur in practice, though.
We have effectively accepted Option 3. Option 2 remains a possibility, but not enough to keep the ticket open, in my opinion.
Consider that the region in which a double-precision floating point number rounds to 1 is not that small (everything within 2^-54 ~ 5.6e-17 of 1 rounds to 1), and therefore getting an exact 1 from a beta with moderately small pseudocounts is not that unlikely. If flip then reports the log density of getting a False from a coin flipped with such a weight, it is liable to say -inf rather than the perfectly plausible log(1e-17) = -39, which is a rather catastrophic loss of precision. This then leads to various follow-on problems, notably gross errors in the average-of-logs (not logavgexp!) that occurs in @marcoct's subjective divergence computations.

What might we do about this?

(1) The collapsed make_beta_bernoulli SP circumvents this problem because the beta is never sampled, and the uncollapsed but still conjugate make_uc_beta_bernoulli may circumvent some of the problems if queried via its marginalLogDensityOfData (which marginalizes the latent weight out).

(2) Introduce a new type, VentureProbability, that internally represents probabilities in log-odds space, thus ensuring good resolution near 0 and 1. Teach beta to emit these, which involves some numeric analysis to make a good sampler for beta in that output representation.

(3) Introduce log_odds_beta, which returns a VentureNumber in log odds. Introduce a corresponding log_odds_flip. The same numeric analysis is needed to make a good sampler for log_odds_beta. Cf. log_flip and log_bernoulli, which are for addressing the same problem but backwards (probability of true for a weight near zero).

[edited by @riastradh-probcomp to number the options]