Betas with small pseudocounts are useful for forcing boolean models of various stripes into regimes with very peaky posteriors. Very peaky posteriors are useful for debugging the models and one's inference schemes.
Isn't there already a probability type?
The second and third options don't appear to be exclusive.
There is a ProbabilityType, which (currently) expects its elements to be represented as a VentureNumber between 0 and 1.
It turns out that sampling from logit(Beta(a, b)) is actually quite easy for all values of a and b, because sampling from Beta(a, b) entails computing x/(x + y) for some x and y, which may not even be directly representable but only logarithmically representable themselves, and logit(x/(x + y)) = log x - log y. So we can easily adapt the algorithms
x ~ Gamma(a)
y ~ Gamma(b)
return x/(x + y),
for a > 1 or b > 1, and
u ~ U[0, 1]
v ~ U[0, 1]
x = u^(1/alpha)
y = v^(1/beta)
reject unless x + y <= 1
return x/(x + y)
for a, b <= 1 (Johnk's algorithm), to log-odds space. Full Python sketch:
import math

# Assumed primitives: uniform_01(prng) ~ U[0, 1], uniform64(prng) -> 64 random
# bits, standard_loggamma(prng, shape) ~ log of a Gamma(shape, 1) variate, and
# logsumexp (e.g. scipy.special.logsumexp).

def logit_beta(prng, alpha, beta):
    """Sample logit(p) for p ~ Beta(alpha, beta)."""
    inf = float('inf')
    # Limiting cases: Beta(alpha, beta) tends to a point mass at 0 as alpha -> 0
    # (logit -inf), to a point mass at 1 as beta -> 0 (logit +inf), and to a
    # Bernoulli(alpha/(alpha + beta)) on {0, 1} when both are negligible.
    if alpha == 0 and beta == 0:
        return -inf if uniform64(prng) & 1 else +inf
    elif alpha == 0:
        return -inf
    elif beta == 0:
        return +inf
    elif min(alpha, beta) < 1e-300:
        return +inf if uniform_01(prng) < alpha/(alpha + beta) else -inf
    if 1 < alpha or 1 < beta:
        # logit(G/(G + H)) = log G - log H for G ~ Gamma(alpha), H ~ Gamma(beta).
        log_G = standard_loggamma(prng, alpha)
        log_H = standard_loggamma(prng, beta)
        return log_G - log_H
    # Johnk's algorithm, carried out in log space.
    while True:
        u = uniform_01(prng)
        v = uniform_01(prng)
        log_x = math.log(u)/alpha
        log_y = math.log(v)/beta
        if logsumexp([log_x, log_y]) <= 0:  # i.e., x + y <= 1
            return log_x - log_y
Finally, a Bernoulli sampler with a log-odds space weight is also easy -- if p is in direct space and x its corresponding value in log-odds space, then instead of uniform_01() < p, just compute logit(uniform_01()) < x, or uniform_01() < logistic(x). I believe both logit and logistic can be easily computed efficiently with high precision -- I expect the definitions we have in backend/lite/utils.py are good enough.
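For concreteness, here's a minimal sketch of such a log-odds-weight flip and its log density. The helper names (log_logistic, log_odds_flip_log_density) and the use of Python's stdlib random are illustrative placeholders, not the utilities in backend/lite/utils.py:

import math
import random

def log_odds_flip(x):
    # Return True with probability logistic(x), where x = logit(p).
    # Comparing logit(u) < x never rounds the weight into direct space.
    u = random.random()
    if u == 0.0:               # random() is half-open at 0
        return x > -math.inf   # logit(0) = -inf, so any larger weight wins
    return math.log(u) - math.log1p(-u) < x

def log_logistic(t):
    # log(1/(1 + e^{-t})), computed stably for large |t|.
    return -math.log1p(math.exp(-t)) if t > 0 else t - math.log1p(math.exp(t))

def log_odds_flip_log_density(value, x):
    # log P(True) = log logistic(x); log P(False) = log logistic(-x).
    return log_logistic(x) if value else log_logistic(-x)

With a weight of 1 - 1e-17, x = logit(1 - 1e-17) is about 39, so log_odds_flip_log_density(False, x) comes out near -39 rather than the -inf that a direct-space weight rounded to 1 would give.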
Is Johnk's algorithm still necessary in logit space? I see it simplifies a great deal, which is pleasant :)
Also, do we still need the min(alpha, beta) < 1e-300 clause?
Provided we have a good log Gamma sampler, it appears that Johnk's algorithm may not be necessary in logit space.
Given a Gamma sampler with shape > 1, I think this is a good log Gamma sampler for shape < 1:
G = standard_gamma(prng, shape + 1)
U = uniform_01(prng)
return math.log(G) + math.log(U)/shape
(See Devroye, IX.3.6, 'Gamma variate generators when a < 1', p. 420, option (3), http://www.nrbook.com/devroye/Devroye_files/chapter_nine.pdf#42; then expand out the log of G*U^(1/alpha).)
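To make that concrete, here is one possible shape for the standard_loggamma primitive assumed in the sketch above, using numpy's Generator.standard_gamma as the "Gamma sampler with shape > 1"; the use of numpy and the exact interface are assumptions for illustration:

import math
import numpy as np

def standard_loggamma(prng, shape):
    # Sample log(G) for G ~ Gamma(shape, 1); prng is a numpy.random.Generator.
    if shape >= 1:
        # The variate itself is comfortably far from zero; log it directly.
        return math.log(prng.standard_gamma(shape))
    # Devroye IX.3.6, option (3): if G ~ Gamma(shape + 1) and U ~ U[0, 1], then
    # G * U^(1/shape) ~ Gamma(shape); working with its logarithm avoids underflow.
    G = prng.standard_gamma(shape + 1)
    U = prng.random()
    if U == 0.0:
        return -math.inf  # underflow; report the limiting value
    return math.log(G) + math.log(U) / shape

np.random.default_rng() supplies a suitable prng for this sketch.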
The 1e-300 clause is necessary to avoid overflow in computing log(U)/shape. I don't see a better way around it. Note that when 2^-256 <= U <= 1, |log(U)| is bounded above by 256 log 2 < 1e3, so if shape >= 1e-300, then |log(U)/shape| <= 1e3*1e300 = 1e303. Overflow happens a little past 1e308.
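A quick numeric sanity check of that bound (taking 2^-256 as the smallest value uniform_01 can return, as above):

import math

U = 2.0 ** -256                  # smallest uniform value assumed above
shape = 1e-300                   # smallest shape admitted by the non-degenerate path
print(abs(math.log(U) / shape))  # ~1.8e302, comfortably below the overflow threshold near 1.8e308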
However, we still want to consider min(alpha, beta) < 1 separately from min(alpha, beta) >= 1:
Further justification for splitting those two cases: The Gamma(alpha) distribution has mean alpha and standard deviation sqrt(alpha), so for very small alpha < 1 there is a wide spread relative to the magnitude of the mean and hence low danger of catastrophic cancellation, whereas for very large alpha > 1 there is a narrow spread relative to the magnitude of the mean and hence high danger of catastrophic cancellation.
Formalizing this 'danger of catastrophic cancellation' is left as an exercise for the reader, but this loose argument is enough to make me stop worrying about it.
To elaborate on the 1e-300 case: Although the log Gamma sampler I sketched above will gracefully handle all shapes, giving the correct -infinity on overflow, the difference of two infinities will be NaN, which is not something we want to yield. So if the log Gamma sampler would almost certainly overflow, then we need some way to pick one of the two infinite results.
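The failure mode is easy to see directly:

# With astronomically small shapes both log-Gamma samples underflow to -inf,
# and their difference is NaN rather than a usable +/- infinity:
log_G = float('-inf')
log_H = float('-inf')
print(log_G - log_H)  # nan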
Thoughts/arguments/satanic advocacy about option (2) vs option (3)?
Option (2) advantages:
Option (3) advantages:
(The two options remain non-exclusive, of course.)
Hm. Remind me why we can't just use log-space here? We lacked a log-space beta sampler as much as we lacked a log-odds-space beta sampler. The only advantage to log-odds-space, I think, is that inverting probabilities is a sign change rather than a log1p(-exp(x)), though I haven't thought through this very far yet.
Hm. Now that you mention it, I think log-space would serve the original purpose adequately, as log(1) is 0 and so log-space provides adequate floating point density for high-probability events. It would be more symmetric with dirichlet (because there is no natural log-odds there). Positive feature: fewer new SPs, since we already have log_flip and log_bernoulli.
To answer my question: Why can't we just use log-space here? Consider alpha = beta = 0.001, which is small but not really all that small (milliscopic but not microscopic, one might say!). Let p ~ Beta(alpha, beta), and x = log p. Any x in about (-10^-300, 0) is rounded to zero; hence any p in (e^{-10^-300}, 1) is rounded to zero in log space. The mass under Beta(alpha, beta) in that interval is the same as the mass under Beta(beta, alpha) of (0, 1 - e^{-10^-300}), which is F(1 - e^{-10^-300}) = F(-expm1(-10^-300)) ~= F(10^-300) ~= 1/4, where F is the Beta(alpha, beta) CDF as evaluated by scipy 0.13.3.
In brief, for alpha = beta = 0.001, there is a 1/4 chance of getting a log-beta sample rounded to zero. That's better than direct space -- with the same parameters, there's about a 48% chance of getting a beta sample rounded to one. But still quite high.
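Both figures can be reproduced with the same CDF symmetry trick, e.g. (any reasonably recent scipy should agree with the 0.13.3 numbers to the precision that matters here):

from scipy.stats import beta

a = b = 0.001
# P(log-space sample rounds to zero), i.e. p > exp(-1e-300):
print(beta.cdf(1e-300, b, a))      # roughly 0.25
# P(direct-space sample rounds to one), i.e. p > 1 - 2**-54:
print(beta.cdf(2.0 ** -54, b, a))  # roughly 0.48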
What about log-odds? Let p ~ Beta(alpha, beta), and y = logit(p). Any y in about (10^300, +\infty) overflows to +\infty; hence any p in (logistic(10^300), 1) overflows to +\infty in log-odds space. The probability of this is, again, F(1 - logistic(10^300)) = F(logistic(-10^300)) = F(1/(1 + e^{10^300})) ~= F(e^{-10^300}) < F(10^{-10^299}).
This number, 10^{-10^299}, is so tiny I can't hope to get scipy to evaluate F at it exactly, and I'm not really keen to whip out a 10^299-digit numerical integrator to do it myself. Approximating bounds on it -- in particular, ascertaining whether it is smaller than 2^-256 -- is left as an exercise for the numerically inclined reader. But at least this resolution is clearly better than log-space, and the probability of getting a sample this extreme is definitely less than 10^-6 -- of the millions or billions of points I've sampled from a log-odds-beta distribution, I have never seen a single value overflow to +\infty.
Here's an approximate summary, in direct space, of the range of probabilities that direct space, log space, and log-odds space can each represent in IEEE 754 double-precision floating-point without rounding to 0 or 1:

direct space: roughly 1e-308 < p < 1 - 1e-16
log space: roughly exp(-1.8e308) < p < 1 - 1e-308
log-odds space: roughly exp(-1.8e308) < p < 1 - exp(-1.8e308)
If you find it hard to feel intuition for numbers at these scales, you're not alone.
@marcoct, can you say anything about the distribution on the parameters alpha and beta that you expect to need to handle? If they are never smaller than 0.1, then log_beta (pull request #611) should be fine. If they are much smaller than 0.1, we'll need to implement log_odds_beta and log_odds_bernoulli (for which I have a draft, but it needs polish).
A beta(0.1, 0.1) seems to me like a typical initial choice for a very sparse bernoulli model. I discovered this issue in a somewhat extreme context, when trying to force sparsity in a beta-Bernoulli model with a small dataset (so I had to set the alpha and beta fairly low to get the desired sparsity in the posterior, i.e. have two extremely biased coins explain the data instead of one balanced coin). In typical settings, I wouldn't expect to set priors much more extreme than this, but that's mostly a guess. If we put broad hyper-priors on alpha and beta, it seems possible that they may be driven below 0.1. I'm not sure how frequently this will occur in practice, though.
We have effectively accepted Option 3. Option 2 remains a possibility, but not enough to keep the ticket open, in my opinion.
Consider that the region in which a double-precision floating point number rounds to 1 is not that small (everything within 2^-54 ~ 5.6e-17 of 1 rounds to 1), and therefore getting an exact 1 from a beta with moderately small pseudocounts is not that unlikely. If flip then reports the log density of getting a False from a coin flipped with such a weight, it is liable to say -inf rather than the perfectly plausible log(1e-17) = -39, which is a rather catastrophic loss of precision. This then leads to various follow-on problems, notably gross errors in the average-of-logs (not logavgexp!) that occurs in @marcoct's subjective divergence computations.

What might we do about this?

(1) The collapsed make_beta_bernoulli SP circumvents this problem because the beta is never sampled, and the uncollapsed but still conjugate make_uc_beta_bernoulli may circumvent some of the problems if queried via its marginalLogDensityOfData (which marginalizes the latent weight out).

(2) Introduce a new type, VentureProbability, that internally represents probabilities in log-odds space, thus ensuring good resolution near 0 and 1. Teach beta to emit these, which involves some numeric analysis to make a good sampler for beta in that output representation.

(3) Introduce log_odds_beta, which returns a VentureNumber in log odds. Introduce a corresponding log_odds_flip. The same numeric analysis is needed to make a good sampler for log_odds_beta. Cf. log_flip and log_bernoulli, which are for addressing the same problem but backwards (probability of true for a weight near zero).

[edited by @riastradh-probcomp to number the options]