rmcelreath / stat_rethinking_2022

Statistical Rethinking course winter 2022
4.11k stars 444 forks source link

Posterior probability does not integrate to 1 #3

Open monga opened 2 years ago

monga commented 2 years ago

The last line of slide 61 (https://speakerdeck.com/rmcelreath/statistical-rethinking-2022-lecture-02?slide=61) and in the book R code 3.2 (and R code 2.3) uses a standardization rule different from the one used for prior probability.

As explained by the Overthinking box at page 35 of the book, prior is an array of ones, since the important property is that it integrates to one over p_grid. The sum of the values of prior is indeed much greater than 1 (20 in code 2.3, 1000 in code 3.2).

The standardization used for posterior instead guarantees that sum(posterior) == 1, while the integral over p_grid is less than one.

This is not relevant for the shape of the posterior curve, but the asymmetry bothers me. I believe the right statement to use in 3.2 is

posterior <- (posterior / sum(posterior))*length(posterior)

then sum(posterior) == sum(prior) and both their integrals over p_grid should be 1.

rmcelreath commented 2 years ago

Imagine a rectangle with width 1 and height 1. The area is 1x1=1. That is the uniform density from p=0 to p=1.

When you do the grid approximation, you turn the continuous density into a discrete probability mass distribution. That is why you are finding the normalization step necessary to get it to sum to 1. But it is still true that Pr(p)=1 for all values of p for p ~ uniform(0,1).

monga commented 2 years ago

That's fine, thank you. What I want to say, however, is that the code uses two different approaches for the prior and posterior arrays. While for posterior ones we can assume the sum(posterior) == 1 invariant, this is not true for prior arrays. Indeed the invariant is even required by some sampling functions. In python, for example, you could write:

numpy.random.choice(p_grid, size=int(1e4), replace=True, p=posterior)

but

numpy.random.choice(p_grid, size=int(1e4), replace=True, p=prior)

would raise an exception since sum(prior) != 1. (The R sample function seems more tolerant)