Bayesian thinking

Not sure the "Bayesian way of thinking" is clear from the previous examples. The previous chapter shows how to use Bayes' theorem, but not much about using distributions to model uncertainty. This is a good chapter to show that.
Nitpick: generalize this statement to non-scientists.
[x] done
They offer scientists an easy way to define probability models and solve them automatically.
Remove sentence
Do we know anything else? Let’s skip that question for the moment and suppose we don’t know anything else about p. This complete uncertainty also constitutes information we can incorporate into our model. How so? Because we can assign equal probability to each value of p while assigning 0 probability to the remaining values. This just means we don’t know anything and that every outcome is equally likely. Translating this into a probability distribution, it means that we are going to use a uniform prior distribution for p, and the function domain will be all numbers between 0 and 1.
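A small code sketch could make the "uniform prior" idea concrete right here. A minimal version, assuming Distributions.jl and Plots.jl (variable names are just placeholders):

using Distributions, Plots

# Complete uncertainty about p: every value in [0, 1] is equally likely
prior = Uniform(0, 1)

# Plot the prior density over its domain
plot(0:0.01:1, x -> pdf(prior, x), label="Uniform(0, 1)")
xlabel!("p")
ylabel!("Probability density")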
Change label or omit values and labels on the y-axis
[x] done
Change:
ylabel!("Probability")
Into:
ylabel!("Probability density")
Also, in Chapter 2 you used the pdf to represent distributions, but now you are using a histogram and a random sample. This may need some further explanation.
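One way to explain it would be to show both representations side by side: the normalized histogram of a random sample approximates the pdf. A rough sketch, using an arbitrary Beta(2, 5) purely for illustration:

using Distributions, Plots

d = Beta(2, 5)               # arbitrary example distribution
samples = rand(d, 10_000)    # a random sample from it

# The normalized histogram of the sample approximates the analytical pdf
histogram(samples, normalize=:pdf, alpha=0.5, label="sample histogram")
plot!(0:0.01:1, x -> pdf(d, x), lw=2, label="analytical pdf")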
Potentially confusing sentence for newcomers
[x] done
When we perform our experiment, the outcomes will be registered and in conjunction with the Bernoulli distribution we proposed as a generator of our data, we will get our Likelihood function.
May be redundant
[x] done
Change:
This function just tells us, given some chosen value of p, how likely it is that our data is generated by the Bernoulli distribution with that value of p.
Into:
This function just tells us how likely it is that our data follows the Bernoulli distribution given some chosen value of p.
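It might also help to show the likelihood as code. A minimal sketch with a hypothetical outcome vector, assuming Distributions.jl:

using Distributions

data = [1, 0, 1, 1, 0]   # hypothetical coin-flip outcomes

# Probability of observing exactly this data under a Bernoulli(p) model
likelihood(p) = prod(pdf.(Bernoulli(p), data))

likelihood(0.6)   # how likely the data is if p were 0.6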
Very confusing
[x] done
You just let randomness make it’s choice. But there exist multiple types of randomness, and this is where our prior distribution makes its play. We let her make the decision, depending on it’s particular randomness flavor. Computing the value of this likelihood function for a big number of p samples taken from our prior distribution, gives us the posterior distribution of p given our data.
The value of p is determined by the model and the data. Saying that it is determined or chosen by the randomness introduced by MCMC methods is very weird, and will generate a lot of misunderstanding. Maybe it is saying something different? I am not following what "multiple types of randomness" means, or how randomness is making choices.
The most common MCMC methods I am aware of, and their implementations, never sample from the prior; instead, they sample from a sampling distribution that is later corrected to asymptotically follow the posterior. Thus, we can also say we sample from the posterior. There are some methods that do/may sample from the prior, like Sequential Monte Carlo, so I am not saying it is impossible, just that it is not the most common scenario.
I would suggest writing something like this:
How do we compute p? There are many ways, from pen and paper to many different numerical methods. But probably the most common one is to use a Markov chain Monte Carlo (MCMC) algorithm. This is actually a family of methods, and most PPLs implement at least one of them. [if the rest of the book uses a single sampler, INSERT a mention here]. The important practical aspect we need to know at this point is that these methods return samples from the posterior distribution, and we get to answer our questions by operating on those samples.
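And maybe follow it with a minimal model so readers see where the samples come from. I am assuming Turing.jl here purely as an example; swap in whatever PPL and sampler the book actually uses:

using Turing

# Uniform prior on p, Bernoulli likelihood for each observed outcome
@model function coinflip(y)
    p ~ Uniform(0, 1)
    for i in eachindex(y)
        y[i] ~ Bernoulli(p)
    end
end

data = [1, 0, 1, 1, 0]                          # hypothetical outcomes
chain = sample(coinflip(data), NUTS(), 2_000)   # samples from the posterior of p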
Change label or omit values and labels on y-axis
[x] done
Change:
ylabel!("Probability")
Into:
ylabel!("Probability density")
Discuss uncertainty
There are a couple of mentions of uncertainty related to priors, but nothing about posteriors. Discuss this at least in terms of "how wide" the distributions are, and/or consider more formal expressions like the standard deviation and the Highest Density Interval (HDI).
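With a Uniform(0, 1) prior and a Bernoulli likelihood, the posterior is actually known in closed form (Beta(1 + successes, 1 + failures)), so a self-contained sketch of "how wide" could look like this. The equal-tailed interval below is simpler than the HDI, but makes the same point about width:

using Distributions, Statistics

# Stand-in for MCMC output: draws from the analytical posterior
# after observing 3 successes and 2 failures
posterior_samples = rand(Beta(1 + 3, 1 + 2), 10_000)

std(posterior_samples)                        # spread = uncertainty about p
quantile(posterior_samples, [0.025, 0.975])   # 95% equal-tailed credible interval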
Have you considered mentioning something about sampling convergence? (It may be just a reference to some external source.)
Maybe also make a connection with Chapter 2, where you said that to compute probabilities from a pdf we need to integrate. Show that an advantage of having samples is that we can just count and sum. Show it with a practical example, computing, say, the probability that p is larger than 0.5 or between 0.4 and 0.6 (see the sketch below).
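A sketch of what I mean, reusing the hypothetical posterior_samples vector from above:

using Statistics

# No integration needed: a probability is just a fraction of the samples
mean(posterior_samples .> 0.5)           # P(p > 0.5 | data)
mean(0.4 .< posterior_samples .< 0.6)    # P(0.4 < p < 0.6 | data)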