marcdotson / old-blog

Two brothers with a mostly Bayesian focus.
https://www.occasionaldivergences.com

Begin with Bayes #20

Open marcdotson opened 4 years ago

marcdotson commented 4 years ago

Why begin with Bayesian inference?

What do I teach?

- Probability theory
- Statistical inference

Decision Making in the Presence of Uncertainty

We live in a world filled with uncertainty. Will I pass a test? Will it rain tomorrow? Especially as we make decisions, we want to use data to reduce uncertainty. In other words, we use data to make informed decisions. Note that data reduces rather than removes uncertainty. How I perform on the practice test doesn’t guarantee how I’ll perform on the actual exam. The weather this morning tells me something about what I might expect this afternoon, but it’s not a perfect prediction. Even when I use data to inform my decision-making, I will still be making decisions in the presence of uncertainty.

Probability

Start with counting (connect with data summarization of discrete variables)...

Probability is a formal way to quantify uncertainty. A probability for a possible outcome needs to be nonnegative (i.e., zero or a positive number), and the probabilities for all possible outcomes need to sum to 1. A probability distribution is a list of all possible outcomes together with their associated probabilities. Not surprisingly, this might be easiest to see with a visualization. Note how the width or variance of a probability distribution is an expression of uncertainty. The more variance in the probability distribution, the more uncertain we are. The greatest expression of uncertainty is that every outcome is equally likely (i.e., a uniform distribution).

Example here to solidify the intuition, introduce formal probability notation.
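One possible starting point is a minimal sketch in Python (the six outcomes and the “peaked” probabilities below are made up for illustration): it checks the two rules and compares the variance of a peaked distribution with a uniform one.

```python
import numpy as np

# Hypothetical example: six possible outcomes, like the faces of a die.
outcomes = np.arange(1, 7)

# A peaked distribution (fairly sure the outcome is near 3 or 4)
# and a uniform distribution (every outcome equally likely).
peaked = np.array([0.05, 0.15, 0.30, 0.30, 0.15, 0.05])
uniform = np.full(6, 1 / 6)

for name, p in [("peaked", peaked), ("uniform", uniform)]:
    # The two rules: probabilities are nonnegative and sum to 1.
    assert np.all(p >= 0) and np.isclose(p.sum(), 1)
    mean = np.sum(outcomes * p)
    variance = np.sum((outcomes - mean) ** 2 * p)
    print(f"{name}: mean = {mean:.2f}, variance = {variance:.2f}")

# The uniform distribution has the larger variance, consistent with it being
# the greatest expression of uncertainty over these six outcomes.
```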

Statistical Inference

But how do we actually use data to inform decision-making? Sometimes how we use the data is obvious, but oftentimes we want to learn not just about the data (which we observe) but about what generated the data (which we don’t observe). We can observe data from a sample of respondents, but how can we use that to learn something about the entire population of respondents? We can observe the choices that people make, but how can we use that to learn something about the process that leads them to make those choices? To understand this unobserved, data-generating process, we need a model (i.e., a likelihood).

Prior to observing data, we have beliefs, a probability distribution over all possible outcomes. Data tells us something about how likely those possible outcomes are. The combination of our prior beliefs and the data (more explicitly, the likelihood of what we observe) results in a posterior probability distribution over the possible outcomes. Note that this is still a probability distribution — again, we have not removed uncertainty — but one that has now been informed by data. We can use this posterior distribution to help us make decisions in the presence of uncertainty.

While the language is new, the process should be intuitive:

  1. We have beliefs about the probability of certain outcomes.
  2. We observe outcomes.
  3. We update our beliefs about the probability of these outcomes.

This updating process is learning! This is how we learn. And that’s all statistical inference is – a formal way to update our beliefs/learn/quantify uncertainty.

Example here to solidify the intuition, introduce Bayes’ theorem.
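In code, the three-step process above could look like the following minimal sketch: a hypothetical example where we estimate the proportion of heads for a bent coin over a grid of possible values, with a uniform prior, a binomial likelihood (via scipy), and Bayes’ theorem applied up to the normalizing constant (posterior proportional to prior times likelihood). All names and numbers are illustrative.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical example: what proportion of flips of a bent coin land heads?
# 1. We have beliefs about the probability of each possible value (the prior).
proportions = np.linspace(0, 1, 101)   # grid of possible proportions
prior = np.full(101, 1 / 101)          # uniform: every value equally likely

# 2. We observe outcomes (say, 7 heads in 10 flips).
heads, flips = 7, 10
likelihood = binom.pmf(heads, flips, proportions)

# 3. We update our beliefs: posterior is proportional to prior * likelihood.
posterior = prior * likelihood
posterior /= posterior.sum()           # normalize so it sums to 1

print(f"posterior mean: {np.sum(proportions * posterior):.2f}")
```

The posterior is still a probability distribution over the grid of proportions; it has simply concentrated around values consistent with 7 heads in 10 flips.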

Models and Parameters

Now we need to get into the details about these three components: the prior, the likelihood, and the posterior.

The prior is a probability distribution. If we don’t know anything, we can assume a uniform distribution (or, more generally, something uninformative). However, we usually know something (even if we aren’t really certain about it), and the prior provides us with a formal way to include that knowledge. To connect this to learning, the prior could be the posterior from a previous analysis. Thus, we can see statistical inference as a way to formalize learning not just once but as it really occurs: in sequence, over time.

The likelihood is a story that describes where the data come from. This begins conceptually and is then formalized into a model that tells us the probability of any possible observation. This conceptual model often lives within a literature of model building, motivated by theory, and at its most basic may simply be a consideration of how to relax assumptions in an existing model. In short, the model needs to be consistent with our domain expertise. How do we know we have the right model? The truth is, we don’t. We don’t observe the data-generating process. The best we can do is create new models and compare them. This is science – the endless process of developing and refining theory. Fortunately, for most things, simple models work well, and there are standard models we’ll start with that are well suited to many situations.

Example of a likelihood and counting (see Statistical Rethinking p. 27, 32); introduce motivations for using normal distributions.
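A possible counting sketch along the lines of the reference above (the bag of four marbles and the observed draws are hypothetical): enumerate every way each conjecture about the bag could produce the observed sequence, which is the likelihood in miniature.

```python
from itertools import product

# Hypothetical bag of 4 marbles, each blue or white. We draw three marbles
# with replacement and observe: blue, white, blue.
observed = ("blue", "white", "blue")

# For each conjecture about how many marbles are blue, count the number of
# ways the bag could have produced exactly the observed sequence.
for n_blue in range(5):
    bag = ["blue"] * n_blue + ["white"] * (4 - n_blue)
    ways = sum(draw == observed for draw in product(bag, repeat=3))
    print(f"{n_blue} blue marbles: {ways} ways")

# Dividing each count by the total turns "ways" into a probability for each
# conjecture; a likelihood is the same idea carried over to general models.
```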

The posterior is the prior distribution, updated according to how likely the data we observe are given our model.

Show how data can overwhelm the prior.
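One way to show this is a minimal conjugate Beta-Binomial sketch (the Beta(8, 2) prior and the 0.4 success rate are assumptions for illustration, not anything from the text above): hold the prior fixed and let the sample size grow.

```python
# Hypothetical Beta-Binomial example: a fairly strong prior that the proportion
# is around 0.8 (Beta(8, 2)), while the data come in at a rate of 0.4.
prior_a, prior_b = 8, 2

for n in [0, 10, 100, 1000]:
    successes = int(0.4 * n)                 # observed successes at a 0.4 rate
    post_a = prior_a + successes             # conjugate update
    post_b = prior_b + (n - successes)
    post_mean = post_a / (post_a + post_b)   # posterior mean of a Beta
    print(f"n = {n:>4}: posterior mean = {post_mean:.2f}")

# With no data the posterior mean equals the prior mean (0.80); as n grows it
# moves toward the observed rate (0.40). The data overwhelm the prior.
```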

marcdotson commented 3 years ago