Phrasing of probability axioms

vasishth / bayescogsci

Draft of book entitled An Introduction to Bayesian Data Analysis for Cognitive Science by Nicenboim, Schad, Vasishth

100 stars 27 forks source link

Phrasing of probability axioms #63

Closed PhilipLoewen closed 1 month ago

PhilipLoewen commented 2 months ago

Early in Section 1.1, the key terms "event" and "sample space" are not defined, and this leaves room for concern regarding the following bullet:

The probabilities of all possible events in the entire sample space must sum up to 1.

If we are rolling a standard 6-sided die, I think "rolling an even number", "rolling an odd number", and "rolling a 6", should all be considered legitimate "events". However, their probabilities (1/2, 1/2, and 1/6, respectively) sum up to a number larger than 1. And there are other events not yet mentioned.

Is there a way to eliminate this misinterpretation while retaining the informal/intuitive tone of this section?

vasishth commented 2 months ago

I will make an attempt to do this. I thought I had defined these terms but maybe I'm mistaken. Thanks for catching this inconsistency.

vasishth commented 2 months ago

I think I'd be forced to become formal from the outset.

I will need to define an outcome by example to keep it connected to reality and not sound like a math textbook that leaves the reader to apply the concepts. For example,

In a coin tossed only once, the possible outcomes are Heads and Tails. In 6-sided die, the possible outcomes in a single toss are 1,2,3,4,5,6. The outcomes of an experiment such as tossing a coin once, or throwing a die once, is called the sample space S. So, for the single-coin-toss experiment, the sample space S is {H,T}; for the die-throwing experiment, the sample space S is {1,2,3,4,5,6}. If a coin-toss experiment consists of a coin tossed three times, then the sample space would be S={TTT,HHH,THH,...}, i.e., all the possible permutations of Heads and Tails that we can get if we toss a coin three times.

An event is defined as a subset of the sample space. For example, when tossing the coin three times, the event of getting 1 heads is the subset of S {HTT,THT,TTH}. When we carry out the experiment of tossing a coin three times, we will say that a particular event (such as the subset {HTT,THT,TTH}) occurred if the outcome of the experiment is a member of that event. So, if HTT occurred in a single experiment, then we say that the event {HTT,THT,TTH} occurred.

One can define a function--call it a probability function P--that assigns a probability (a value between 0 and 1) to each event.

For example, in a single coin toss experiment, P could be P({Heads})=P({Tails}=0.5. The probability of all possible outcomes in the sample space S must sum up to 1.

Would you consider this to be OK?

Thanks for catching this.

PhilipLoewen commented 2 months ago

Thanks for taking my comment seriously (and so soon)!

I support not writing yet another math textbook: the niche you are targeting with this work is clear in the Preface, and I respect it. For this reason, I am trying to think up an even more succinct way to resolve this issue.

Please don't rush into making changes. I spent some time writing the response below, but it does not stand up to critical reflection. Instead of vaporizing it, let me leave it here as a secondary branch. I will try to do better and come back tomorrow.

Your idea of introducing the technical term "outcome" to mean a single point in the sample space would make the bullet-point I queried fixable just by changing the word "event" to "outcome". You might even try postponing the definition of these words by promising the reader that a discussion of the terminology will come later in the same section. (Alternatively, you could move the bulleted axioms into the "math notation" part of the presentation.)

Saying, "An event is defined as a subset of the sample space," sounds a lot like pure math. How about some variant of the following?

An event is a combination of one or more outcomes. For example, when tossing a coin three times, "getting 1 head" is an event that describes each of the outcomes HTT, THT, and TTH. "Getting 3 heads" is an event associated with the single outcome HHH. Mathematically, an event is defined as a subset of the sample space: the examples just mentioned would be written as {HTT, THT, TTH} and {HHH}.

Math hard-liners will make a distinction between the static probability function ${\mathbb P}$ that returns a number for each event and the more dynamic-sounding experimental process of taking samples from the corresponding distribution and thereby producing a random variable $Y$. Thus it may be that you want to have a named random variable at play even at this stage of the presentation. I would need to think more about how to communicate this, but a formula expressing the nuance I have in mind would read something like this:

${\mathbb P}_Y(\lbrace HTT,THT,TTH\rbrace) = {\mathbb P}(Y\in\lbrace HTT,THT,TTH\rbrace)$.

vasishth commented 2 months ago

Thank you for helping me think of a way out of this thorny problem of explaining probability theory axioms correctly without descending into formalism. I have had some 4000+ people take my course on the first five chapters so far, and there have been some two or three people who caught my inconsistencies and outright misleading statements in chapter 1 (which in itself is interesting, the low number). I think there is only one way out: start with an example like three coin tosses and stick with it to motivate why we make a distinction between outcome and event. In my experience, when talking with "outsiders" (non-math people), I can get away with some informality here, and they will not even notice that need for making a distinction between $P$ and $P_Y$. Some hardcore readers even want me to introduce measure theory (albeit informally), but I refuse to do go that far, and from what you wrote, I think you'd agree :).

Just saying that I am going to take some heat no matter how I write this chapter. But I do agree with your comment that I will cause a lot of confusion with the current text in the book and that it needs fixed.

PhilipLoewen commented 2 months ago

How gratifying it must be to have 4000+ satisfied readers!

An alternative to the final bullet would be simply to assert that there exists some event with probability 1.

Since an earlier bullet asserts that all probability values must land in the interval [0,1], this would be decisive. If you need to weave an explanation into the later text, you could say that $Y\in S$ is an event, and the suggested statement implies ${\mathbb P}(Y\in S) = 1$. This is a succinct rendering of the current final bullet.

(Reason [optional]: For any set $A\subset S$, ${\mathbb P}(Y\in A) \le {\mathbb P}(Y\in S) \le 1$, so finding a set $A$ with ${\mathbb P}(Y\in A)=1$ leaves no other option for ${\mathbb P}(Y\in S)$.)

Tangential: It's interesting to see how the math symbols echo the two perspectives on probability theory described earlier in prose. On one hand we have random variables like $Y$ from which we can draw observations and data, hoping to make predictions ("frequentist"), and on the other we have uncertain parameters like $\theta$ that are impossible to observe directly, yet somehow there is insight to be gained by treating these with the tools of probability ("Bayesian"). It might be nice to say something like this late in Section 1.1, as a callback to the themes of the general introduction.

vasishth commented 1 month ago

there is insight to be gained by treating these with the tools of probability ("Bayesian"). It might be nice to say something like this late in Section 1.1, as a callback to the themes of the general introduction.

I think you overestimate the ability of the target readership to absord this subtle point, at least at this stage of the book :).

However, I have adopted your rephrasing of the last bullet point. Thanks for this suggestion!

vasishth commented 1 month ago

@PhilipLoewen here is a revision of section 1.1.

http://bruno.nicenboim.me/bayescogsci/ch-intro.html#introprob

Note that I am side-stepping formalism to the extent I can; it just causes confusion among newcomers.

PhilipLoewen commented 1 month ago

Thanks for inviting my comments. If you have the stomach for a few more changes, let's dive in. I'll split up my submissions to make it easy to respond one by one ... should you choose to.

At some point the text proposes the ordered-triple notation $(\Omega,F,P)$ for a probability space, but earlier in the story the text says we need to understand 4 things to make the definition. The mismatch between 3 and 4 is a little uncomfortable. So I am groping for minor tweaks to the following current text:

Four terms have to be introduced here before we introduce the probability axioms: an outcome ω, a sample space Ω, which is the set of all possible outcomes, an event E, and a set F that contains all possible subsets of the sample space Ω.

How about something like the following? This lays down the big 3 ingredients first, then brings in the subsidiary notation and terminology later. It's longer than the original, but that's because it front-loads some words mentioned later. The overall burden is not heavier (according to me):

The formal axioms of probability involve three ingredients: a general set Ω known as the sample space, a collection of subsets of Ω denoted by F and called the event space, and a real-valued function named P that assigns a number to each set in the family F. The points in Ω are typically denoted ω; these are called outcomes. Sets in the family F are often denoted by E, and called events. The simplest events are the single-element sets of the form {ω} for different choices of ω in Ω: these are the elementary events.

PhilipLoewen commented 1 month ago

It's important indeed not to scare away readers whose primary interest is not mathematics.

Unfortunately, the event space F associated with a particular sample space Ω is a subtle thing. When Ω is a finite set, gathering up all subsets of Ω into a big set is natural and fully legitimate. But there is scope for refinement even here: the technical definition for a conditional probability involves putting the spotlight on a relevant sub-collection of this basic choice. Worse, when Ω is not a finite set, the collection of all subsets is "too big" in a sense that can be made precise. Embedded in the full axioms of probability is the requirement that the event space F in (Ω,F,P) should be a $\sigma$-algebra ... not just an arbitrary bag of subsets selected from Ω.

For the target audience, the goal should be to stay focussed on practicalities while avoiding oversimplification that could introduce errors. So I think we should avoid the following phrase, which does currently appear in the general context:

and a set F that contains all possible subsets of the sample space Ω.

Rather, we should offer a polite nod in the abstract direction and then move on as rapidly as possible. Let me take a shot at this:

When the sample space Ω is finite, any subset of Ω can be an event; in this case, the event space F is the collection of all possible subsets of Ω. For a standard 6-sided die, for example, the sample space is the set Ω={1,2,3,4,5,6} and the event space F will contain $2^6=64$ sets: the empty set, 6 one-element sets, 15 two-element sets, 20 three-element sets, ..., and one six-element set. Other choices of F are compatible with the axioms, and when the set Ω is not finite such subtleties become unavoidable. Here are three mild assumptions we will always make about any event space:

Both the empty set $\emptyset = \lbrace\rbrace$ and universal set $\Omega$ belong to the event space $F$.

If $E$ is an event, then so is the complement of E. (This is the set $E^c = \lbrace \omega\in\Omega :\,\omega\not\in E\rbrace$

For any list of events $A_1,A_2,\ldots$ (finite or infinite), the phrase "$A_1$ or $A_2$ or $\ldots$" describes another event. In symbols, $F$ includes the set $\lbrace{\omega\in\Omega :\, \omega\in A_1\ \text{or}\ \omega\in A_2\ \text{or}\ \ldots}\rbrace$.

For much of this text, it will be safe to rely on the intuition gained in the case where $\Omega$ is a finite set and every subset of $\Omega$ belongs to $F$.

PhilipLoewen commented 1 month ago

In the formal bullet-list of probability axioms, I suggest repeating some of the explanatory terminology. Here's the current writeup:

The probability axioms are as follows. The axioms define a function P that takes as input an element of the set F and returns a real value subject to the following constraints.

For every element of F, the probability P is a real number greater than or equal to zero.

The probability that at least one of the elementary events occurs is 1.

If A1,A2,A3,... are mutually exclusive events (in other words, thesse events have no elements in common), then the probability of A1, or A2, or A3,… occurring is the sum of the probability of A1 occurring, of A2 occurring, of A3 occurring, … (this sum could be finite or infinite).

A probability space is defined as (Ω,F,P).

Here's how they could be rearranged without loss of precision.

The probability axioms refer to the sample space Ω, event space F, and probability P as follows.

For every event E in the event space F, the probability P(E) is a real number between 0 and 1.

The event E=Ω belongs to F, and P(Ω)=1.

If the events $A_1,A_2,A_3,\ldots$ are mutually exclusive (in other words, if no two of these subsets of Ω ever overlap), then the probability of the event "one of A1 or A2 or A3 or …" is given by the sum $P(A_1)+P(A_2)+P(A_3)+\cdots$.

PhilipLoewen commented 1 month ago

Unfortunately I have to leave this discussion now and feed my other projects. I hope something useful can be extracted from the above.

vasishth commented 1 month ago

These are real improvements on my attempt; I adopted most of what you suggested, with minor wording edits, and have acknowledged your help with this (hope it is OK to use your text). The text dangerously verges toward formalism; I think it's very hard to write accessibly for non-mathematicians. Either precision has to be given up, or comprehensibility.