mml-book / mml-book.github.io

Companion webpage to the book "Mathematics For Machine Learning"

Probability of observing x = 0 #643

Open CL-BZH opened 3 years ago

CL-BZH commented 3 years ago

Describe the mistake On page 266 it is written "the likelihood provides the probability of observing x". This is somewhat confusing, since for a continuous r.v., P(X = x) = 0.

Location

  1. version: Draft (2021-01-14)
  2. Chapter: 8.3.1 Maximum Likelihood Estimation
  3. page: 266
  4. line number: 9

Proposed solution Well, I have no easy solution to propose, since making sense of the MLE in the non-discrete case is not trivial. Maybe advise people to read Michael Evans' book "Measuring Statistical Evidence Using Relative Belief" (at least chapter 1.4, "Infinity and Continuity in Statistics"). The book (available online) by Michael J. Evans and Jeffrey S. Rosenthal, "Probability and Statistics - The Science of Uncertainty", is also pretty nice (and easier than the one mentioned above). See chapter 6: http://www.utstat.toronto.edu/mikevans/jeffrosenthal/chap6.pdf


AlbertoGuastalla commented 3 years ago

I think they mean the probability density function (pdf) here.

mpd37 commented 3 years ago

That's a good point.

What about the following:

In other words, once we have chosen the type of function we want as a predictor, the likelihood is the probability density function of the observed data x given \theta.

CL-BZH commented 3 years ago

Hi Marc,

Thanks a lot for your reply.

I'm sorry but “the likelihood is the probability density function of the observed data x given θ” sounds weird to me. For observed data, there is no pdf (only a value).

I also disagree with the sentence in the next paragraph, "It tells us how likely a particular setting of θ is for the observations x". The likelihood L(θ) doesn't indicate a level of trust in a particular value of θ. Whether the value of L(θ) is small or large doesn't indicate how likely θ is. It is the fact that, for a given θ = θ1, L(θ1) > L(θ) for any other value of θ, that lends support to this particular value θ1.

So, I think it should be emphasized that likelihood is about interpreting an order imposed on the values of θ. In the discrete case it is easy: it is just P(x | θ1) > P(x | θ2), and we seek the θ that maximizes the probability of the observed data x. In the continuous case, we might say that the order on θ is given by the integral of p(x|θ) over small intervals containing x. (Maybe it would help to mention that for a continuous variable (e.g., the speed of a vehicle) an observed value is always an average (an integral).)
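To make the "ordering" point concrete, here is a minimal numeric sketch (not from the book, and the Normal(θ, 1) model and data values are my own hypothetical choices): the likelihood value itself is not a probability of θ, but comparing L(θ1) against L(θ2) expresses a preference between parameter settings.

```python
import math

def log_likelihood(theta, data, sigma=1.0):
    """Log-likelihood of a Normal(theta, sigma) model for observed data.

    Hypothetical illustration: data are treated as fixed observed values,
    so this is a function of theta only, not of a random variable.
    """
    return sum(
        -0.5 * math.log(2 * math.pi * sigma**2)
        - (x - theta) ** 2 / (2 * sigma**2)
        for x in data
    )

data = [1.9, 2.1, 2.0, 1.8, 2.2]  # fixed observations; no r.v. remains

# The individual values of log_likelihood carry no direct "probability of
# theta" meaning; only the ordering between settings does. Here the sample
# mean (2.0) is ordered above a setting far from the data (0.0):
assert log_likelihood(2.0, data) > log_likelihood(0.0, data)
```

The assertion is exactly the L(θ1) > L(θ2) comparison described above; the MLE is just the θ at the top of that ordering.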

Sorry that I cannot come up with a good solution. Best regards, Chris.

mpd37 commented 3 years ago

Good points. I would, however, still want to point out that the likelihood is a pdf (but not in \theta); in that particular paragraph, the data is considered a random variable. However, for a given dataset, it is just a value (as you said). The model comparison point you make is good.

What do you think of the following: "... It is a distribution that models the uncertainty of the data for a given parameter setting. For a given dataset x, the likelihood allows us to express preferences about different settings of the parameters θ, and we can choose the setting that most 'likely' generated the data."

CL-BZH commented 3 years ago

I think that is good.

I would like to add something. Paragraph 8.3.1 is pretty confusing (sorry). For someone new to the topic, it is hard to know when x is a random variable and when it is the observed data. The sentence that defines the negative log-likelihood, "For data represented by a random variable x and for a family of probability densities p(x|θ)...", sounds wrong to me. The likelihood function is defined once data are observed (i.e., x is a value of the r.v. X; in Lx(θ) there is no random variable anymore).

Below I re-wrote it using X for the r.v. and x for the data, adding "once data are observed (X=x)":

For data represented by a random variable X and for a family of probability densities p(X|θ) parametrized by θ, once data are observed (i.e., X = x), the negative log-likelihood is given by Lx(θ) = −log p(x|θ).
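The definition above can be sketched numerically. This is a hypothetical illustration (a Normal(θ, 1) model of my own choosing, not the book's example): the pdf value p(x|θ) is evaluated at a fixed observation x, so Lx(θ) is an ordinary function of θ with no random variable left in it.

```python
import math

def neg_log_likelihood(theta, x, sigma=1.0):
    """L_x(theta) = -log p(x | theta) for a Normal(theta, sigma) model.

    Hypothetical sketch: x is a fixed observed value (X = x has already
    happened), so this is a deterministic function of theta.
    """
    pdf = math.exp(-(x - theta) ** 2 / (2 * sigma**2)) / math.sqrt(
        2 * math.pi * sigma**2
    )
    return -math.log(pdf)

x_obs = 1.5  # the observed value; no random variable remains in L_x(theta)

# Minimizing the negative log-likelihood is equivalent to maximizing the
# likelihood, so theta = x_obs is preferred over theta = 0.0 here:
assert neg_log_likelihood(1.5, x_obs) < neg_log_likelihood(0.0, x_obs)
```

Since −log is strictly decreasing, the ordering on θ given by the NLL is the reverse of the ordering given by the likelihood itself, which is why minimizing one is the same as maximizing the other.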

In the same way, I replaced p(x|θ) with p(X|θ) in the sentence "Let us interpret what the probability density p(X|θ) is modeling for a fixed value of θ".