popgenmethods / momi2

Infer demographic history with the Moran model
GNU General Public License v3.0

How to determine the best model #27

Closed silvewheat closed 5 years ago

silvewheat commented 5 years ago

Hello, I read the "Goodness-of-fit" section of the documentation, but I am still not sure how to determine which model is best. As far as I can see, we should:

  1. Compare the 'log_likelihood' value; bigger is better.
  2. Compare "SfsModelFitStats.all_pairs_ibs"; the Z value should be close to zero (is that right?).
  3. If gene flow is included in the model, it should be validated with "SfsModelFitStats.f4" (looking only at Z(Expected-Observed)?).

Is that right?

Best,

jackkamm commented 5 years ago
  1. Compare the 'log_likelihood' value, bigger is better.

A bigger log-likelihood means a better model, but only when the number of parameters is fixed.

Adding more parameters to the model will generally increase the maximized log-likelihood, even if the simpler model is correct. This is a form of overfitting. For background on related issues, see this Wikipedia page: https://en.wikipedia.org/wiki/Akaike_information_criterion

To prevent overfitting, we should favor the simplest model that agrees with the data, not necessarily the model with the best log-likelihood. For example, you can check the summary statistics all_pairs_ibs and f4 to see if the observed values are compatible with the model.
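As a rough illustration of that compatibility check, here is a minimal sketch that flags statistics whose Z value is far from zero. It does not call momi2; the statistic names, the numbers, the standard deviations, and the cutoff of 3 are all hypothetical assumptions chosen for the example, and Z is taken to be (expected - observed) / sd.

```python
def z_score(expected, observed, sd):
    """Z(Expected-Observed): difference scaled by a standard deviation estimate."""
    if sd <= 0:
        raise ValueError("sd must be positive")
    return (expected - observed) / sd


def flag_poor_fits(stats, threshold=3.0):
    """Return names of statistics with |Z| above the threshold.

    stats maps a name to (expected, observed, sd).
    The default threshold of 3.0 is an assumption, not a momi2 default.
    """
    return [
        name
        for name, (expected, observed, sd) in stats.items()
        if abs(z_score(expected, observed, sd)) > threshold
    ]


# Hypothetical values, for illustration only.
example_stats = {
    "f4(A,B;C,D)": (0.010, 0.002, 0.002),
    "ibs(A,B)": (0.300, 0.301, 0.004),
}
flagged = flag_poor_fits(example_stats)
print(flagged)
```

A statistic that gets flagged suggests the model does not explain that aspect of the data, which may be a reason to consider a richer model.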

An alternative approach is to use the AIC linked above, or its relative the BIC (Bayesian information criterion). Both add a simple penalty to the log-likelihood based on the number of parameters.
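A minimal sketch of that penalty, using the standard definitions AIC = 2k - 2 ln L and BIC = k ln(n) - 2 ln L, where lower is better. The model names, log-likelihoods, and parameter counts below are hypothetical, and the thread does not say what to use as the sample size n for the BIC, so it is left as a plain argument here.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln L (lower is better)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fitted models: name -> (maximized log-likelihood, number of parameters).
models = {
    "no_migration": (-1050.0, 4),
    "with_migration": (-1049.5, 6),
}

aic_scores = {name: aic(ll, k) for name, (ll, k) in models.items()}
best_by_aic = min(aic_scores, key=aic_scores.get)
print(aic_scores, best_by_aic)
```

In this made-up case the model with more parameters has the higher log-likelihood, but the gain is too small to offset the penalty, so the simpler model has the lower AIC.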