popgenmethods / momi2

Infer demographic history with the Moran model
GNU General Public License v3.0
47 stars 11 forks source link

Model choice--AICc #71

Open AndreMonc opened 12 months ago

AndreMonc commented 12 months ago

Although model choice is one step beyond the Momi2 pipeline, I suspect this question will be relevant to many Momi2 users. Does the sample size term "n" in the AICc formula refer to number of sites (SNPs) or number of individuals?

jackkamm commented 11 months ago

I am not too well-versed in the theory behind AIC, so take this with a grain of salt. But, I think that "n" would be the number of SNPs, since the log-likelihood is a sum over SNPs, so it seems natural to consider each SNP as an observation.

HOWEVER, you must remember that the likelihood here is a composite likelihood, not the true likelihood. It makes the approximation that the SNPs are independent. While the MLE will still converge to the correct value (under some assumptions), I am doubtful that the AIC or AICc would be appropriate for a composite likelihood.

jackkamm commented 11 months ago

OK, looking into this more, I think Theorem 1 from this paper shows how to compute AIC for composite likelihood models: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7516723/

In particular, the k in AIC should be replaced by trace(J H^-1) term in Theorem 1. H is the Fisher information and J is the covariance of the score. Both these terms are computed already in momi2/confidence_region.py, e.g. see the godambe() function.