popgenmethods / momi2

Infer demographic history with the Moran model
GNU General Public License v3.0
47 stars 11 forks source link

Difference in magnitude between expected SFS and SFS from data #64

Open trevorcousins opened 1 year ago

trevorcousins commented 1 year ago

I am analysing the fit of a model inferred by momi2, by looking at its expected SFS compared to the observed SFS (i.e. empirically from the data). I calculate the expected SFS with: expected_sfs = model_sim.expected_sfs() However the sum of expected_sfs is not the same as the sum of the observed SFS from my data. E.g. in my data the number of counts is 1,291,803 but the sum from expected_sfs = 944,199.7

Of course I can normalise each and compare those, which e.g. gives: image

But i wonder, is the fact that expected_sfs != observed_sfs indicative of a problem in my analysis/code?

It was surprising to me also that there was not a better agreement between the two curves; though of course that could indicate a poor choice of model. I realised in the paper and tutorial the fit of the model does not seem to be quanitified by residuals between expected vs observed SFS - is there a reason for this, e.g. do you think f_stats or all_pairs_ibs is more powerful? It is my understanding that using log-likelihood is not a valid metric for comparison due to non-independence of linked sites.

jackkamm commented 1 year ago

However the sum of expected_sfs is not the same as the sum of the observed SFS from my data.

By default, momi2 doesn't normalize against the total number of SNPs; rather, it uses the average number of hets to normalize the SFS.

The reason is that momi2 is meant to handle missing data by default (such as from low-coverage aDNA). With missing data it's hard to compute the expected total number of SNPs; but with some assumptions it's easier to compute how missing data affects the average number of hets.

If you prefer to normalize by total number of SNPs, use set_data(use_pairwise_diffs=False) [1]. However, this only works if you have no missing alleles in any sites.

I realised in the paper and tutorial the fit of the model does not seem to be quanitified by residuals between expected vs observed SFS - is there a reason for this, e.g. do you think f_stats or all_pairs_ibs is more powerful?

For complex models with many samples & populations I found it's difficult to get a really good model fit to the full SFS on real data. But by focusing on the F-stats we can at least try to get a model that fits decently on the first couple moments of the distribution.

If you just have a single population, then I'd suggest increasing the number of epochs/parameters in the demographic model, and see if you can get a better fit that way.

log-likelihood is not a valid metric for comparison due to non-independence of linked sites

Depends what you mean by "valid". Under certain limiting conditions, the composite-log-likelihood should be statistically consistent (when there's many "independent-enough" blocks). Outside those conditions, it won't be consistent, but then I don't think any other estimator/loss function based on the SFS is either.

[1] https://momi2.readthedocs.io/en/latest/api.html#momi.DemographicModel.set_data