Open trevorcousins opened 1 year ago
However the sum of expected_sfs is not the same as the sum of the observed SFS from my data.
By default, momi2 doesn't normalize against the total number of SNPs; rather, it uses the average number of hets to normalize the SFS.
The reason is that momi2 is meant to handle missing data by default (such as from low-coverage aDNA). With missing data it's hard to compute the expected total number of SNPs; but with some assumptions it's easier to compute how missing data affects the average number of hets.
If you prefer to normalize by the total number of SNPs, use set_data(use_pairwise_diffs=False) [1]. However, this only works if you have no missing alleles at any site.
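To make the two normalization conventions concrete, here is a minimal numpy sketch (not momi2's actual internals, and the toy arrays are made up): one option rescales the expected SFS to the observed number of SNPs, the other rescales it so the implied mean number of pairwise differences matches the observed one. The `pairwise_diffs` helper is hypothetical, computing the standard pi-from-SFS weighting for a single population.

```python
import numpy as np

def pairwise_diffs(sfs):
    """Mean pairwise differences (pi) implied by a 1-population SFS.

    sfs[i] = number of sites where the derived allele appears in
    i out of n sampled haplotypes (entries 1..n-1 are segregating).
    """
    n = len(sfs) - 1                            # haploid sample size
    i = np.arange(n + 1)
    weights = i * (n - i) / (n * (n - 1) / 2)   # per-site het probability
    return np.sum(weights * np.asarray(sfs, dtype=float))

# Toy observed and (unnormalised) expected spectra for n = 4 haplotypes.
observed = np.array([0, 60, 25, 15, 0], dtype=float)
expected = np.array([0, 6, 3, 2, 0], dtype=float)

# Option 1: rescale expected SFS to the observed total number of SNPs.
by_snps = expected * observed.sum() / expected.sum()

# Option 2: rescale so the expected mean pairwise differences match
# the observed mean pairwise differences (hets-based normalization).
by_pi = expected * pairwise_diffs(observed) / pairwise_diffs(expected)
```

Unless the two spectra already have the same shape, the two rescalings give different totals, which is why the sum of `expected_sfs` need not equal the observed SNP count.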
I realised that in the paper and tutorial the fit of the model does not seem to be quantified by residuals between the expected and observed SFS - is there a reason for this? E.g. do you think f_stats or all_pairs_ibs is more powerful?
For complex models with many samples & populations I found it's difficult to get a really good model fit to the full SFS on real data. But by focusing on the F-stats we can at least try to get a model that fits decently on the first couple moments of the distribution.
If you just have a single population, then I'd suggest increasing the number of epochs/parameters in the demographic model, and see if you can get a better fit that way.
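If you do want residuals between the expected and observed SFS, one simple option (a sketch, not anything momi2 provides; the function name and toy arrays are hypothetical) is Pearson-style residuals, treating each SFS entry as an approximately Poisson count after rescaling the expected spectrum to the observed total:

```python
import numpy as np

def sfs_residuals(observed, expected):
    """Pearson-style residual per SFS entry.

    Treats each entry as an approximately Poisson count with mean given
    by the expected SFS, rescaled to the observed total first so that
    only the *shape* of the spectrum is compared.
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    scaled = expected * observed.sum() / expected.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        res = (observed - scaled) / np.sqrt(scaled)
    # Entries with zero expectation (e.g. monomorphic classes) get 0.
    return np.where(scaled > 0, res, 0.0)

observed = np.array([0, 500, 210, 120, 0], dtype=float)
expected = np.array([0, 50, 25, 8, 0], dtype=float)
res = sfs_residuals(observed, expected)
# Entries with large |residual| flag frequency classes the model fits poorly.
```

Note that with linked sites the Poisson assumption understates the true variance, so these residuals are best read as a relative diagnostic across frequency classes rather than as calibrated z-scores.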
log-likelihood is not a valid metric for comparison due to non-independence of linked sites
Depends what you mean by "valid". Under certain limiting conditions, the composite-log-likelihood should be statistically consistent (when there's many "independent-enough" blocks). Outside those conditions, it won't be consistent, but then I don't think any other estimator/loss function based on the SFS is either.
[1] https://momi2.readthedocs.io/en/latest/api.html#momi.DemographicModel.set_data
I am analysing the fit of a model inferred by momi2, by looking at its expected SFS compared to the observed SFS (i.e. empirically from the data). I calculate the expected SFS with:
expected_sfs = model_sim.expected_sfs()
However the sum of expected_sfs is not the same as the sum of the observed SFS from my data. E.g. in my data the number of counts is 1,291,803 but the sum from expected_sfs = 944,199.7. Of course I can normalise each and compare those, which e.g. gives:
But I wonder, is the fact that expected_sfs != observed_sfs indicative of a problem in my analysis/code? It was surprising to me also that there was not better agreement between the two curves; though of course that could indicate a poor choice of model. I realised that in the paper and tutorial the fit of the model does not seem to be quantified by residuals between the expected and observed SFS - is there a reason for this? E.g. do you think f_stats or all_pairs_ibs is more powerful? It is my understanding that using log-likelihood is not a valid metric for comparison due to non-independence of linked sites.
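The "normalise each and compare" step described above can be sketched as follows (plain numpy arrays rather than momi2's SFS objects; the entries below are invented, only the totals match the numbers quoted in the question):

```python
import numpy as np

# Hypothetical spectra whose totals match the question:
# 1,291,803 observed counts vs a sum of 944,199.7 from expected_sfs.
observed = np.array([0, 800000, 300000, 191803, 0], dtype=float)
expected = np.array([0, 560000, 250000, 134199.7, 0], dtype=float)

# Normalising each to sum to 1 compares only the *shape* of the spectra,
# so the mismatch in totals no longer matters.
obs_p = observed / observed.sum()
exp_p = expected / expected.sum()

# One simple summary of disagreement between the normalised curves,
# e.g. the total variation distance (0 = identical, 1 = disjoint).
tv = 0.5 * np.abs(obs_p - exp_p).sum()
```

A mismatch in raw totals is therefore not by itself evidence of a bug; after normalisation, remaining disagreement between the curves reflects model misfit (or noise), not the scaling convention.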