Error during likelihood evaluation for data from the paper

noscode commented 2 years ago

Hi @jackkamm,

Some time ago you have provided me SFS data used in the paper about momi2, thank you very much for that! Now I am trying to evaluate the value of likelihood for the model from the paper. However I faced some problems and maybe you can help me with that?

I constructed the demographic model (without any parameters, just with fixed values) and set SFS data to it. Then I tried to evaluate log_likelihood and I got an error. The code is something like that:

model = momi.DemographicModel(N_e=1.2e4, gen_time=1, muts_per_gen=1.22e-8)
... # add leafs, merges and so on with values from the paper
data = momi.Sfs.load("sgdp_autosomes_mergeLowCov_sfs.json.gz")
length = 648467424  # 2.13 * 10^9 * 0.93 * 0.32 * 0.93 * 1.1
model.set_data(data, length=length)
print(model.log_likelihood())

The error occur in the last line and appears after some time of evaluation:

Traceback (most recent call last):
  File "momi_model.py", line 106, in <module>
    print(model.log_likelihood())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/momi/demo_model.py", line 817, in log_likelihood
    return self._get_surface().log_lik(self._get_x())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/momi/likelihood.py", line 100, in log_lik
    ret = self._log_lik(x, vector=vector)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/momi/likelihood.py", line 123, in _log_lik
    ret = self._get_multinom_loglik(demo, vector=vector) + self._mut_factor(demo, vector=vector)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/momi/likelihood.py", line 161, in _mut_factor
    vector, self.p_missing, self.use_pairwise_diffs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/momi/likelihood.py", line 437, in _mut_factor
    return _mut_factor_het(sfs, demo, mut_rate, vector, p_missing)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/momi/likelihood.py", line 447, in _mut_factor_het
    sfs.sampled_pops)[sfs.ascertainment_pop])
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/momi/compute_sfs.py", line 171, in expected_heterozygosity
    sampled_n=demography.sampled_n)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/momi/data/configurations.py", line 82, in __init__
    raise ValueError("config greater than sampled_n")
ValueError: config greater than sampled_n

I have tried to look at momi/data/configurations.py file and print sampled_n and max_n values right before the comparison. The result was the following:

# model is built, now evaluate data = momi.Sfs.load("")
# population labels ['Ami', 'Dai', 'Denisova', 'Dinka', 'Han', 'LBK', 'Loschbour', 'MA1', 'Mbuti', 'Neanderthal', 'Russian', 'San', 'Sardinian', 'UstIshim', 'Yoruba']
[4 6 2 4 4 2 2 1 6 2 4 4 4 2 4] 
 [4 6 2 4 4 2 2 1 6 2 4 4 4 2 4]
# data is loaded and set to model, now evaluate model.log_likelihood()
# population labels ('Mbuti', 'LBK', 'Sardinian', 'Loschbour', 'MA1', 'Han', 'UstIshim', 'Neanderthal')
[6 2 4 2 1 4 2 2] 
 [6 2 4 2 1 4 2 2]
# population labels ('Mbuti', 'LBK', 'Sardinian', 'Loschbour', 'MA1', 'Han', 'UstIshim', 'Neanderthal')
[6 2 4 2 1 4 2 2] 
 [2 2 2 2 2 2 2 2]

# Error

I am not sure but maybe this behaviour is connected with sample size of MA1 population? It is the only one with haploid data.

I am looking forward for your reply, thank you very much for your help in advance!

Best regards, Ekaterina

jackkamm commented 2 years ago

MA1 and Neanderthal should be set as "non-ascertained" populations when calling set_data, like so:

model.set_data(data, length=length, non_ascertained_pops=["MA1", "Neanderthal"])

See also the 2nd paragraph of A.4.1 in the paper.

I believe that doing this should also fix the error above.

Explanation: As part of the likelihood computations of the Poisson model, momi estimates the fraction of missing alleles in each of the ascertained populations. This estimate requires 2 alleles per population. By setting MA1 as non-ascertained, momi won't won't compute the missing-allele probability for it, and the problem should be avoided.

noscode commented 2 years ago

Hi @jackkamm,

Thank you very much, yes, that works!

Best regards, Ekaterina

popgenmethods / momi2

Error during likelihood evaluation for data from the paper #53