trevorcousins / cobraa

cobraa (coalescent based reconstruction of ancient admixture): a hidden Markov model to infer ghost admixture using just a diploid sequence.
MIT License
2 stars 0 forks source link

Initial guess not within specified bounds #2

Closed marqueda closed 2 months ago

marqueda commented 3 months ago

Dear Trevor,

I came across another issue - it occurs only for some individuals in my dataset, and not for others. In those individuals, the final likelihood output is "nan" and two error messages turn up in the log file (attached) saying:

Error message 1: /cluster/work/gdc/shared/p703/bin/cobraa/utils.py:616: RuntimeWarning: invalid value encountered in log ll = np.sum(np.log(scales))

Error message 2: /cluster/work/gdc/shared/p703/bin/cobraa/BaumWelch.py:373: OptimizeWarning: Initial guess is not within the specified bounds res = minimize(self.neg_objective_function_structure_rho,params_initial_guess,args=(A_new,tm_dummy,self.T_S_index,self.T_E_index),method=self.optimisation_method,bounds=bnds,options={'xtol': self.xtol,'ftol': self.ftol})

Do I need to specify different bounding parameters for those individuals or what am I doing wrong?

Here is a log file: cobraababa.PpundaVIC_12.log

And the output file: PpundaVIC.ts15.te28.its100.final_parameters.txt

Thank you for your help!

Best wishes, David

trevorcousins commented 2 months ago

Hi David,

Sorry for the delay in response, I've been on vacation.

I believe the error may be to do with the recombination rate being so large - you have set it as 10 times the mutation rate. I think there's probably something in the theoritcal transition matrix generation that does not like such a large value. Can you share an input mhs file and I'll take a look at this?

In the meantime, you have two options that might help. The first is to set the recombination rate -mu_over_rho_ratio at something less extreme, e.g. 1. The second (I'm less confident this will help) is to use the flag "-rho_fixed" in the command line, which means the algorithm will not optimise the recombination rate.

Two more general points. 1) The PSMC algorithm in my experience (and as reported in the original paper) seems to do ok even if the recombination rate it uses is way off from the actual value, so using e.g. "-mu_over_rho_ratio 1" may be ok. 2) The recombination rate being much higher than the mutation rate is in general very bad news for coalescent HMM, for relatively well understood reasons, e.g. see here for theoretical explanation https://github.com/popgenmethods/lecture_notes/blob/main/Ch9.5_TMRCA_estimation.pdf . Intuitively, if a segment recombines much more frequently than it mutates then there is no information for it's age to be inferred.

What species is this? I would be very curious about your results.

(All mentions of recombination or mutation rate above are, of course, in scaled units: recombination_rate=rho=4Nr and mutation_rate=theta=4Nmu).

Best wishes,

Trevor

marqueda commented 2 months ago

Dear Trevor,

Thank you for your reply and explanation. I am working with Lake Victoria cichlid data, which have a low SNP mutation rate of 2.4e-9 and a recombination rate of 2.3e-8 per base pair, so a quite extreme mu over rho unfortunately. I will try the two fixes you suggested, and send you the corresponding mhs files later, as the servers is currently down and I cannot access these files.

Best wishes, David

trevorcousins commented 2 months ago

There are few people in the Durbin group working on the cichlid system, we would be very interested to hear what you get! Please get in touch with the results :)

However the high mu/r is quite a serious problem for this sort of inference, and I'm not sure how much we can trust it.

marqueda commented 2 months ago

I know, I have reached out to Stephan Schiffels before, as our manuscript uses mostly MSMC - I know, same principle / problem - and currently I am just dabbling with cobraa to see whether it will give us further insight as the model might better suit the data with the Victoria cichlids that have hybridized again and again and thus are closer to the structured population model of cobraa... however, currently it seems that the two haplotype resolution with cobraa is insufficient for the recen time scales we are interested, so that MSMC might hold more information indeed. More to come soon!

trevorcousins commented 2 months ago

Indeed, just the two lineages (as in cobraa) will only give you ancient time resolution. However, if you have a mhs file per haplotypes you could give them all to cobraa and it should get more recent time, like in MSMC/MSMC2 (you would have to tweak the spread_1 and spread_2 parameters, though). Happy to do a call if helpful! Note that Stephan and Ke Wang have MSMC-IM which might be helpful for your purposes

marqueda commented 2 months ago

MSMC-IM is indeed what I ended up using. Thank you for your offer for a call regarding adding more haplotypes, I might pick you up on that! Currently, our main interest is coalescence between species, so rCCR or in MSMC-IM the M(t) / m(t) patterns to get divergence times, rather than resolving the hybrid swarm origin timing for which cobraa would be better suited. But I will send you an e-Mail if I get to explore the latter further soon. Would there be a theoretically ideal way to tweak spread_1 and spread_2 with more haplotypes?

trevorcousins commented 2 months ago

Great stuff.

I believe there is some internal MSMC code that I believe automatically tweaks spread_1 and spread_2, depending on the number of haplotypes given (I can't tell you specifically what it is doing, but we could either ask Stephan or look at the paper/code). For cobraa, the best thing would be to try a series of different parameter combinations and choose those which have highest likelihood / best looking curve (without sharp fluctuations, etc).

Best,

Trevor