veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
201 stars 68 forks source link

Convergence issue with RELAX #1581

Closed shenglin-liu closed 1 year ago

shenglin-liu commented 1 year ago

Dear authors of RELAX,

I have been experiencing some convergence issues with RELAX. For some genes, RELAX can yield highly different p or k values when I ran them multiple times. I am aware that the same issue has been raised before (e.g., #1551, #1237, #1161 and #730). I have adjusted my commandline according to these pages. For example, I used one of the most recent versions of hyphy (2.5.46; installed using conda), I limited the number of CPUs to one, and I used multiple starting points. But the convergence issue still exists. I need to run RELAX on over 15000 genes. Is there a way to know which genes are having convergence issues by looking at their JSON outputs? Then I can pay special attention to them and treat them separately.

Here is an example where I ran the same gene for 10 times.

  LRP p k
REP_00 13.84 0.00020 0.04101
REP_01 15.22 0.00010 0.04210
REP_02 14.86 0.00012 0.04876
REP_03 16.39 0.00005 0.00002
REP_04 14.85 0.00012 0.04841
REP_05 15.24 0.00009 0.04116
REP_06 14.83 0.00012 0.04830
REP_07 14.85 0.00012 0.04829
REP_08 0.01 0.90859 2.70721
REP_09 14.83 0.00012 0.04827

Here is a template of the commandline I used. hyphy CPU=1 relax --starting-points 10 --alignment gene.fa --tree tree.nh --test test --reference ref --output out.json > out.log

Also, I was using "--starting-points 10". But I don't see any increase in my run time. Is that normal?

Thank you very much!

Best regards, Shenglin

spond commented 1 year ago

Dear @shenglin-liu,

Unfortunately, there is no reliable way to diagnose, from a single run, whether or not a complex mixture model like RELAX converged. From the example you show above, 8/10 runs basically gave you the same answer (relaxed selection with a p-value on the order of 0.0001), so that's pretty good.

RELAX does some internal checks during a run and if detects some convergence issues, it will attempt to resolve them during the same run (transparent to the user). However, as these are heuristic, you are not guaranteed success.

--starting-points N does not perform N distinct RELAX runs. What is does instead, is perform N quick and dirty optimizations at the start of the run, then picks the best result, and runs the full model using it as the starting point. The benefit is that, as you noticed, runtimes are not strongly affected.

If all you want is the standard RELAX test, and you wish to reduce (but not eliminate) issues due to "misconvergence", I suggest the following options

hyphy CPU=1 relax --starting-points 100 --grid-size 2000 --models Minimal --alignment gene.fa --tree tree.nh --test test --reference ref --output out.json > out.log

Setting --grid-size to a higher value makes the runs a bit slower, but improves the ability of earlier optimization runs to find a good starting point. Setting --models Minimal avoids fitting even MORE complex models (like the General Descriptive).

HTH, Sergei

PS If your alignments are smaller, and if you see Collapsed rate classes in the output, consider reducing the number of rate classes to 2, using --rates 2

Selection mode dN/dS Proportion, % Notes
Negative selection 0.000 47.395
Negative selection 0.000 5.352 Collapsed rate class
Diversifying selection 5.138 47.253