vlanore / diffsel

A C++ program to detect convergent evolution using Bayesian phylogenetic codon models.

how to determine the iterations #5

Closed kanglizhu closed 5 years ago

kanglizhu commented 5 years ago

After running 200 iterations, I used Raftery and Lewis's diagnostic, implemented in the R package coda, to estimate the number of necessary iterations, and then ran 120% of the estimated number of iterations.

When I use `md1 <- mcmc(mydata); test = raftery.diag(md1, r=0.01)` to estimate the number of necessary iterations, a total of 1124 iterations is needed. However, when I use `md1 <- mcmc(mydata); test = raftery.diag(md1, q=0.025, r=0.005, s=0.95, converge.eps=0.001)`, a total of 4495 iterations is needed.
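[Editorial note on why tightening `r` inflates the estimate so much: the Raftery-Lewis diagnostic starts from the minimum sample size needed for *independent* draws, N_min = z² q(1−q) / r², where z is the (1+s)/2 normal quantile; coda then multiplies this by a dependence factor estimated from the chain's autocorrelation. The following Python sketch computes only the N_min part, not the chain-specific dependence factor:]

```python
from math import ceil
from statistics import NormalDist

def raftery_nmin(q=0.025, r=0.005, s=0.95):
    """Raftery-Lewis minimum sample size assuming independent draws:
    N_min = z^2 * q * (1 - q) / r^2, with z = Phi^{-1}((1 + s) / 2)."""
    z = NormalDist().inv_cdf((1 + s) / 2)
    return ceil(z * z * q * (1 - q) / r ** 2)

print(raftery_nmin(r=0.01))   # coda defaults q=0.025, s=0.95 -> 937
print(raftery_nmin(r=0.005))  # tighter accuracy r=0.005     -> 3746
```

Halving `r` quadruples N_min (937 vs 3746); the reported totals of 1124 and 4495 are these minima scaled up by the dependence factor coda estimates from the chain's autocorrelation (roughly 1.2 here).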

Here I ran diffsel for 2,000, 4,495 and 10,000 iterations. The sites identified by the 4,495- and 10,000-iteration runs are the same, but they differ from those of the 2,000-iteration run. I have uploaded the *sigdiffsel results as supplements: singdiffsel_results.zip

Do you mind explaining how to calculate the necessary number of iterations?

Best wishes, kangli

vlanore commented 5 years ago

Regarding how to estimate the number of iterations: what we usually do for diffsel is

  1. try to guess the number of iterations (1500 is often a good baseline);
  2. use tracecomp to see if it converged;
  3. if it converged, then use that number of iterations for this dataset and other datasets of the same complexity;
  4. if it did not converge, increase the number of iterations and go back to step 2.

After a while, we usually get a good sense of how many iterations are needed for a given problem. And tracecomp is always there to validate that a given run did indeed converge.

The Raftery & Lewis diagnostic implemented in the review pipeline was meant to automate this process, but I did not write it and cannot help you with it specifically. If you need help with it, you should open a ticket on the repo at https://gitlab.in2p3.fr/pveber/reviewphiltrans/issues where the developers of the script will see it.

Regarding your results for 2,000, 4,495 and 10,000 iterations: it is entirely normal for the list to change slightly from run to run. The whole algorithm is probabilistic, and there is always some variability in the results. Effects that are very close to the 0.9 threshold may appear or disappear depending on the run.

In your case, the line that changes between runs is

313 17  0.900625    1.09443

which is present only in the 2,000-iteration run. Note that the probability is very close to 0.9. In another run, this probability might, for example, have been estimated at 0.8999, and the site would thus not have been selected.
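[Editorial note putting numbers on that intuition, as a back-of-the-envelope sketch rather than anything diffsel computes: a posterior probability estimated from n samples carries a Monte Carlo standard error of roughly sqrt(p(1−p)/n). For p ≈ 0.9 this error dwarfs the 0.000625 gap between 0.900625 and the 0.9 threshold, even at 10,000 iterations:]

```python
from math import sqrt

p = 0.900625    # estimated posterior probability for site 313 above
gap = p - 0.9   # distance to the 0.9 detection threshold (0.000625)

for n in (2_000, 4_495, 10_000):
    # Binomial standard error; assumes independent samples, so the
    # true error of an autocorrelated MCMC estimate is larger still.
    se = sqrt(p * (1 - p) / n)
    print(f"n={n:>6}: SE ~ {se:.4f}  vs threshold gap {gap:.6f}")
```

At n = 2,000 the standard error is about 0.0067, more than ten times the gap to the threshold, so a site this borderline can land on either side from run to run purely by chance.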

The sites that appear or disappear depending on the run are sites that are difficult to classify. This is unavoidable in threshold-based detection. As long as you have made sure that your chain converges (e.g. using tracecomp), your results, minus these difficult sites, should be consistent across runs.

I hope this helps with your problem.

kanglizhu commented 5 years ago

Thank you so much for your help. I will use tracecomp to check whether it converged.

best wishes, kangli