Different results in command line and web

LalicJ commented 8 months ago

Hi, I met different results when I run busted analysis with command line and web(datamonkey.org). My command line has the same parameters with web as follow : hyphy busted --alignment pal2nal_gapout_NHP_AAV.fasta --tree AAVrh85.treefile --branches Foreground --kill-zero-lengths No --error-sink No CPU=100 > hyphy_busted_85_new.out. The analysis logs of the command and web versions are as follows: hyphy_busted_85_new.txt log.txt The p value is less than 0.05 in the web version, but not in the command version. The only difference I can think of is that in the command version I entered a tree that I built myself. Do you have any suggestions? I'm really racking my brain to know why this is. Any help would be greatly appreciated. :)@spond

spond commented 8 months ago

Dear @LalicJ,

This is quite unusual, since the log likelihood is different for all models including the baseline GTR. This leads me to hypothesize that the Datamonkey tree is different. Datamonkey might be using an NJ tree. This is also what your comment seems to indicate.

Using different trees can most decidedly affect the result of a selection analysis; since you are only using a single foreground branch, the analysis could be quite unstable (small sample size) I pulled your alignment file from Datamonkey and ran BUSTED locally, using a NJ tree, and using an ML tree built by raxml-ng.

The branch you are testing is rather short (~0.001, so effectively zero). For such a short branch there is no power to detect anything at all, so the expected result is no selection

ML tree

NJ tree (quite different!)

Because the alignment is small, you can also use 2 rates (especially since in your original runs, you were getting a lot of "Collapsed Rate" notices).

ML TREE

hyphy busted --alignment msa.fas --tree ml.nwk --starting-points 5 --rates 2 --syn-rates 2 --branches TEST

* For *test* branches, the following rate distribution for branch-site combinations was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.996     |    0.000    |       Not supported by data       |
|      Diversifying selection       |     6.998     |   100.000   |                                   |

...

## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.4509**.

NJ TREE

hyphy busted --alignment msa.fas --tree nj.nwk --starting-points 5 --rates 2 --syn-rates 2 --branches TEST

* For *test* branches, the following rate distribution for branch-site combinations was inferred
|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.934     |   67.411    |                                   |
|      Diversifying selection       |     1.116     |   32.589    |                                   |

....
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.5000**.

OK, so what happened with the Datamonkey result? Looking at the result page, all of the signal comes from a single codon, and a single substitution (V → I)

However, if all you do is change the tree a little bit (like this NJ tree which HyPhy builds if you provide --tree neighbor-joining option), the substitution is now on the ancestral branch and does not contribute to the evolution on the focal lineage.

This highlights the inherent fragility of these "single-branch" analyses.

HTH, Sergei

LalicJ commented 8 months ago

Thank you very much for your answer!!! The data set I used above is just an attempt to replicate the results of someone else's hyphy busted analysis. If, as you say, I actually have a very large data set (1000 + sequences but very close evolutionary distance), what is the recommended method to build the tree, and do the --rate and --starting-points parameters need to be changed? (Same to detect "single-branch")

spond commented 8 months ago

Dear @LalicJ,

Just to confirm : are you looking to detect selection on a single (predefined) branch using a large alignment of closely related species?

Best, Sergei

LalicJ commented 8 months ago

Actually, I want to detect selection on a single (predefined) branch using a large alignment of closely related virus sequences. As a rule of thumb, a virus sequence with a positive selection signal is likely to perform better in some ways.

spond commented 8 months ago

Dear @LalicJ,

Great. If you have a single predefined branch, then I would suggest the following.

Be aware of the length of the focal branch. If it's too short (e.g. ~0.01 subs/site), then all the inference will be based on as few a single substitution. This will typically result in a loss of power.
Using 100+ or 1000+ closely related sequences to study the evolution of a single branch is likely not efficient (computationally or statistically). Because of the Markovian nature of the substitution process, the further away another sequence is from the sequence of interest, the less influence the former will have on the latter.

For example, if you wish to study the evolution of the highlighted BLUE branch in this (466 sequences) IAV tree (this is an example file https://github.com/veg/hyphy/blob/master/tests/data/IAV-human-H1N1-HA.nex), then the relatively distant sequences from a different clade (red box) are unlikely to have any noticeable impact on what the ω estimate on the blue branch is.

I would suggest trimming the analysis to just the neighborhood of the node, and the path from the node to the root.

Best, Sergei

LalicJ commented 8 months ago

Thank you! I got it! I had also been struggling with the long detection time of large data sets. According to your suggestion, I will divide the data set and rebuild the tree for analysis. Thanks again for your help:).

spond commented 8 months ago

Dear @LalicJ,

Please let me know how it goes. I have found treemer (https://github.com/fmenardo/Treemmer) to be quite useful for tree trimming. Once you've trimmed the tree, you can use HyPhy scripts from https://github.com/veg/hyphy-analyses/tree/master/LabelTrees (trim-label-tree.bf) to make sure the alignment and the tree have the same set of sequences and to label the nodes you want labelled.

Best, Sergei

LalicJ commented 7 months ago

Thanks for your advice, I will use treemerto re-analyze later. My current approach is to regroup the sequences to build multiple smaller trees and no problems found at the moment.

github-actions[bot] commented 5 months ago

Stale issue message

veg / hyphy

Different results in command line and web #1684