Closed LalicJ closed 5 months ago
Dear @LalicJ,
This is quite unusual, since the log likelihood is different for all models including the baseline GTR
. This leads me to hypothesize that the Datamonkey tree is different. Datamonkey might be using an NJ tree. This is also what your comment seems to indicate.
Using different trees can most decidedly affect the result of a selection analysis; since you are only using a single foreground branch, the analysis could be quite unstable (small sample size) I pulled your alignment file from Datamonkey and ran BUSTED locally, using a NJ tree, and using an ML tree built by raxml-ng
.
The branch you are testing is rather short (~0.001, so effectively zero). For such a short branch there is no power to detect anything at all, so the expected result is no selection
ML tree
NJ tree (quite different!)
Because the alignment is small, you can also use 2
rates (especially since in your original runs, you were getting a lot of "Collapsed Rate" notices).
ML TREE
hyphy busted --alignment msa.fas --tree ml.nwk --starting-points 5 --rates 2 --syn-rates 2 --branches TEST
* For *test* branches, the following rate distribution for branch-site combinations was inferred
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.996 | 0.000 | Not supported by data |
| Diversifying selection | 6.998 | 100.000 | |
...
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p = 0.4509**.
NJ TREE
hyphy busted --alignment msa.fas --tree nj.nwk --starting-points 5 --rates 2 --syn-rates 2 --branches TEST
* For *test* branches, the following rate distribution for branch-site combinations was inferred
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.934 | 67.411 | |
| Diversifying selection | 1.116 | 32.589 | |
....
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p = 0.5000**.
OK, so what happened with the Datamonkey result? Looking at the result page, all of the signal comes from a single codon, and a single substitution (V → I)
However, if all you do is change the tree a little bit (like this NJ tree which HyPhy builds if you provide --tree neighbor-joining
option), the substitution is now on the ancestral branch and does not contribute to the evolution on the focal lineage.
This highlights the inherent fragility of these "single-branch" analyses.
HTH, Sergei
Thank you very much for your answer!!! The data set I used above is just an attempt to replicate the results of someone else's hyphy busted analysis. If, as you say, I actually have a very large data set (1000 + sequences but very close evolutionary distance), what is the recommended method to build the tree, and do the --rate and --starting-points parameters need to be changed? (Same to detect "single-branch")
Dear @LalicJ,
Just to confirm : are you looking to detect selection on a single (predefined) branch using a large alignment of closely related species?
Best, Sergei
Actually, I want to detect selection on a single (predefined) branch using a large alignment of closely related virus sequences. As a rule of thumb, a virus sequence with a positive selection signal is likely to perform better in some ways.
Dear @LalicJ,
Great. If you have a single predefined branch, then I would suggest the following.
For example, if you wish to study the evolution of the highlighted BLUE branch in this (466 sequences) IAV tree (this is an example file https://github.com/veg/hyphy/blob/master/tests/data/IAV-human-H1N1-HA.nex), then the relatively distant sequences from a different clade (red box) are unlikely to have any noticeable impact on what the ω estimate on the blue branch is.
I would suggest trimming the analysis to just the neighborhood of the node, and the path from the node to the root.
Best, Sergei
Thank you! I got it! I had also been struggling with the long detection time of large data sets. According to your suggestion, I will divide the data set and rebuild the tree for analysis. Thanks again for your help:).
Dear @LalicJ,
Please let me know how it goes. I have found treemer
(https://github.com/fmenardo/Treemmer) to be quite useful for tree trimming. Once you've trimmed the tree, you can use HyPhy
scripts from https://github.com/veg/hyphy-analyses/tree/master/LabelTrees (trim-label-tree.bf
) to make sure the alignment and the tree have the same set of sequences and to label the nodes you want labelled.
Best, Sergei
Thanks for your advice, I will use treemer
to re-analyze later. My current approach is to regroup the sequences to build multiple smaller trees and no problems found at the moment.
Stale issue message
Hi, I met different results when I run busted analysis with command line and web(datamonkey.org). My command line has the same parameters with web as follow :
hyphy busted --alignment pal2nal_gapout_NHP_AAV.fasta --tree AAVrh85.treefile --branches Foreground --kill-zero-lengths No --error-sink No CPU=100 > hyphy_busted_85_new.out
. The analysis logs of the command and web versions are as follows: hyphy_busted_85_new.txt log.txt The p value is less than 0.05 in the web version, but not in the command version. The only difference I can think of is that in the command version I entered a tree that I built myself. Do you have any suggestions? I'm really racking my brain to know why this is. Any help would be greatly appreciated. :)@spond