veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
205 stars 69 forks source link

Different aBSREL results for same inputs #1722

Open glitterseaworms opened 1 month ago

glitterseaworms commented 1 month ago

Hello Sergei,

I'm not sure what's going on, but when I run the same inputs on my computer vs. a cluster with CPU=10 parameter set, I get different results for this gene. Attached herein are the 2 different output files. Do you know why this is happening? And which result should I trust, given how different they are?

Thanks always for your continued help.

ATP6_deep_output.txt ATP6_aBSREL_500mDeep_output.txt

spond commented 1 month ago

Dear @glitterseaworms,

Trust neither. There is a serious issue with the alignment itself: probably lack of homology. You can see this because many branch lengths in the file are infinite (the total tree length is > 10,000 subs/site!).

image

With such alignments, where sequences look so diverged as to be seemingly unrelated, all inference becomes very unstable and unreliable.

Would you share your input alignment here so that I could take a closer look?

Best, Sergei

glitterseaworms commented 1 month ago

Dang, my data is really giving me a hard time. This is such a bummer. I think my worms in general are just very divergent. What is a tree length subs/site value cut off that you would suggest is reasonable to trust results at? I am having trouble attaching the alignments here, as it says file type not supported--can I please email them to you? Thank you for your dedication and time.

glitterseaworms commented 1 month ago

Hi Sergei, I think I may have solved the issue. It looks like if I reduce my tree and alignments to the family taxon level as opposed to representatives from the whole order that I had included, this alleviates the subs/site issue. For ATP6 gene for instance, I am now receiving a subs/site value of 9.956 as opposed to 7207.802 with the larger dataset. I will keep you posted as tests finish, but I think I will proceed by re-running everything with the reduced trees.

spond commented 1 month ago

Dear @glitterseaworms,

Thanks for the update! Could you attach the reduced ATP6 (as a .txt) file? I'll take a quick look to make sure nothing looks suspicious.

I took a look at the full (56 sequences) ATP6 file, and if you use raxml-ng to infer the ML tree, its total length is ~12 (~21 with codon models), which is not unreasonable. What have you been using to build the trees?

Here's what BUSTED-E on it looks like. I typically use it as a screen to look for oddities...

 hyphy busted --alignment ATP6_codon_aligned.fas --tree ATP6_codon_aligned.fas.raxml.bestTree --code Invertebrate-mtDNA --intermediate-fits ATP6.fit --error-sink Yes --output ATP6.BUSTED-E.json

....

* Log(L) = -22981.63, AIC-c = 46236.08 (135 estimated parameters)
* For *test* branches, the following rate distribution for branch-site combinations was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.001     |   67.171    |                                   |
|        Negative selection         |     0.018     |   32.479    |                                   |
|         Neutral evolution         |     1.000     |    0.257    |                                   |
|         Error absorption          |    100.000    |    0.094    |                                   |

* The following rate distribution for site-to-site **synonymous** rate variation was inferred

|               Rate                | Proportion, % |               Notes               |
|-----------------------------------|---------------|-----------------------------------|
|               0.079               |    53.394     |                                   |
|               0.725               |    24.993     |                                   |
|               3.593               |    21.613     |                                   |

### No evidence for episodic diversifying positive selection under the unconstrained model, skipping constrained model fitting
----
## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.5000**.

Use https://observablehq.com/@spond/busted to visualize the resulting JSON

image image image

Looks like you have a couple of species that are highly diverged, possible sequencing/assembly errors because there are things like 3x substitutions along a single branch (like the red branch in the tree), and runs of sequences where there are may be some homology issues (or they are just too diverged).

Best, Sergei

glitterseaworms commented 1 month ago

Hi Sergei,

Thanks so much for responding very thoroughly. I truly appreciate you and will be acknowledging you in my manuscript :-) This means a lot and has helped me better understand results and how to properly work with my dataset. I use RAxML to build to input trees. Dang, not sure what is going on with the red branch, but that species sequence should hypothetically be correct. I do think my clade of interest is very divergent in general. I have attached the reduced ATP6 alignment. Please let me know if anything seems strange with it. All genes seem to run properly with the reduced dataset, so I am going to re-run all 13 genes with aBSREL, Busted and Meme using this reduced dataset instead. Looking forward to hearing your thoughts on my alignment, and truly, thanks again.

Best, Avery ATP6_reduced_codon_aligned.txt

glitterseaworms commented 1 month ago

Hello again,

I just wanted to follow up regarding the issue of different results when running replicates. I just finished running 4 replicates of aBSREL with my reduced data set. All subs/site values are values that can be trusted. However, for 6 of the 13 genes I am testing, I obtain variation in results across the replicates. I've attached the output files for these genes. If you have the ability, can you please see if this is normal? If so, my plan was to only report results that are the same across the 4 replicates, and dismiss other branches that show up.

Thanks again, Avery ATP6_aBSREL_Deep_rep0_output.txt ATP6_aBSREL_Deep_rep1_output.txt ATP6_aBSREL_Deep_rep2_output.txt ATP6_aBSREL_Deep_rep3_output.txt COX2_aBSREL_Deep_rep0_output.txt COX2_aBSREL_Deep_rep1_output.txt COX2_aBSREL_Deep_rep2_output.txt COX2_aBSREL_Deep_rep3_output.txt ND3_aBSREL_Deep_rep0_output.txt ND3_aBSREL_Deep_rep1_output.txt ND3_aBSREL_Deep_rep2_output.txt ND3_aBSREL_Deep_rep3_output.txt ND4_aBSREL_Deep_rep0_output.txt ND4_aBSREL_Deep_rep1_output.txt ND4_aBSREL_Deep_rep2_output.txt ND4_aBSREL_Deep_rep3_output.txt ND5_aBSREL_Deep_rep0_output.txt ND5_aBSREL_Deep_rep1_output.txt ND5_aBSREL_Deep_rep2_output.txt ND5_aBSREL_Deep_rep3_output.txt ND6_aBSREL_Deep_rep0_output.txt ND6_aBSREL_Deep_rep1_output.txt ND6_aBSREL_Deep_rep2_output.txt ND6_aBSREL_Deep_rep3_output.txt

spond commented 1 month ago

Dear @glitterseaworms,

Could you attach the alignment & tree for one of the datasets where you are seeing discordant results (e.g. ATP6)?

Best, Sergei

glitterseaworms commented 1 month ago

Attached below: 4NT_14AA_Tree_48terminals_allloci_500mDeep.txt ATP6_reduced_codon_aligned.txt

glitterseaworms commented 1 month ago

Hi Sergei,

Not sure if that's affecting it, but my tree was a mess. I am going to try using the attached tree instead with no branch lengths and unrooted. I'll run 4 replicates and see if I have the same issue. I'll keep you posted.

Avery 4NT_14AA_Tree_48terminals_allloci_unrooted_noBL_deep.txt

spond commented 1 month ago

Dear @glitterseaworms,

Ah, it looks like the default optimization precision is insufficient for this dataset (probably still too divergent). If you see negative LRTs (especially <-1) in the testing table, this is an indication. Let me look into why this happening. A few days.

Best, Sergei

glitterseaworms commented 1 month ago

Thanks Sergei! In case this is useful for the investigation in some way, I tried running ATP6 with an unrooted input tree containing branch lengths and one without branch lengths previously specified. Both outputs are attached. As you can see, aBSREL analysis with branch lengths on input tree results in subs/site 9.955, but without branch lengths on input tree results in 1583.030. However, both outputs contain -LRTs in the testing tables. Attached are the 2 different trees I used, and the 2 output files. Hopefully this helps. Thank you again!! 4NT_14AA_Tree_48terminals_allloci_unrooted_noBL_deep.txt 4NT_14AA_Tree_48terminals_allloci_unrooted_yesBL_deep.txt ATP6_aBSREL_Deep_unrooted_noBL_rep0_output.txt ATP6_aBSREL_Deep_unrooted_yesBL_rep0_output.txt