veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
200 stars 68 forks source link

aBSREL foreground lineages #1669

Closed JuliaLopezDelgado closed 4 months ago

JuliaLopezDelgado commented 7 months ago

Hello,

I am planning on using aBSREL to identify signatures of positive selection across seven foreground lineages, which are composed of individual species and also internal nodes with all descendants. As per the image attached, I want to test for selection in 1 (including A-J), 2 (including A-D), 3 (including B-D), A, B, C, and D.

Foreground nwk

Would I need to label seven foreground lineages with https://github.com/veg/hyphy-analyses/tree/master/LabelTrees and then run aBSREL seven times for each ortholog (I have >6,000 genes)? Would you recommend aBSREL over BUSTED for my analysis?

Thanks in advance!

spond commented 7 months ago

Dear @JuliaLopezDelgado,

  1. If you are using a species tree (as opposed to a gene tree), you can label all 7 branches with one tag (for aBSREL and BUSTED), and then pass --branches tag (replacing tag with whatever you use) on the command line.
  2. BUSTED and aBSREL will answer different questions. BUSTED will generally have more power (because it uses all 7 branches together), but you won't be able to tell which branch(es) are driving the signal. aBSREL will resolve individual branches (if possible), but may lose power because each branch is considered in isolation and will also be subject to multiple testing correction.

HTH, Sergei

JuliaLopezDelgado commented 7 months ago

Hello @spond,

Thanks for your quick response!

I am using gene trees, so for each ortholog I have a nucleotide alignment and a gene tree. I guess this means I have to label the foreground lineages and run them separately for each ortholog? Also, do I still need to specify the --branches tag if I have previously labelled the foreground lineages in each gene tree?

I think aBSREL might make more sense for my dataset given that information.

Best, Julia

spond commented 7 months ago

Dear @JuliaLopezDelgado,

  1. There is a way to label trees programmatically. Take a look at https://github.com/veg/hyphy-analyses/tree/master/LabelTrees

  2. aBSREL is the way to go based on what you describe, and yes, you need to specify the --branches argument, because otherwise the analysis will test all branches. This will take longer and will reduce power (because more tests will be done). Youl could also consider using the --srv Yes option to enable support for site-to-site rate variation.

Best, Sergei

JuliaLopezDelgado commented 7 months ago

Dear @spond,

Many thanks for your help!

I tried to label the foreground lineages sequentially with LabelTrees, but it did not seem to work when the branch had more than one label and it did not label lineage "2" in my example above. Therefore, I assume that for my analysis, where I need to test 7 foreground lineages, it is best to run these separately so I am running aBSREL for each of the >6000 genes 7 times. Hope this sounds okay.

Best, Julia

spond commented 7 months ago

Dear @JuliaLopezDelgado,

  1. It is highly inefficient to run aBSREL 7x to test 7 branches. You will be redoing 90+% of the work in each run, and also not applying the right multiple testing correction. You should instead label all the branches as the same group (e.g. test) and run aBSREL with that group selected.
  2. Can you please elaborate on your labeling issues? Why are looking to have multple labels per branch? Could you please provide an example tree and the desired labeling?

Remember, that aBSREL tests for selection on individual branches. In your example tree above I would simply select the following subset of branches and test them in aBSREL. Then you can draw conclusions about lineages (paths), based on what the constituent branches return.

image

Alternatively, you could use BUSTED to test for selection jointly on a group of branches (one group = one test), for example for Group 3 above

image

Best, Sergei

JuliaLopezDelgado commented 7 months ago

Hi @spond,

Thanks for letting me know!

I need to test for positive selection at different levels: species (A, B, C, D), genus (B-D) and family (A-J). I thought I needed to set different foreground lineage labels to specify this (eg. the fact that I want to test D as a species but also as part of the genus B-D and the family A-J. Can I test this by specifying the family A-J as the foreground lineage (as you highlighted in the first figure) but still identify species-level and genus-level selection?

Best, Julia

spond commented 7 months ago

Dear @JuliaLopezDelgado,

Well, it depends on the question at hand. Let's take a simple example here, which probably has nothing to do with your biological domain, but should illustrate the concepts.

The attached alignment comprises partial HIV-1 sequences from two individuals in a transmission pair. There are three sets of branches defined by the context

  1. Sequences from the Donor individual (orange)
  2. Sequences from the Recipient individual (green)
  3. Transmission branch (blue)
image

If I run aBSREL on the entire tree, I get some information about every branch (this is the new 2.5 version in develop branch, but that doesn't matter too much)

hyphy absrel --alignment HIV-pair.txt 

...

### Improving parameter estimates of the adaptive rate class model
* Log(L) = -2009.77, AIC-c =  4178.26 (78 estimated parameters)

### Testing selected branches for selection

|              Branch               |  Rates   |     Max. dN/dS     | Sites @ EBF>=100 |      Test LRT      |Uncorrected p-value |
|-----------------------------------|----------|--------------------|------------------|--------------------|--------------------|
|               Node1               |     2    |  676.61 ( 6.45%)   |          8       |       51.13        |       0.00000      |
|              D20_230              |     2    |   >1000 ( 0.68%)   |          2       |        4.16        |       0.04587      |
|              D20_233              |     2    |   >1000 ( 2.74%)   |          5       |       20.58        |       0.00001      |
|              Node24               |     2    |   81.70 ( 5.97%)   |          7       |        4.54        |       0.03759      |
|              R20_244              |     1    |   1.07 (100.00%)   |       N/A        |       -0.25        |       0.50000      |
|              Node16               |     3    |   >1000 ( 1.33%)   |          3       |       21.99        |       0.00001      |
|              D20_231              |     1    |   0.38 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              R20_242              |     1    |   0.65 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              Node17               |     1    |   1.36 (100.00%)   |       N/A        |       -0.18        |       0.50000      |
|              R20_243              |     1    |   1.50 (100.00%)   |       N/A        |       -0.12        |       0.50000      |
|              Node15               |     1    |  >1000 (100.00%)   |       N/A        |        0.23        |       0.39288      |
|              Node21               |     1    |   1.18 (100.00%)   |       N/A        |       -0.25        |       0.50000      |
|               Node2               |     1    |  >1000 (100.00%)   |       N/A        |        1.57        |       0.17864      |
|              R20_241              |     1    |   0.16 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              R20_245              |     1    |  >1000 (100.00%)   |       N/A        |        1.42        |       0.19390      |
|              R20_240              |     1    |   0.65 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|               Node5               |     1    |   0.32 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|               Node3               |     1    |  >1000 (100.00%)   |       N/A        |        0.46        |       0.33845      |
|              R20_239              |     1    |  >1000 (100.00%)   |       N/A        |        0.01        |       0.48517      |
|               Node4               |     1    |  >1000 (100.00%)   |       N/A        |        0.03        |       0.46543      |
|              D20_235              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              D20_237              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              D20_232              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              D20_234              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              R20_238              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              D20_236              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
----
### Adaptive branch site random effects likelihood test 
Likelihood ratio test for episodic diversifying positive selection at Holm-Bonferroni corrected _p =   0.0500_ found **3** branches under selection among **26** tested.

* Node1, p-value =  0.00000
* Node16, p-value =  0.00014
* D20_233, p-value =  0.00027

If I restrict absrel to a subset of branches, all that will happen is that fewer tests will be run, but for each individual branch, the result will be the same (the p-values are a bit different because the analysis is running multiple testing correction based on the # of branches tested).

hyphy absrel --alignment HIV-pair.txt --branches DONOR
....

### Testing selected branches for selection

|              Branch               |  Rates   |     Max. dN/dS     | Sites @ EBF>=100 |      Test LRT      |Uncorrected p-value |
|-----------------------------------|----------|--------------------|------------------|--------------------|--------------------|
|               Node1               |     2    |  676.61 ( 6.45%)   |       N/A        |    Not selected    |    for testing     |
|              D20_230              |     2    |   >1000 ( 0.68%)   |          2       |        4.16        |       0.04585      |
|              D20_233              |     2    |   >1000 ( 2.74%)   |          5       |       20.90        |       0.00001      |
|              Node24               |     2    |   81.70 ( 5.95%)   |          7       |        4.54        |       0.03759      |
|              R20_244              |     1    |   1.07 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|              Node16               |     3    |   >1000 ( 1.33%)   |          3       |       21.92        |       0.00001      |
|              D20_231              |     1    |   0.38 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              R20_242              |     1    |   0.65 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|              Node17               |     1    |   1.36 (100.00%)   |       N/A        |       -0.18        |       0.50000      |
|              R20_243              |     1    |   1.50 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|              Node15               |     1    |  >1000 (100.00%)   |       N/A        |       -0.16        |       0.50000      |
|              Node21               |     1    |   1.18 (100.00%)   |       N/A        |       -0.25        |       0.50000      |
|               Node2               |     1    |  >1000 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|              R20_241              |     1    |   0.16 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|              R20_245              |     1    |  >1000 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|              R20_240              |     1    |   0.65 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|               Node5               |     1    |   0.32 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|               Node3               |     1    |  >1000 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|              R20_239              |     1    |  >1000 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|               Node4               |     1    |  >1000 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|              D20_235              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              D20_237              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              D20_232              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              D20_234              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
|              R20_238              |     1    |   1.00 (100.00%)   |       N/A        |    Not selected    |    for testing     |
|              D20_236              |     1    |   1.00 (100.00%)   |       N/A        |        0.00        |       1.00000      |
----
### Adaptive branch site random effects likelihood test 
Likelihood ratio test for episodic diversifying positive selection at Holm-Bonferroni corrected _p =   0.0500_ found **2** branches under selection among **13** tested.

* Node16, p-value =  0.00008
* D20_233, p-value =  0.00012

So based on the result of this analysis there are some branches under selection in

1). Transmission (Node 1) 2). Donor (Node16 and D20_233)

image

A potential issue here is that by testing only individual branches, you may lose power to detect selection on the whole in the source or the recipient

You could run BUSTED on D, R branches to test for selection on the entire set (but losing the resolution to tell which branches are involved). There's no point in running T because it already is a single branch.

hyphy busted --alignment HIV-pair.txt --srv No --branches DONOR

...

## Branch-site unrestricted statistical test of episodic diversification [BUSTED]
Likelihood ratio test for episodic diversifying positive selection, **p =   0.0000**.

hyphy busted --alignment HIV-pair.txt --srv No --branches RECIPIENT

...


Branch-site unrestricted statistical test of episodic diversification [BUSTED]

Likelihood ratio test for episodic diversifying positive selection, p = 0.2718.



So here, `BUSTED` confirms selection in `D` and further shows that even when pooling over all branches, there's no evidence of selection in `R`.

Best,
Sergei

[HIV-pair.txt](https://github.com/veg/hyphy/files/13602685/HIV-pair.txt)
JuliaLopezDelgado commented 7 months ago

Hi @spond,

Thanks so much for the detailed explanation!

So would aBSREL treat each daughter node as its own independent foreground lineage (estimating dN/dS in a site-specific way in each daughter node and providing model fit results) when labelling the entire family (A-J in the figure above) as the foreground lineage?

Best, Julia

spond commented 7 months ago

Dear @JuliaLopezDelgado,

Correct; if you label all the branches in the maximal set of interested, each will be individually tested by aBSREL.

Best, Sergei

JuliaLopezDelgado commented 7 months ago

Dear @spond,

Great! I am running aBSREL with the family set as the foreground lineage.

Thanks so much for all your help, I really appreciate it.

Best, Julia

JuliaLopezDelgado commented 7 months ago

Hi @spond,

One last question: should I be using the full species tree (with all the species in the dataset, even if some are not present in the gene alignment being tested - as the genes may have different species subsets) or a pruned species tree that only contains the representatives present in each alignment?

Many thanks, Julia

spond commented 7 months ago

Dear @JuliaLopezDelgado,

aBSREL (and most HyPhy analyses in general) require that there be a 1-1 match between the tree and the alignment. In your situation you may want to use something like trim-tree.bf from https://github.com/veg/hyphy-analyses/tree/master/LabelTrees to take an alignment and a "master" tree with all the species, and write out a NEXUS file with the alignment + trimmed tree.

Best, Sergei

JuliaLopezDelgado commented 6 months ago

Dear @spond,

Thanks for clarifying! I have already been using pruned species trees that only contain the representatives present in each alignment but just wanted to check.

Best, Julia

JuliaLopezDelgado commented 6 months ago

Dear @spond,

LabelTrees hasn't labelled one of the nodes I need to test. Here is a snippet of a tree that has 20 species, although I am only interested in testing for selection on the four species shown, as well as Node1, Node 31 and Node33: image

I labelled the foreground lineage with LabelTrees as follows (gal.txt lists the four species): $hyphy $LabelTrees --tree tree.nwk --list gal.txt --output out.nwk

However, Node1 doesn't appear labelled as Foreground (as shown above) and aBSREL hasn't tested this node either. Is there any way to specify this lineage as foreground?

Many thanks, Julia

spond commented 6 months ago

Dear @JuliaLopezDelgado,

Is Node1 the root or are there nodes more ancestral to it?

HyPhy in general and LabelTree in particular will assume unrooted trees, because all of the selection models in common use are time reversible. You can override it at the labeling stage (add `ENV="ACCEPT_ROOTED_TREES=1" to the command line), but this won't make any difference at the selection screening stage, where the trees will be unrooted automatically.

Best, Sergei

JuliaLopezDelgado commented 6 months ago

Dear @spond,

There are other nodes more ancestral to Node1.

I just tried the following command but I get the same labels as in the image attached above, with Node1 not being labelled as foreground: $hyphy $LabelTrees --tree tree.nwk --list gal.txt --output out.nwk ENV="ACCEPT_ROOTED_TREES=1"

Is there anything else I could try?

Many thanks, Julia

github-actions[bot] commented 4 months ago

Stale issue message