veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
205 stars 69 forks source link

Questions oabout RELAX #1558

Closed Xiaojun928 closed 1 year ago

Xiaojun928 commented 1 year ago

Hi hyphy team,

Thanks for your fancy tools! The questions below were proposed in closed issue #1066. In case the closed issue cannot be received, I submit this new one. I am sorry for troubling you.

I ran RELAX with three rate classes for a concatenated alignment (120 orthologs of 148 strains) and got infinite omega values in Partition descriptive model. In my result, AIC-c values of RELAX partitioned descriptive or General descriptive model are higher than RELAX Alternative model. This is similar to the results in #1066 , while I am not sure if I should also try to run RELAX with more than three rate classes. How to determine the number of rate classes? Could I try 4, 5, and 6 respectively, and then choose the one showing lower AIC-c in RELAX Alternative model?

I also have two questions.

Based on the K = 0.4 and P value (0), I can conclude that the selection is relaxed. When considering the omega distribution of RELAX Alternative model, I guess I can only say the positive selection is relaxed while purifying selection is not, as the proportion or value of omega1 does not differ between test and ref branches. Is it right? In my input nexus file, the alignment is nucleotide sequences, while the phylogenetic tree was constructed based on amino acid (aa) alignment (As the tree topology based on nucleotide sequences differs a lot from aa-based species tree, and the aa-based tree is more reliable). I was wondering if the tree must be constructed based on the nucleotide alignment. I plan to run RELAX for each ortholog, and the gene tree topology may differ from that of the species tree (i.e., tree based on concatenated aa alignment) in some cases, which may affect the tree labeling (I plan to label both leaves and internal nodes). As a result, two orthologs with different tree topologies may have different internal nodes labeled. Would this affect the result? Could I use one species tree for all orthologs? Thank you very much!

Best regards, Xiaojun

spond commented 1 year ago

Dear @Xiaojun928,

For your results, the RELAX partitioned descriptive (AIC-c 17697931.5) model has a much better fit than the RELAX alternative model (AIC-c 17725726.5), because the latter AIC-c is much lower.

Based on the K = 0.4 and P value (0), I can conclude that the selection is relaxed.

That is correct (selection on test branches is relaxed relative to the reference branches). Because you have ω1 = 0 and ω2 = 1, they do not change between the classes, so the entire selection signal is indeed due to the relaxation of positive selection. However, let us also look at the better fitting `RELAX partitioned descriptive model.

Branches ω3 p3
Test >10,000 0.1%
Reference 196.8 2.8%

Here, you may notice that test branches may be subject to stronger positive selection but at fewer sites, while reference branches have a larger fraction of sites under weaker selection.

Because you are fitting a concatenated long alignment, a 3-bin distribution is likely underfit, so you could indeed try more rates (--rates N) and select the model with the best AIC-c. However, as you add rate classes, there will still be only one allocated to positive selection (ω > 1). For example, when using data from the RELAX paper

Default (3 rates)

hyphy relax --alignment tests/data/Fig4E.nex --models Minimal --test T --starting-points 5

...

### Fitting the alternative model to test K != 1
* Log(L) = -4982.88, AIC-c = 10129.19 (81 estimated parameters)
* Relaxation/intensification parameter (K) =     0.00
* The following rate distribution was inferred for **test** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.000     |   35.192    |                                   |
|        Negative selection         |     1.000     |   64.578    |                                   |
|      Diversifying selection       |     1.000     |    0.230    |       Collapsed rate class        |

* The following rate distribution was inferred for **reference** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.000     |   35.192    |                                   |
|        Negative selection         |     0.163     |   64.578    |                                   |
|      Diversifying selection       |     1.894     |    0.230    |                                   |

+1 (4 rates), no additional benefit (worse AIC and collapsed rate classes)

hyphy relax --alignment tests/data/Fig4E.nex --models Minimal --test T --starting-points 5 --rates 4

...

### Fitting the alternative model to test K != 1
* Log(L) = -4982.72, AIC-c = 10132.93 (83 estimated parameters)
* Relaxation/intensification parameter (K) =     0.00
* The following rate distribution was inferred for **test** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.000     |   34.866    |                                   |
|        Negative selection         |     1.000     |   65.093    |                                   |
|        Negative selection         |     1.000     |    0.021    |       Collapsed rate class        |
|      Diversifying selection       |     1.000     |    0.020    |       Collapsed rate class        |

* The following rate distribution was inferred for **reference** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.000     |   34.866    |                                   |
|        Negative selection         |     0.165     |   65.093    |                                   |
|        Negative selection         |     0.167     |    0.021    |       Collapsed rate class        |
|      Diversifying selection       |    29.733     |    0.020    |                                   |

-1 (2 rates), no additional benefit (essentially the same as 3 rate classes, but different K because; same overall result of strong relaxation however)

hyphy relax --alignment tests/data/Fig4E.nex --models Minimal --test T --starting-points 5 --rates 2

...

### Fitting the alternative model to test K != 1
* Log(L) = -4984.88, AIC-c = 10129.11 (79 estimated parameters)
* Relaxation/intensification parameter (K) =     0.17
* The following rate distribution was inferred for **test** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.670     |   98.392    |                                   |
|         Neutral evolution         |     1.000     |    1.608    |                                   |

* The following rate distribution was inferred for **reference** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.095     |   98.392    |                                   |
|         Neutral evolution         |     1.000     |    1.608    |                                   |

Regarding species vs gene tree, my recommendation is to go with gene trees (even though this may complicate labeling). There is little a priori reason to believe that all genes (orthologs) will follow the species tree exactly. It is also better to run RELAX on individual genes, because the selection regime over multiple concatenated genes is going to be complex and not well described by gene-level models (like RELAX).

Best, Sergei

Xiaojun928 commented 1 year ago

Dear Sergei,

I will run RELAX gene by gene. Many thanks for your support and time!

Best, Xiaojun

github-actions[bot] commented 1 year ago

Stale issue message