veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
201 stars 68 forks source link

Trying to interpret RELAX results #1611

Closed liamfriar closed 11 months ago

liamfriar commented 1 year ago

Hi,

I am trying to understand the distribution tables. I have taken the outputs from Spielman et al, 2017 (a tutorial). My understanding is: the main result is K=0.73 with p = 0.0014 . How do I read the three tables? I think each one is the omega distribution for a set of branches in a particular analysis, with the first being only the test branches with k allowed to vary, the second being the reference branches with k=1, and the third being the combined branch sets with k=1. So what are the three values in each table? Is this like a histogram with 3 bins, each with a max at the given omega? But if that is the case, why are there always 2x negative and 1x diversifying and how are the bin boundaries chosen? Or have I misinterpreted? I am trying to understand because I would like to know when the rejection of the null hypothesis is based on many branches in the test group or on only a small subset of the test branches. Thank you!

### Fitting the alternative model to test K != 1
* Log(L) = -14337.22, AIC-c = 28849.02 (87 estimated parameters)
* Relaxation/intensification parameter (K) = 0.73
* The following rate distribution was inferred for **test** branches
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.031 | 94.752 | |
| Negative selection | 0.086 | 2.951 | |
| Diversifying selection | 1.406 | 2.297 | |
* The following rate distribution was inferred for **reference** branches
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.009 | 94.752 | |
| Negative selection | 0.035 | 2.951 | |
| Diversifying selection | 1.591 | 2.297 | |
### Fitting the null (K := 1) model
* Log(L) = -14342.33, AIC-c = 28857.22 (86 estimated parameters)
* The following rate distribution for test/reference branches was inferred
| Selection mode | dN/dS |Proportion, %| Notes |
|-----------------------------------|---------------|-------------|-----------------------------------|
| Negative selection | 0.010 | 94.149 | |
| Negative selection | 0.021 | 3.391 | |
| Diversifying selection | 1.735 | 2.460 | |
----
## Test for relaxation (or intensification) of selection [RELAX]
Likelihood ratio test **p = 0.0014**.
>Evidence for intensification of selection among **test** branches _relative_ to the **reference** branches at P
<=0.05

Originally posted by @liamfriar in https://github.com/veg/hyphy/issues/1483#issuecomment-1570640108

spond commented 1 year ago

Dear @liamfriar,

It may be easier to load the .json output from RELAX into HyPhy vision (vision.hyphy.org/relax). You correctly interpret the main result; K<1 means that selection is relaxed on Test compared to Reference.

RELAX infers an N-bin (N=3 by default) distribution of ω under the branch site model. That means that at each branch and each site, the value of ω is drawn from this distribution independently of all other sites/branches. RELAX integrates (marginalizes) over all possible draws; and estimates the ω distribution. This is the branch-site random effects model.

Because the branches are partitioned into Test and Reference, they each get their own distribution of ω. The way RELAX is set up, each individual ω value for Test can be obtained from the corresponding value for Reference by using the expression ωTEST = ωREFERENCEK.

So, in your example, there are three ω categories. The smallest one allocates is ω = 0.009 (reference) with the corresponding probability 94.752 that any particular branch/site combination is drawn from it. Note that the matched ω ~ 0.031 (test), is in fact 0.009^K ~ 0.009^0.73

The following rate distribution was inferred for test branches

Selection mode dN/dS Proportion, % Notes
Negative selection 0.031 94.752
Negative selection 0.086 2.951
Diversifying selection 1.406 2.297

The following rate distribution was inferred for reference branches

Selection mode dN/dS Proportion, % Notes
Negative selection 0.009 94.752
Negative selection 0.035 2.951
Diversifying selection 1.591 2.297

The final reported distribution is for the null model, where Test and Reference branches are forced to share the same distribution.

The following rate distribution for test/reference branches was inferred

Selection mode dN/dS Proportion, % Notes
Negative selection 0.010 94.149
Negative selection 0.021 3.391
Diversifying selection 1.735 2.460

Usually, it will be something that "splits" the difference between the two previous distributions; depending on the relative "pull" of Background or Test, i.e., which one has more branches, etc.

Best, Sergei

liamfriar commented 1 year ago

Hi @spond, thank you for the above answer! I am running RELAX on ~1,000 genes (orthologous loci) across 62 genomes. I have defined test (15 leaves) and reference (47 leaves) groups (with a couple of internal nodes with mixed descendants unlabeled). I wrote a script to scrape from the stdout > run_id.txt two of the last lines to get the main result and p-value. For instance,

`Likelihood ratio test p = 0.6315.

No significant evidence for relaxation (or intensification) of selection among test branches relative to the reference branches at P<=0.05`.

I did this to try to automate the process since looking at each of the 1000 output .json files individually would be highly time consuming.

Question 1: If I just want a simple: "HyPhy finds no significant evidence for intensification nor relaxation for this gene with p value 0.6315", do you think this is an okay way to do that?

Also, some of the stdout > run_id.txt files have warnings such as:

### * Potential convergence issues due to flat likelihood surfaces; checking to see whether K > 1 or K < 1 is robustly inferred
### Potential for highly unreliable K inference due to multiple local maxima in the likelihood function, treat results with caution 
> Relaxation parameter reset to opposite mode of evolution from that obtained in the initial optimization.

which is up amongst the omega tables, or

Detected convergence issues (negative LRT). Refitting the alterative/null model pair from a new starting point

which is at the bottom of the file, so I don't know if any refitting actually happened?

Question 2: is there a good way to know which of these results I can use as is, and which outputs would need to be thrown out or re-run in some way?

github-actions[bot] commented 11 months ago

Stale issue message