veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
200 stars 68 forks source link

When does RELAX collapse rate classes? #1708

Closed NatJWalker-Hale closed 1 month ago

NatJWalker-Hale commented 1 month ago

Hi guys,

As always, thanks so much for developing and supporting such useful methods. I just wanted to ask when and how RELAX chooses to collapse rate classes? From RELAX.bf I can see if (relax.inferred_ge_distribution[0][1] < 1e-5 || relax.inferred_ge_distribution[1][1] < 1e-5) - is this checking the inferred dN/dS of a rate class, or the inferred proportion of sites?

Thanks very much for the help!

Nat

spond commented 1 month ago

Dear @NatJWalker-Hale,

RELAX checks the proportion of sites in ω categories 1 or 2 but just for the general exploratory model. Are you seeing this happen in your data?

Best, Sergei

NatJWalker-Hale commented 1 month ago

Dear @spond,

Thanks very much for the reply! I think that I'm understanding the output a little bit better now. I also ought to have mentioned that this is v2.5.60. For my example, the stdout contains:

### Fitting the general descriptive (separate k per branch) model

### * Log(L) = -6522.20, AIC-c = 13138.83 (47 estimated parameters)
* The following baseline rate distribution for branch-site combinations was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.000     |    4.060    |                                   |
|        Negative selection         |     0.000     |   93.895    |                                   |
|      Diversifying selection       |    48.917     |    2.044    |                                   |

and then

### Fitting the alternative model to test K != 1
* Log(L) = -6527.00, AIC-c = 13132.30 (39 estimated parameters)
* Relaxation/intensification parameter (K) =     0.64
* The following rate distribution was inferred for **test** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.572     |   79.236    |                                   |
|         Neutral evolution         |     1.000     |   20.764    |                                   |
|      Diversifying selection       |     1.007     |    0.000    |       Not supported by data       |

* The following rate distribution was inferred for **reference** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.418     |   79.236    |                                   |
|         Neutral evolution         |     1.000     |   20.764    |                                   |
|      Diversifying selection       |     1.011     |    0.000    |       Not supported by data       |

followed by

### Fitting the null (K := 1) model
* Log(L) = -6527.12, AIC-c = 13130.52 (38 estimated parameters)
* The following rate distribution for test/reference branches was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.457     |   77.869    |                                   |
|        Negative selection         |     0.878     |   22.131    |                                   |
|      Diversifying selection       |     1.014     |    0.000    |       Not supported by data       |

(so no significant evidence)

But just to be clear, even though the proportion is 0, the alternative and null are still fitting that site class, correct?

The Not supported by data message presumably appears when any proportion is estimated as 0. Is this also the case for the output JSON? For this example, none of the model fits have p2 or omega2 in the output.

On the other hand, I have just noticed that some of the sites in the alignment were empty (due to a pruning error), and now that I've removed those sites and re-run, the output JSON does have p2 and omega2 for each model (with p2 still 0). So I'm not sure what is going on there, but sanitising the input seems to have fixed it anyway.

Thanks again!

Nat

spond commented 1 month ago

Dear @NatJWalker-Hale,

The intended behavior for Not supported by data screen output is as you surmised: the specified number of rates are always fitted, but if one (or more) of them have no support (very low weight) or essentially the same dN/dS values, the code will simply flag them as such for used information.

One exception is that if you fit All models (including the general exploratory), then the code can kick down the number of rates if it finds lack of support for the user-specified number of rates https://github.com/veg/hyphy/blob/c668448e3ec8150389df48d9ab703c424fe628c9/res/TemplateBatchFiles/SelectionAnalyses/RELAX.bf#L621

This will be indicated by the following echo to stdout

### Because some of the rate classes were collapsed to 0, the model is likely overparameterized. RELAX will reduce the number of site rate classes by one and repeat the fit now.
----

My guess is that's what happened when you had no corresponding rates in the JSON file (otherwise they should have been there, just with near 0 weights).

Best, Sergei

NatJWalker-Hale commented 1 month ago

Dear @spond,

Okay, that's good to know! Thanks very much for clarifying.

Cheers!

Nat