RELAX on Datamonkey producing negative LR values

wesleykg commented 6 years ago

Hi Sergei,

I've run RELAX on the Datamonkey server about 20 times on a specific gene and I keep getting negative LR values. I understand this is because the alternative model didn't converge on the MLE (or something along those lines). Is there any way to get around this issue, or should I just keep running the test until I get a positive LR value? I've attached the alignment below, along with an example results link from Datamonkey.

http://test.datamonkey.org/relax/5ac3b9f133197001c7307446 rps12.txt

Thanks,

Wesley

spond commented 6 years ago

Dear @wesleykg,

I've examined your data in detail and can confirm that there is a convergence issue, but I think it is due to multimodality (the likelihood function has several local maxima). I think part of the issue is that you are using a single branch as the TEST set, which makes estimation noisier. Also, the ω estimates for this single branch are a point mass at 0 (i.e. no nonsynonymous substitutions). This creates additional issues, because the distribution is pathological. The end result, however, is that there is no significant evidence for relaxation. For example, fitting the NULL before the ALTERNATIVE model results in a positive LR (because of a different starting point).

### Fitting the null (K := 1) model
* Log(L) =  -967.96, AIC-c =  2111.44 (86 estimated parameters)
* The following rate distribution for test/reference branches was inferred

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.149     |    0.088    |                                   |
|        Negative selection         |     0.150     |   99.499    |       Collapsed rate class        |
|      Diversifying selection       |    852.488    |    0.413    |                                   |

----
### Refitting the alternative (K != 1) model
* Log(L) =  -966.38, AIC-c =  2110.36 (87 estimated parameters)
* Relaxation/intensification parameter (K) =     1.28
* The following rate distribution was inferred for **test** branches

|          Selection mode           |     dN/dS     |Proportion, %|               Notes               |
|-----------------------------------|---------------|-------------|-----------------------------------|
|        Negative selection         |     0.092     |    0.020    |                                   |
|        Negative selection         |     0.093     |   99.580    |       Collapsed rate class        |
|      Diversifying selection       |   9989.283    |    0.399    |                                   |
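For readers who want to check the numbers, the LR implied by the two fits above can be computed by hand. This is a minimal sketch, not HyPhy code: it assumes the usual nested-model test in which the LR is compared to a chi-squared distribution with 1 degree of freedom (the single extra parameter, K), and the variable names are purely illustrative.

```python
import math

# Log-likelihoods copied from the output above.
logl_null = -967.96  # null model, K := 1 (86 estimated parameters)
logl_alt = -966.38   # alternative model, K free (87 estimated parameters)

# Likelihood ratio statistic for nested models.
# It is positive whenever the alternative fits at least as well as the null.
lr = 2.0 * (logl_alt - logl_null)

# Assumption: LR is referred to a chi-squared distribution with 1 df.
# For 1 df the upper-tail probability has the closed form erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(lr / 2.0))

print(f"LR = {lr:.2f}, p = {p_value:.3f}")
```

Under that assumption the LR is 3.16 and the p-value falls between 0.07 and 0.08, which is consistent with the statement that the result is not significant at the conventional 0.05 level.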

Generally, it is not recommended to select a single branch as the test or the reference set in RELAX analyses, because similar pathologies are likely to recur.
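To illustrate why the starting point matters when a likelihood surface has several local maxima, here is a toy sketch. The function `toy_loglik` is a made-up one-dimensional surface with two peaks; it is not the RELAX likelihood, and the simple gradient ascent used here is not how HyPhy optimizes. It only shows that the same optimizer can stop at different maxima depending on where it starts.

```python
def toy_loglik(x):
    """Hypothetical bimodal 'log-likelihood' (not the RELAX likelihood)."""
    return -(x * x - 1.0) ** 2 + 0.3 * x


def toy_gradient(x):
    """Derivative of toy_loglik with respect to x."""
    return -4.0 * x * (x * x - 1.0) + 0.3


def climb(start, step=0.01, iterations=5000):
    """Plain fixed-step gradient ascent from a given starting point."""
    x = start
    for _ in range(iterations):
        x += step * toy_gradient(x)
    return x, toy_loglik(x)


# Two starting points, one on each side of the valley between the peaks.
starts = [-1.5, 1.5]
fits = [climb(s) for s in starts]

# Keep whichever run reached the highest value.
best_x, best_loglik = max(fits, key=lambda fit: fit[1])

for s, (x, ll) in zip(starts, fits):
    print(f"start {s:+.1f} -> x = {x:+.3f}, toy log-likelihood = {ll:+.3f}")
print(f"best of all starts: x = {best_x:+.3f}")
```

The run started at -1.5 settles on the lower peak near x = -0.96, while the run started at +1.5 reaches the higher peak near x = +1.04. If an alternative model gets stuck on a lower peak in this way while the null model does not, the computed LR can come out negative even though the models are nested.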

Best, Sergei

wesleykg commented 6 years ago

Thanks for the detailed and quick response! I was not aware I should be using more than one branch. I study parasitic plants, in which parasitism has usually evolved only once or twice per clade, so I'm limited to using only one or maybe two branches for my test set. Is there a recommended number of branches I should be using for RELAX?

spond commented 6 years ago

Dear @wesleykg,

As with every statistical test, you can run it on any data; you just have to be aware of the limitations. For example, with a single-branch "test" set, the effect of relaxation (or intensification) has to be large for the test to be significant: say, your "test" branch is evolving very close to neutrally while the rest of the tree is under strong (typically purifying) selection. Small test (or reference) sets also create estimation instability, as your example has shown.

So, no, there is no recommended minimum number of branches, just a warning to proceed with caution and be aware of the limitations.

Best, Sergei

veg / hyphy

RELAX on Datamonkey producing negative LR values #779