veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

Misclassification of relaxed genes as positive (intensified) genes #730

Closed: melop closed this issue 5 years ago

melop commented 6 years ago

Hello,

I simulated about 11263 genes under relaxation with the Evolver program in PAML, using a phylogeny that I am studying, and ran RELAX on these simulated genes. In particular, I did not vary omega>1 and omega=1, but increased omega0 in the foreground.

At a p-value cutoff of 0.05, RELAX found 5752 genes with k<1 (true positive rate = 51%). However, I noticed that 1657 simulated genes showed p < 0.05 and k>1. A closer inspection showed that the majority of these genes (1492) have parameter estimates of omega0 = 0 for both the foreground and the background clades. Therefore, RELAX has misinterpreted 16.57% of the relaxed genes as positive (intensified) selection.

In my real data, I found that about half the genes with k>1 have parameter estimates of omega0 = 0. So I'm not sure if they should indeed be considered relaxed.

I am wondering if this is caused by the search constraints on omega0. What happens if you don't allow it to be 0?

I am attaching 6 example simulated datasets for you to look at.

sims.tar.gz

Best Regards, Ray

spond commented 6 years ago

Dear @melop,

I am not 100% sure what your simulation scenario was (it would be helpful to have a precise specification), but from what I can understand, you are increasing one ω (which is in [0,1]) on a set of branches and maintaining the other two values. Are you also using the branch-site parameterization of PAML models?

  1. The data are simulated under a model different from what RELAX assumes (RELAX scales all ω values on the test branches, whereas you changed only one). A more appropriate model is what we call the partitioned exploratory model in the paper; it allows the foreground and background partitions to have more flexible ω distributions (see below). In other words, one might expect the RELAX test to have higher than expected rates of false positives/negatives if the model is misspecified. This effect could be exacerbated if the site-to-site rate variation model is also not the same as what RELAX assumes.

  2. An estimate of ω = 0 (when you simulated with ω > 0) might indicate that the dataset is too small (e.g. branch lengths are too short) to reliably estimate low ω values. You correctly point out that ω = 0 is a special value, because raising it to any power k doesn't move the estimate, so it has no effect on the model fit (see the sketch after this list).

  3. If you use the newer version of RELAX (available with the current release of HyPhy, versions 2.3.x), you may see better performance, because we improved convergence behavior. Mixture models sometimes have a hard time converging in the ML framework, so this could also be an issue.
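To make point 2 concrete, here is a minimal Python sketch of the branch-level transform RELAX uses, in which each test-branch ω is the corresponding reference ω raised to the power k (Wertheim et al. 2015); the numeric values are illustrative only:

def test_omega(omega_ref, k):
    # RELAX: omega on test branches = (reference omega) ** k
    return omega_ref ** k

for k in (0.2, 1.0, 5.0):
    # k < 1 pulls omegas toward 1 (relaxation);
    # k > 1 pushes them away from 1 (intensification)
    print(k, test_omega(0.3, k), test_omega(0.0, k))

# 0.0 ** k == 0.0 for every k > 0, so when the reference omega is 0,
# the likelihood is flat in k for that component and k is unidentifiable.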

Indeed, when I ran one of your problematic examples (5033057) through the updated version of RELAX, I obtained a result of relaxed selection (which, as I understand it, is what you simulated). See below:

Improving branch lengths, nucleotide substitution biases, and global dN/dS ratios under a full codon model

Fitting the general descriptive (separate k per branch) model

Selection mode           dN/dS   Proportion, %   Notes
Negative selection       0.000   90.056
Neutral evolution        1.000    0.000          Not supported by data
Diversifying selection   1.470    9.944

Fitting the alternative model to test K != 1

Selection mode           dN/dS   Proportion, %   Notes
Negative selection       0.269    0.386
Negative selection       0.315   90.041
Diversifying selection   1.217    9.573

Selection mode           dN/dS   Proportion, %   Notes
Negative selection       0.018    0.386
Negative selection       0.029   90.041
Diversifying selection   1.827    9.573

Fitting the null (K := 1) model

Selection mode           dN/dS   Proportion, %   Notes
Negative selection       0.034   89.291
Negative selection       0.384    0.422
Diversifying selection   1.846   10.287

Test for relaxation (or intensification) of selection [RELAX]

Likelihood ratio test p = 0.0000.

Evidence for relaxation of selection among test branches relative to the reference branches at P<=0.05

Fitting the partitioned descriptive model (separate distributions for test and reference branches)

Selection mode           dN/dS   Proportion, %   Notes
Negative selection       0.352     0.000         Not supported by data
Negative selection       0.397   100.000
Diversifying selection   4.149     0.000         Not supported by data

Selection mode           dN/dS   Proportion, %   Notes
Negative selection       0.084   89.510
Negative selection       0.457    8.326
Diversifying selection   4.149    2.164

melop commented 6 years ago

Thank you for the very helpful reply.

You are right: I was simulating 3 omega categories, as RELAX assumes, with 0 < omega0 < 1, omega1 <= 1, and omega2 > 1. To simulate relaxed purifying selection, I increased omega0 in the foreground branches and left omega1 and omega2 intact.

Indeed, the version I am using is 2.2.5. The newer version sounds promising; I will try it out on my simulated datasets and report back with the results.

Best Regards, Ray

melop commented 6 years ago

Dear Sergei,

I downloaded the latest release and compiled it with gcc 6.

There seem to be some changes in the scripts folder; RELAX.bf is now in the SelectionAnalyses folder.

When I execute a script, it now throws the following error, which seems to be a problem with how script paths are interpreted. The all-terms.bf file should be in HYPHY_INSTALL/res/TemplateBatchFiles/libv3, but the program is instead looking for it under the SelectionAnalyses folder.

Error:Could not read command file in ExecuteAFile. Original path: 'libv3/all-terms.bf'. Expanded path: 'HYPHY_INSTALL/res/TemplateBatchFiles/SelectionAnalyses/libv3/all-terms.bf'

Is there a way to bypass this?

Here is the content of the script I wrote:

arrOpts = {};
arrOpts["00"] = "Universal";
arrOpts["01"] = "WORKING_DIR/in.nex";
arrOpts["02"] = "Y";
arrOpts["03"] = "T";
arrOpts["04"] = "R";
arrOpts["05"] = "Minimal";
ExecuteAFile("HYPHY_INSTALL/res/TemplateBatchFiles/SelectionAnalyses/RELAX.bf", arrOpts);

Ray

melop commented 6 years ago

Never mind; after installing it to a separate folder, it works fine now.

My other question: what are the differences between RELAX.bf, RELAX-Groups.bf, and RELAX-SRV.bf?

I am assuming that I should use RELAX.bf?

spond commented 6 years ago

Dear @melop,

Yes, please use RELAX.bf. The other two are somewhat experimental. RELAX-SRV.bf implements a version of RELAX that allows more than two sets of branches and adds synonymous rate variation; we used it in this paper. RELAX-Groups.bf implements a relaxation test between two different genes (instead of between sets of branches on the same tree); we used it in this paper to compare evolution between different overlapping reading frames in a gene.

Also, in case you haven't seen it, @sjspielman has been maintaining a great documentation resource on these analyses at http://hyphy.org/methods/selection-methods/ and http://hyphy.org/tutorials/current-release-tutorial/

Best, Sergei

melop commented 6 years ago

Dear Sergei,

I have now run two tests on the latest release, v2.3.8.

This dataset was simulated with the following set-up of omega values:

class         omega<1     omega=1    omega>1
foreground    0.2108875   1          1.6600949911812
background    0           1          1.6600949911812
proportions   0.830518    0.081762   0.087720

I have attached here the input files for Evolver in PAML.

evolver_files.tar.gz

First, I ran the same dataset 13 times to check whether RELAX always converges on the same parameter estimates. Out of 13 runs, 4 inferred k<1 with parameter estimates close to the simulated ones, and indeed the alternative model has a better likelihood score (LRT = 82). However, 9 out of 13 times it converged on a suboptimal solution with k>1 and a worse score (LRT = 72). I have attached 4 independent runs from each case; hopefully they will help identify the differences between these runs. repeated_test.tar.gz
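As an aside, the p-values RELAX reports for these LRT scores can be reproduced with a chi-squared distribution with one degree of freedom, since the null model imposes the single constraint k := 1. A minimal Python sketch:

from scipy.stats import chi2

# RELAX compares the k != 1 model against the k := 1 null (one constraint),
# so the LRT statistic is referred to a chi-squared distribution with df = 1
for lrt in (82.0, 72.0):
    print(lrt, chi2.sf(lrt, df=1))  # both give vanishingly small p-values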

Next, I redid the whole simulation on 11119 simulated alignments (relaxed purifying selection) and compared the results with version 2.2.

Description                               Count
Total                                     11119
p < 0.05 in v2.2                           7292
p < 0.05 in v2.3                           6887
p < 0.05 in both versions                  6726
p < 0.05, k<1 in v2.2                      5160
p < 0.05, k>1 in v2.2                      1566
p < 0.05, k<1 in v2.3                      4205
p < 0.05, k>1 in v2.3                      2521
p < 0.05 and k<1 in both versions          3594
p < 0.05 and k>1 in both versions           955
p < 0.05, k<1 in v2.2 but k>1 in v2.3      1566
p < 0.05, k>1 in v2.2 but k<1 in v2.3       611

It looks like for these datasets, version 2.2 has fewer misclassifications and higher power than v2.3.8.

[figure: relax_version_compare]
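For reproducibility, a cross-tabulation like the one above can be computed from per-gene (p, k) results along these lines; this is only a sketch, and the gene IDs, values, and input format are hypothetical rather than RELAX's actual output format:

from collections import Counter

# Hypothetical per-gene results, {gene_id: (p_value, k)}, one dict per version
v22 = {"g1": (0.001, 0.4), "g2": (0.02, 1.8), "g3": (0.30, 0.9)}
v23 = {"g1": (0.002, 0.5), "g2": (0.01, 2.1), "g3": (0.04, 0.7)}

def classify(p, k, alpha=0.05):
    # non-significant, significant with k < 1 (relaxed),
    # or significant with k > 1 (intensified)
    if p >= alpha:
        return "ns"
    return "k<1" if k < 1 else "k>1"

counts = Counter((classify(*v22[g]), classify(*v23[g])) for g in v22)
for (c22, c23), n in sorted(counts.items()):
    print(f"v2.2: {c22:4}  v2.3: {c23:4}  n = {n}")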

Any suggestions would be greatly appreciated.

My naive idea: perhaps place some constraint on the omega<1 category, so that the foreground and background estimates cannot both approach 0?

Ray

melop commented 6 years ago

Another strange thing: I ran the same alignment on one computer 13 times, and the results were always slightly different, with some variation in the digits. But when I did the same in a cluster environment from a script (the script calls the hyphy program in a loop), the results were exactly the same. I am quite confused. I suspect that on one system the random number seed is set differently in each run, while the other always sets the same seed.

Does RELAX use any sort of random number generator? If so, is it possible to set a random seed manually?

Thanks!

Ray

spond commented 6 years ago

Dear Ray,

If your desktop has multiple cores and HyPhy runs in multithreaded mode, then you can expect some random (low-level) variation in results due to the stochasticity of execution order and numerical error. Floating-point addition is not associative: (a+b)+c does not necessarily equal a+(b+c), so the order in which threads combine partial results matters. Each individual difference is minor, but over many millions of calculations, and over the course of optimization, they compound. If you run HyPhy with the CPU=1 command-line flag, you should see identical results.
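A minimal Python illustration of this non-associativity (the values are chosen to make the effect obvious):

# 1.0 is absorbed by the large magnitude of b before a can cancel it,
# so the two groupings of the same three terms give different answers
a, b, c = 1e16, -1e16, 1.0
print((a + b) + c)  # 1.0
print(a + (b + c))  # 0.0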

Best, Sergei

melop commented 6 years ago

Dear Sergei,

If no CPU flag is given, does HyPhy then use all available CPUs? The server node also has multiple CPUs.

In my previous post, you can see that all 13 runs produced different results, and I did not specify the number of CPUs on the command line.

Best Regards, Ray

spond commented 6 years ago

Dear Ray,

Yes, the default is to use all available CPUs. Generally, you will only have weird convergence issues in numerically unstable problems. I suspect that your simulation is one of those (the likelihood surface is really flat).

Also: can you confirm that, if you fit your simulations with the PAML branch-site model (which is what you simulated under), you are able to differentiate ω0 = 0 from ω0 = 0.21? This will point to whether or not model misspecification is to blame.

Best, Sergei

melop commented 6 years ago

Dear Sergei,

Indeed, I also ran all of these alignments, including the one I showed you, through the branch-site test in codeml. Codeml makes very few errors, only 2.64% at p < 0.05; the relaxed genes are otherwise not called as positive selection.

For example, the one I showed you had codeml p = 1, with log-likelihoods of -65828.284747 (null) and -65828.284720 (alternative).

Ray

spond commented 6 years ago

Dear @melop,

Technically, CodeML should call positive selection because you have a proportion of sites with ω > 1 on the foreground, right? Perhaps you could clarify which test you were running?

I am more interested in whether or not it is able to estimate ω0 > 0 on some branches (although I suppose the PAML branch-site model forces the same ω < 1 on the entire tree, if I recall correctly).

Best, Sergei

melop commented 6 years ago

Dear Sergei,

The branch-site test in codeml checks whether there is a difference between the foreground and background branches; if the level of positive selection is kept identical, it is not expected to come out significant. Note that this is not the test for diversifying selection (M8a vs M8, etc.), but the MA (model A) test. You can see the model on page 32 of the PAML manual: http://abacus.gene.ucl.ac.uk/software/pamlDOC.pdf
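For reference, model A assigns sites to four classes; this is the standard parameterization from Table 3 of the PAML manual (ω0 and ω2 are estimated, and the class 2a/2b proportions are derived from p0 and p1):

Site class   Proportion             Background ω   Foreground ω
0            p0                     0 < ω0 < 1     ω0
1            p1                     1              1
2a           (1-p0-p1)p0/(p0+p1)    ω0             ω2 >= 1
2b           (1-p0-p1)p1/(p0+p1)    1              ω2 >= 1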

I am still not seeing any randomness after setting CPU=2 in the server environment. This is getting very strange...

Ray

spond commented 6 years ago

Dear @melop,

I am still very confused; the model on page 32 (Model A, Table 3) is the standard branch-site model, which does not compare foreground to background, but rather tests whether or not there is evidence that ω2 > 1 on the foreground branches:

"The null model fixes ω2 = 1. The likelihood ratio test has df = 1 (see text)."

In fact, in model A (null or alternative) one cannot accommodate ω > 1 anywhere on background branches.

I am sure you can fashion a clade-type model (Table 4 on page 32) to do the comparative test between clades, but that's not what model A does.

Can you please confirm that we are talking about the same models?

Re: CPU=2. If the load on the server is too high, HyPhy can decide to use a single processor. It runs a very crude and fast benchmark before each likelihood function is optimized. Look in messages.log for lines like:

Auto-benchmarked an optimal number (5) of threads.
Auto-benchmarked an optimal number (2) of threads.
Auto-benchmarked an optimal number (7) of threads.
Auto-benchmarked an optimal number (2) of threads.
Auto-benchmarked an optimal number (8) of threads.
Auto-benchmarked an optimal number (3) of threads.
Auto-benchmarked an optimal number (2) of threads.
Auto-benchmarked an optimal number (2) of threads.
Auto-benchmarked an optimal number (9) of threads.
Auto-benchmarked an optimal number (8) of threads.

Best, Sergei

melop commented 6 years ago

Dear Sergei,

Indeed, in their parameterization the background branches are not allowed to have omega>1. I simulated data with only 3 categories of omegas, where 1 < OmegaBackground < OmegaForeground, omega0 < 1, and omega1 = 1. Test 2 under the MA model has ~35% power to detect these cases. The model used in the simulation is different from the MA model, so I guess the parameter estimates are not really meaningful; but as a test for stronger positive selection in the foreground it sort of works, although the power is somewhat low.

In my simulations, Test 2 under the MA model has low false positive rates for intensified purifying selection and for relaxed purifying selection in the foreground clades, usually much lower than the nominal p-value.

In my simulations, if omega2 in the foreground and background are equal, Test 2 under the MA model does not call them significant. This suggests that Test 2 indeed tests for a difference in the level of positive selection between the background and the foreground, rather than for whether omega2 > 1 in the foreground.

After setting CPU=4, the results start to show randomness.

Ray

spond commented 6 years ago

Dear @melop,

I am still investigating the source of convergence issues. It seems like the likelihood surface is quite rugged, but I am still checking.

As far as MA goes, your simulations seem to use the same ω3 in both classes of branches, so there is no increase in positive selection there. Tests based on MA (unless I fundamentally misunderstand what we are talking about) decidedly do not check for the difference in selection between FG and BG. Any such detection would be accidental and/or arise out of model misspecification.

> In my simulations, if omega2 in the foreground and background are equal, Test 2 under the MA model does not call them significant. This suggests that Test 2 indeed tests for a difference in the level of positive selection between the background and the foreground, rather than for whether omega2 > 1 in the foreground.

I really don't follow this. The test is specified explicitly, i.e. there is a constraint that is being checked. The standard constraint is ω2 = 1 on foreground branches (vs ω2 > 1).

Best, Sergei

melop commented 6 years ago

Dear Sergei,

Thanks a lot for looking into this.

You're definitely right that CodeML does not model omega2 > 1 on the background branches, so the parameter estimate for the background omega2 will be ≤ 1. But it has power to detect the scenario where omega2(foreground) > omega2(background) > 1. I guess all models are wrong, because it's impossible to capture every aspect of biological reality, but a model is still useful if it captures some important aspect of it. In this case, my guess would be that the parameterization chosen by CodeML helps with convergence or other things. I am going to a meeting in April, so perhaps I will get a chance to talk to Ziheng Yang about the parameterization choice.

Ray


spond commented 6 years ago

Dear @melop,

You will have much more power if you modify the model to test what you want it to test under the assumptions you make in the simulations.

Keep in mind that the evolver code also makes specific assumptions about how the equilibrium frequencies are parameterized, how they are included in the rate matrix (GY vs MG), etc. All of these play some role in statistical performance on simulated data.

None of this, however, has any bearing on the pathologies you have seen in repeated runs of RELAX. I'll post more when I figure it out.

Best, Sergei

sjspielman commented 6 years ago

Hi everyone,

[Caveat: I have not been following this thread closely.] I will also note that the evolver simulation code doesn't quite simulate dN/dS correctly, specifically in cases of heterogeneous models where different omega categories are specified for a given partition. In general, code from the PAML universe does not properly normalize rate matrices; as a consequence, different omega categories along the same branch will effectively be simulated with different branch-length parameters. This may be a confounding factor to be aware of when comparing HyPhy to PAML.

Best, Stephanie

melop commented 6 years ago

Dear Stephanie,

Being somewhat ignorant about the Evolver code, I'm not quite following your comment on branch lengths. The meaning of omega is dN/dS, so for a given category of codons, shouldn't the ratio between the branch lengths of the N sites and the S sites be equal to omega? Or are you pointing to something else, like the substitution matrix?

Do you know of a better simulator?

Ray


sjspielman commented 6 years ago

Hi Ray,

It has to do with the fact that several rate matrices are used in a heterogeneous simulation (one per dN/dS category). Matrices are generally normalized so that a unit branch length corresponds to a mean substitution rate of 1. PAML-universe code normalizes each matrix separately, rather than with a shared normalization factor. Thus, while the dN/dS ratios will end up being correct, the rates from different categories will be scaled differently over the same branch length, such that sites with higher dN/dS will experience more overall substitutions than sites with lower dN/dS (I might have this backwards, as it is Friday evening, but the general point stands). In other words, the dN/dS ratios are preserved, but time passes at a different rate for each dN/dS category.
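A toy Python sketch of the distinction (illustrative numbers only; this is not actual Evolver or HyPhy code, and the nonsynonymous fraction is a made-up value):

# Reduce a codon model to one number per omega category: the mean
# substitution rate before normalization.
F_NONSYN = 0.7  # hypothetical fraction of substitutions that are nonsynonymous

def mean_rate(omega):
    # synonymous changes at rate 1, nonsynonymous at rate omega
    return (1 - F_NONSYN) * 1.0 + F_NONSYN * omega

omegas = [0.1, 1.0, 2.0]  # one rate matrix per category
props = [0.5, 0.3, 0.2]   # category proportions

# Per-category normalization (as described above for PAML-universe code):
# each matrix is rescaled by its own mean rate, so every category
# accumulates the same expected number of substitutions per unit branch.
per_category = [mean_rate(w) / mean_rate(w) for w in omegas]  # [1.0, 1.0, 1.0]

# Shared normalization: a single factor for the whole mixture, so
# higher-omega categories evolve faster over the same branch length.
shared_factor = sum(p * mean_rate(w) for p, w in zip(props, omegas))
shared = [mean_rate(w) / shared_factor for w in omegas]

print(per_category)  # every category forced to the same speed
print(shared)        # relative speeds preserved across categories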

Either way, I don't think this would cause the issues you're experiencing, but in theory it is possible.

For other simulators, you might try the native HyPhy simulation machinery (@spond can help guide you in this direction), or one of the packages pyvolve (full disclosure: I wrote this package) or phylosim. These two I know normalize properly; I don't know for certain about other simulators. They are written in Python and R, respectively, if you like either of those languages.

Best, Stephanie