veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org
Other
217 stars 69 forks source link

Interpretation of RELAX across many genes #1726

Closed kferger320 closed 3 weeks ago

kferger320 commented 3 months ago

Hello, I’m running RELAX v4.5 individually on a set of genes that I hypothesize to share a signal of relaxed selection in my Test group of species. I have a question about interpretation of these aggregate results. I have a K value from each run/gene, and I want to determine if the whole gene set (distribution of K values) shares a significant signal of relaxation compared to some null expectation. In particular, that the K-value distribution of my Test set differs from a distribution of K values generated from a set of species instead chosen randomly from my phylogenetic tree, across all genes.

My first thought was to run a permutation test, where I choose a randomly sampled species set of equal size as my original Test set as my ‘permutation Test set’, and perform the RELAX runs on every gene in the same way, but using this permutation Test set as the —test branches. I would then compare the mean K value of my original K distribution (from all genes) with the null distribution generated from the means of all K’s from all permutation runs. This (as expected) turns out to be computationally prohibitive with even 100 replicates, as my gene set is ~300 genes and I have ~60 Test+Reference+unlabeled species. Is there a different way I should be thinking about this null distribution, or maybe a better way to compare my test set of K values to some null?

Thank you in advance! Kailey

spond commented 3 months ago

Dear Kailey,

I haven't thought about using RELAX this way, to be honest with you. The "proper" statistical interpretation of what RELAX is doing is testing whether or not K≠1 for a given gene. It will provide an estimate of K, but, as with any estimate, there's an error associated with it. So in that sense, K, itself is not a test statistic, like the likelihood ratio test, but just a point estimate of a model parameter. I am also not 100% sure what is the exact null that is being tested by reshuffling branch labels; you'll have to run some sort of a distributional test (e.g. Wilcoxson or Kolmogorov-Smirnov) on the estimates, which are noisy.

May I ask what is the underlying hypotheses you are trying to test?

Incidentally, there is an option in RELAX which will analyze all genes jointly and estimate a single K from all genes together. So instead of having ~300 K values where some are <1, some are >1, some are significant and some are not, you get a single K ("average" for all genes), and a single p-value to test if it's different from one.

Is that something you might find useful? It's going to be slow to run, but less slow than 100 replicates x 300 genes. If so, I can provide additional instructions for how to run it.

We originally designed it to boost power of detection for smaller alignments (e.g. all genes from a viral genome).

Best, Sergei

kferger320 commented 3 months ago

Hi Sergei, Thanks for getting back to me so quickly. The underlying hypothesis of the permutation tests would be that my original Test group, chosen because they all share a particular trait, produces an overall mean K that is significantly lower than what would be expected by chance. The reshuffling of branches many times would then provide estimates of the possible values that the mean K (over all genes) might take if the comparisons were instead between my Reference group and other Test groups, chosen randomly with respect to the shared trait. I have many species in my tree that do not share this trait (I’m actually testing several other traits separately as well), so many branches in the permutation Test groups would likely not possess my target trait.

I actually have tried to run RELAX on the entire gene set at once, basically stitching them together into a single alignment to produce a single K value. I didn’t realize there might be a specific mode in RELAX to do this though, would you mind sharing what that command would look like?

Previously the whole-alignment run kept running into convergence issues and didn’t produce a reliable K value. Do you think it would be valid to ‘prune’ the problematic genes that produced convergence issues when run on their own from the dataset, and then run the test on the rest of the un-problematic genes jointly?

Thanks, Kailey

github-actions[bot] commented 1 month ago

Stale issue message