veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

Using MEME for hundreds of genes: method to correct for potentially thousands of multiple comparisons? #188

Closed: jychoilab closed this issue 9 years ago

jychoilab commented 10 years ago

Hello HyPhy community,

I feel like I'm posting a question every other week now; I'm embarrassed to do so and apologize for my lack of knowledge!

I have a question about multiple comparisons for MEME (and potentially other site-level methods such as FEL or FUBAR). I found this link, in case other people are also thinking about the same issue:

http://www.hyphy.org/cgi-bin/hyphy_forums/YaBB.pl?num=1390912207/1

and I just wanted to clarify the suggestion, and at the same time ask whether the method I was planning to use would make sense as well. According to the suggestions in the link, say I have 300 genes analyzed with MEME, with on average 300 sites (codons) per gene. This gives a total of 300 × 300 = 90,000 LRT statistics from MEME. From here, it makes sense to me to do an FDR correction on the pooled 90,000 p-values, which would be ultra-conservative but perhaps the most appropriate thing to do. Any site with q-value < 0.05 would then be significant. Would this be correct?
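For concreteness, here is a minimal sketch of what I mean by the pooled-FDR approach (the 300-gene × 300-site setup is from my example above; the p-values are simulated placeholders, since the real ones would come from the MEME output):

```python
import numpy as np

def bh_qvalues(pvals):
    """Benjamini-Hochberg q-values for a 1-D array of p-values."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)                        # sort p-values ascending
    scaled = p[order] * m / np.arange(1, m + 1)  # p_(i) * m / i
    # enforce monotonicity: running minimum from the largest rank down
    q = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(m)
    out[order] = np.clip(q, 0.0, 1.0)
    return out

# pooled p-values from 300 genes x ~300 sites each (simulated stand-ins)
rng = np.random.default_rng(0)
pooled = rng.uniform(size=300 * 300)
q = bh_qvalues(pooled)
print("sites with q < 0.05:", int((q < 0.05).sum()))
```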

On the other hand, what I was thinking was the following. The study I'm conducting compares the molecular evolution of 3 gene families (let's say gene families A, B, and C), each with around 100 genes. My plan was to run the MEME analysis and then compare the LRT statistics across gene families. So if gene family A as a whole were evolving faster than B or C, I would expect many more high LRT statistics in gene family A compared to B or C. Thus, some sort of Kruskal-Wallis test comparing the distributions of LRT statistics of A, B, and C would let me answer whether there is significantly more episodic selection in a specific gene family. Of course, this method comes at the price of sacrificing the identification of individual sites under selection. Is this a valid approach, though, and are there any caveats?
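A rough sketch of what I had in mind (the LRT arrays below are simulated placeholders; in practice they would be the per-site MEME LRTs pooled within each family):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# hypothetical per-site LRT statistics pooled within each gene family
lrt_A = rng.chisquare(df=1, size=3000) * 1.5  # family A made "hotter" on purpose
lrt_B = rng.chisquare(df=1, size=3000)
lrt_C = rng.chisquare(df=1, size=3000)

H, p = stats.kruskal(lrt_A, lrt_B, lrt_C)
print(f"Kruskal-Wallis H = {H:.2f}, p = {p:.3g}")
```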

Many thanks again for any suggestions.

kscheffler commented 10 years ago

Dear CJY,

Thanks for the question - these are important issues that ought to get more publicity. For a good discussion of the thinking behind FDR in this context, see this blog post by Nicolas Lartillot: http://bayesiancook.blogspot.com/2014/04/fdr-and-single-gene-experiment.html

To add to what Nicolas said, it makes sense to think about what one wants when reporting a list of sites:

Do you want to be reasonably confident that the list contains absolutely no false positives? In that case, a traditional multiple test correction controlling the family-wise error rate is the right thing to do, and will often lead you to report an empty list. This is as it should be: if you really think a single false positive invalidates the entire list, you need to err on the side of caution and in many cases an empty list will be the only way to do this. Success!

On the other hand, it seems very unlikely that this is what most people want. Typically, we all accept that biology is noisy and that false positives will occur - if we are willing to tolerate a list of sites that contains false positives (provided we know roughly how many items on the list are likely to be false positives, even though we don't know which ones they are), we can produce a much more informative list. This is what FDR is for.

Now, we could go with a q-value threshold of 0.05, producing a list that is roughly 95% true positive and 5% false positive, but that is still very conservative and will still often result in an empty list. Do we really want to be that stringent? In the end it depends on what you want to use the list for, but for most purposes we can imagine a list containing 80% true positives (q-value threshold of 0.2) or even 50% true positives (q-value threshold of 0.5) being useful. So I would pick a less stringent q-value threshold (guided by the purpose of the study) and report the list along with an estimate of the expected number of false positives on the list (the q-value threshold will give an upper bound; the sum of the actual q-values on the list gives a direct estimate of the number of false positives).
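To make the bookkeeping concrete, here is a small sketch of that last suggestion (the q-values are hypothetical numbers, standing in for the q-values of whatever sites survive the chosen threshold):

```python
import numpy as np

# hypothetical q-values for the sites that pass a q < 0.2 threshold
q_list = np.array([0.01, 0.04, 0.09, 0.15, 0.18])
threshold = 0.2

print("sites on the list:", q_list.size)
print("upper bound on false positives:", threshold * q_list.size)  # 1.0
print("direct estimate (sum of q-values):", q_list.sum())          # 0.47
```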

As for the second question, what you are proposing is an ad hoc approach that may have some validity, but it would be hard to interpret because it does not give a direct answer to a specific question. I am especially concerned that simply aggregating the answers to a set of site-level questions is unlikely to be a good way to answer a gene-level question. The main problem is that it is not clear which question you have in mind with "whether there is significantly more episodic selection in a specific gene family". Do you mean more sites with at least a given strength of episodic selection? Or stronger episodic selection on average (you would need to specify what you mean by "average" - the presence of purifying selection complicates things)? Or more time spent (on average per site) under positive selection? Etc., etc.: there is an endless variety of questions one can come up with, any of which arguably captures something of what you're getting at. The beauty of the probabilistic approach is that we can build a model to answer any of these questions directly, but first you have to decide what the question really is.

Hope this helps, Konrad Scheffler

spond commented 10 years ago

Hi CJY,

A few small points to add to Konrad's list:

FDR procedures estimate something about the distribution of p-values under the null using the p-values you get from the tests. This makes two assumptions: that many (if not most) of the data come from the null, and that the p-values obtained from different tests are directly comparable. Neither assumption is particularly appropriate for MEME:

  1. The test is generally conservative, i.e. you have to simulate data with dN/dS = 1 on a subset of branches to hit nominal p-values. For real data you either have strong conservation or positive selection, so truly neutral data are rare. This will likely skew your q-value estimates to be more conservative than needed.
  2. When you start comparing p-values derived from different genes, you are mixing apples and oranges. Genes have different information content, i.e. MEME will have more or less power to detect the same effect size based on the number of sequences, divergence levels, tree shapes, etc. I am not sure what this will do to q-value estimation, but probably nothing good.
  3. Consider the following thought experiment. You have a fixed number of sequences N (say 100), and because MEME derives power from a single site (i.e. its power is a function of N), say the minimum p-value you can realistically see is 10^-5. Now, as you start analyzing more and more genes, FDR or any other sequential rejection procedure will almost surely return an empty list of selected sites once the number of genes becomes large enough (see the small numeric sketch after this list). In other words, the more data you get, the less power you have to detect selection. This goes back to what Konrad was saying about how stringent you want to be: you either have to keep driving the p-value down, or accept that the more tests you run, the more false positives you will have, i.e. you control the proportion of false positives, but not the number.
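Here is a small numeric version of that thought experiment (the 10^-5 floor is the hypothetical minimum from point 3; for a lone true positive among uniform nulls, the rank-1 Benjamini-Hochberg hurdle of 0.05/m is the one that matters in practice):

```python
# once the total number of tests m exceeds 0.05 / 1e-5 = 5000, a single
# hit at the best attainable p-value can no longer clear the rank-1 hurdle
for m in (1_000, 5_000, 10_000, 100_000):
    hurdle = 0.05 / m
    print(f"m = {m:>7}: rank-1 BH threshold = {hurdle:.1e}, "
          f"p = 1e-5 still rejectable: {1e-5 <= hurdle}")
```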

Sergei

jychoilab commented 10 years ago

Hello Konrad and Sergei

Thank you both for the great, extensive answers! They made me think more about this p < 0.05 that we all (well, at least I) follow by heart. In fact, the reason I was considering the second method was that I wanted to avoid making as many multiple comparisons as possible and sidestep the multiple-comparison problem altogether. So, for example, say I want to see whether genes in glucose metabolism evolve faster than genes in sucrose metabolism (and let's say we have good reason to believe genes involved in sucrose metabolism are under more constraint). My plan was to conduct the MEME analysis (and potentially FUBAR) on all genes involved in glucose and sucrose metabolism. If the evolutionary pressure on a gene is close to neutral (though more likely under negative selection), I would expect fewer instances of high LRT values per gene, and this logic then extends to groups of genes. Thus, by comparing the distributions of LRT values of glucose-metabolism and sucrose-metabolism genes, my thought was that one could test the null hypothesis "There is no difference in the number of sites with significant episodic selection between the glucose and sucrose metabolism systems".

However, reading Sergei's second comment, perhaps grouping LRT values from different genes is not recommended. What if one conducted simulations to create a dataset of neutral sequences and used the LRT values from those sequences as a null distribution (as is done in some genome-wide phylogenetic studies)?
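Something like the following is what I am imagining (the null LRTs here are simulated placeholders; in practice they would come from fitting MEME to the simulated neutral alignments, and Sergei's point about LRTs from different genes not being directly comparable presumably still applies):

```python
import numpy as np

rng = np.random.default_rng(2)
null_lrt = rng.chisquare(df=1, size=10_000)  # stand-in for LRTs from neutral simulations
obs_lrt = np.array([0.3, 2.1, 6.8, 11.4])    # hypothetical observed LRTs

# empirical p-values with the standard +1 correction so p is never exactly 0
exceed = (null_lrt[None, :] >= obs_lrt[:, None]).sum(axis=1)
emp_p = (1 + exceed) / (1 + null_lrt.size)
print(np.round(emp_p, 4))
```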

Thank you once again for the great suggestions!