veg / hyphy-analyses

HyPhy standalone analyses
MIT License

Using BUSTED-PH genome-wide with unequal sampling of test and background species #33

Open mbarkdull opened 1 year ago

mbarkdull commented 1 year ago

Good morning,

I am working on a project to understand how a particular trait influences the number of genes under positive selection, at the level of the genome, and am using BUSTED-PH.

I have three species with my trait of interest, and 19 without, so in general the orthogroups I'm testing have many more background than test branches. My reviewers and I are concerned that this will result in more power to detect positive selection in the background than in the test branches, thus confounding the effect of trait presence/absence. In fact, genome-wide I do find many more orthogroups under positive selection in background species than in test species.

I'm hoping to get your thoughts on how best to address this potential problem. At present, I've tried to test the possibility that species sampling could explain my results by:

Then repeating the procedure 50 times to see the distribution of ratios that are produced by chance. My actual result falls right in the middle of the distribution of these replicates, so I wouldn't rule out that species sampling could produce the effect I found in my analysis.

Does that seem like a reasonable approach, or do you have any suggestions for how to explore/address this possible issue?

Very best, Megan Barkdull

spond commented 1 year ago

Dear @mbarkdull,

Firstly, power will generally increase for larger data sets, so a much smaller "test" group could lead to some issues. There are several considerations right off the bat.

  1. The number of sequences/branches in a group is not the best predictor of power. Both branch counts and branch lengths matter. If all you do by adding more sequences is divide them into shorter and shorter branches, overall power won't budge much, as you can see in this figure from the original BUSTED paper:

    [image: figure from the original BUSTED paper]
  2. The way the test fails here might be called failing "safe". If you have less power to detect selection associated with the trait (3 sequences), you will, at worst, fail to call some genes that are truly associated with the phenotype (false negatives). Granted, that could be a problem, but it is generally not as bad as having too many false positives.

  3. BUSTED-PH has a "difference" test as well, which is actually quite informative and provides another layer of protection against a false association.

Your "permutation" procedure makes sense, although I'd have to think about what the null distribution it generates actually describes. Something like: "do my three pre-selected branches have more detectable selection on average than three random branches?" This only makes sense if you expect the three "test" branches to be under uniformly more selective pressure in all genes. The way I think about BUSTED-PH is that it identifies specific genes where selection is more intense at specific branches. There may only be a few of those, and the background may well be under more selection on average as well. Not sure what I would expect.

If you are using BUSTED-PH, I would suggest running your permutation procedure through the BUSTED-PH analysis, and then reporting what fraction of the 100 randomly selected orthogroups show an association with the trait for the three "actual" branches, versus for repeated samples of three random branches. This would, crucially, incorporate the "difference" test component of BUSTED-PH.
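A rough sketch of that bookkeeping in Python, in case it is useful. It covers only the sampling and counting; the callable `is_associated` is hypothetical and stands in for however you run BUSTED-PH on one orthogroup with a given set of test species and decide, from its output (including the "difference" test), whether to call an association. The function names, the default of 50 replicates (taken from the procedure described above), and the one-sided empirical p-value are all assumptions for illustration, not part of HyPhy.

```python
import random
from typing import Callable, FrozenSet, List, Sequence

# Hypothetical signature: (orthogroup id, test species) -> association called?
AssociationCall = Callable[[str, FrozenSet[str]], bool]


def association_fraction(
    orthogroups: Sequence[str],
    test_species: FrozenSet[str],
    is_associated: AssociationCall,
) -> float:
    """Fraction of orthogroups called as associated for one test set."""
    hits = sum(1 for og in orthogroups if is_associated(og, test_species))
    return hits / len(orthogroups)


def permutation_null(
    orthogroups: Sequence[str],
    all_species: Sequence[str],
    n_test: int,
    is_associated: AssociationCall,
    n_replicates: int = 50,
    seed: int = 0,
) -> List[float]:
    """Association fractions for randomly drawn test sets of size n_test."""
    rng = random.Random(seed)  # fixed seed so the draws are reproducible
    pool = sorted(all_species)
    null = []
    for _ in range(n_replicates):
        random_test = frozenset(rng.sample(pool, n_test))
        null.append(association_fraction(orthogroups, random_test, is_associated))
    return null


def empirical_p_value(observed: float, null: Sequence[float]) -> float:
    """One-sided p-value: share of null replicates at least as large as observed.

    The +1 in numerator and denominator counts the observed value itself,
    which keeps the estimate away from exactly zero.
    """
    exceed = sum(1 for x in null if x >= observed)
    return (exceed + 1) / (len(null) + 1)
```

You would call `association_fraction` once with the three "actual" test species, call `permutation_null` with `n_test=3` over all 22 species, and then compare the observed fraction against the replicates with `empirical_p_value`. Whether a one-sided comparison is the right one depends on what you expect the trait to do.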

Hope this helps, Sergei