veg / hyphy

HyPhy: Hypothesis testing using Phylogenies
http://www.hyphy.org

Minimum species number used in branch site model analysis using absrel or busted? #1088

Closed hongzhonglu closed 4 years ago

hongzhonglu commented 4 years ago

@spond Could I ask a question about choosing the number of species on the foreground branches for analyses with aBSREL and BUSTED? I found that aBSREL and BUSTED analyses take a very long time if, for example, the foreground contains more than 50 species. To reduce the analysis time, could I randomly choose about 10 species as the foreground for these analyses? Do you have any experience with choosing the number of foreground species in branch-site model analyses to speed them up? Looking forward to your comments!

Thanks very much! Hongzhong

spond commented 4 years ago

Dear @hongzhonglu,

How large is your complete dataset? The choice of branches to test should be informed by the biological question. Beyond that:

  1. For BUSTED, more sequences == more power (typically)
  2. For aBSREL, the default mode of testing is to look at each branch individually. As you increase the number of branches to test, the run time will increase (although it should be linear in the # of branches), and the power to detect selection in each branch will decrease (due to multiple testing correction). So for aBSREL it does help to select a smaller subset of branches to test, but only if that is informed by the biological question.
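To illustrate the multiple-testing point in item 2, here is a minimal sketch in Python. It assumes a Holm-Bonferroni style step-down correction; the thread does not name the procedure, so treat that choice as an assumption, and note that `holm_bonferroni` and the p-values are hypothetical, not HyPhy output. The same per-branch p-value of 0.004 passes when 10 branches are tested but fails when 70 are tested.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return one boolean per test: True if rejected under Holm-Bonferroni.

    Illustrative helper (not part of HyPhy); assumes independent per-branch tests.
    """
    m = len(p_values)
    # Indices of the tests, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    rejected = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            rejected[idx] = True
        else:
            break  # step-down procedure: stop at the first failure
    return rejected


# Hypothetical p-values: one branch with p = 0.004, all others clearly non-significant.
few_branches = [0.004] + [0.5] * 9    # 10 branches tested
many_branches = [0.004] + [0.5] * 69  # 70 branches tested

print(holm_bonferroni(few_branches)[0])   # threshold is 0.05 / 10 = 0.005
print(holm_bonferroni(many_branches)[0])  # threshold is 0.05 / 70, about 0.0007
```

With fewer branches tested, the threshold applied to the smallest p-value is less strict, which is why a biologically motivated subset can recover power.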

HTH, Sergei

hongzhonglu commented 4 years ago

Dear Sergei @spond, Thank you very much for your help! In total I have 343 species. About 70 of them fall in one clade of interest, which I want to set as the foreground. The calculation takes a long time, so I would like to randomly choose several species from that clade as the foreground. Is that reasonable?

spond commented 4 years ago

Dear @hongzhonglu,

If anything, I would reduce the number of sequences in the background. If all of your 70 sequences are in the same clade, you should just focus on that clade with a few outgroup sequences (this would be different if your foreground sequences were intermixed with background sequences). Additionally, if you were to filter the focal clade, I would not do it randomly, but rather by selecting representative sequences. For example, if you have a number of highly similar sequences, you could choose to replace them with one.
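One way to pick representatives instead of sampling at random could look like the sketch below. It is only an illustration under stated assumptions: the sequences are already aligned, similarity is measured with a simple p-distance, and the helper names (`p_distance`, `pick_representatives`), the toy sequences, and the `max_dist` threshold are all hypothetical. It also ignores the tree, so any reduced set should still be checked against the phylogeny.

```python
def p_distance(a, b):
    """Fraction of differing sites among aligned positions where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # no comparable sites: treat the sequences as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def pick_representatives(seqs, max_dist=0.01):
    """Greedy filter: keep a sequence only if it is farther than max_dist
    from every sequence kept so far. `seqs` maps name -> aligned sequence."""
    kept = {}
    for name, seq in seqs.items():
        if all(p_distance(seq, rep) > max_dist for rep in kept.values()):
            kept[name] = seq
    return kept


# Toy alignment (hypothetical data): sp2 is identical to sp1, sp3 differs at one site.
seqs = {
    "sp1": "ATGGCCAAATTT",
    "sp2": "ATGGCCAAATTT",
    "sp3": "ATGGCCAAGTTT",
    "sp4": "ATGTCCGAGTTA",
}

print(list(pick_representatives(seqs, max_dist=0.1)))  # names of the retained sequences
```

The greedy pass depends on input order, so the first member of each group of near-identical sequences is the one retained; sorting the input (for example by sequence completeness) before filtering is one way to control that choice.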

This applies to both aBSREL and BUSTED. In general, there is a wide body of literature on taxonomic sampling (e.g. see https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2796430/), and you could refer to some of the approaches developed in that area.

I think, however, that simply dramatically reducing the # of background sequences is what you should do here.

Best, Sergei

hongzhonglu commented 4 years ago

Very nice suggestion! Great thanks for your help!