veg / hyphy-analyses

HyPhy standalone analyses

Contrast-FEL analyses #53

Open JJohnSmith opened 1 month ago

JJohnSmith commented 1 month ago

Hi,

I am new to this field and I am currently working on my master's thesis, focusing on the evolution of a specific morphological trait. Because of time constraints, I've opted to analyze 60 genes across a smaller number of species. I created two separate datasets: one for mammals (15 species) and another for archosaurs (12 species). In the mammalian dataset, there are two groups exhibiting the trait, whereas in the archosaur dataset, only one group does.

For selection analyses, I used several methods in HyPhy, specifically BUSTED, aBSREL, RELAX, and Contrast-FEL. For each group with the trait, I labeled the branch leading to its most recent common ancestor (MRCA) as a foreground branch. Consequently, I have two foreground branches in the mammalian dataset and one foreground branch in the archosaur dataset.

I've noticed that Contrast-FEL detected significantly more positively selected sites in the archosaur dataset compared to the mammalian dataset. When I re-ran the analyses on the mammalian dataset, this time with just one foreground branch (I removed the other group from the alignments), the number of positively selected sites in Contrast-FEL increased.

I came across this information on the HyPhy website under Contrast-FEL specifications:

"Rules of thumb for when this method is likely to work well, and when it is not. -Generally, you need 10 or more branches in each set to be able to have any statistical power. -Too little divergence is also likely to severely throttle statistical power."

It appears that my dataset configuration is problematic, lacking the necessary number of branches in each set, leading to low statistical power and inconsistent results. Given my time constraints, I cannot make significant changes to the datasets.
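A quick way to check a labelled tree against that rule of thumb is to count the branches in each set. The sketch below is only an illustration: it assumes the tree is a Newick string annotated with HyPhy-style `{Foreground}` tags, and that taxon names contain no commas, parentheses, or braces. The function name `count_branch_sets` and the toy tree are hypothetical, not taken from the actual datasets.

```python
import re


def count_branch_sets(newick, label="Foreground"):
    """Count labelled and remaining branches in a tagged Newick string.

    Every node except the root sits at the end of one branch. In plain
    Newick, leaves = commas + 1 and internal nodes = opening parentheses,
    so the number of branches is commas + opening parentheses.
    """
    total = newick.count(",") + newick.count("(")
    tagged = len(re.findall(r"\{" + re.escape(label) + r"\}", newick))
    return {"total": total, "labelled": tagged, "other": total - tagged}


# Hypothetical toy tree with two tagged MRCA branches.
toy_tree = "((A,B){Foreground},(C,D),(E,(F,G){Foreground}));"
counts = count_branch_sets(toy_tree)
print(counts)  # {'total': 11, 'labelled': 2, 'other': 9}

# Compare against the quoted rule of thumb (10 or more branches per set).
print(counts["labelled"] >= 10)  # False
```

With one or two tagged branches, the labelled set is far below the quoted guideline, whatever the size of the rest of the tree.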

I was wondering if lowering the q-value threshold from 0.2 to 0.1 (or another value within that range) would be a viable approach to mitigate these issues.
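To make the effect of the threshold concrete, here is a minimal sketch of how q-values relate to per-site p-values. It assumes the q-values are Benjamini-Hochberg adjusted p-values (my understanding of what Contrast-FEL reports, but worth checking against the method's documentation). The p-values are made up for illustration, and the helper names `bh_qvalues` and `sites_below` are hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


def sites_below(qvalues, threshold):
    """Indices of sites whose q-value is at or below the threshold."""
    return [i for i, q in enumerate(qvalues) if q <= threshold]


# Made-up per-site p-values, for illustration only.
q = bh_qvalues([0.001, 0.02, 0.09, 0.6])
print(sites_below(q, 0.2))  # [0, 1, 2]
print(sites_below(q, 0.1))  # [0, 1]
```

Under this assumption, moving the threshold from 0.2 to 0.1 can only shorten the list of reported sites. It makes the report more conservative, but it does not change the underlying p-values or add statistical power.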

Any advice would be greatly appreciated.

Thank you so much!

spond commented 1 month ago

Dear @JJohnSmith,

If I understand your procedure correctly, then:

  1. For each gene, you have two separate alignments, one for mammals and one for archosaurs. They each have separate trees, which you have labeled, and analyzed one at a time.
  2. You ran contrast-fel on these genes separately, using the MRCA branch of the clade/clades with the trait of interest and the rest of the tree as background.
  3. contrast-fel reported a list of sites, separately for mammals and archosaurs, and you compare those.

If this is the case, then contrast-fel is comparing selection between the trait MRCA branch(es) and the rest of the mammalian tree, and, separately, doing the same for the archosaur tree. This is probably not the comparison that you want to perform. If the primary goal is to compare mammals and archosaurs, some sort of joint analysis needs to be performed. The easiest way to do it is to create a combined alignment of mammals and archosaurs (assuming it is possible), and to run contrast-fel on it.
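For reference, a run like this can be scripted per gene. The sketch below only builds the command line and does not execute anything. The flag names (`--alignment`, `--tree`, `--branch-set`, `--q-value`) reflect my understanding of the HyPhy command-line interface and should be checked against the HyPhy documentation for the installed version. The file names and the helper `contrast_fel_command` are hypothetical.

```python
def contrast_fel_command(alignment, tree, branch_sets, q_value=None):
    """Build an argument list for a contrast-fel run (not executed here).

    Flag names are assumptions based on the HyPhy CLI; verify them
    against the documentation before use.
    """
    cmd = ["hyphy", "contrast-fel", "--alignment", alignment, "--tree", tree]
    for label in branch_sets:
        # One --branch-set argument per labelled set in the tree.
        cmd += ["--branch-set", label]
    if q_value is not None:
        cmd += ["--q-value", str(q_value)]
    return cmd


# Hypothetical file names for a single gene.
cmd = contrast_fel_command("gene1.fas", "gene1.nwk", ["Foreground"], q_value=0.2)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once the flags have been confirmed.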

Would you mind sharing some of the alignments and trees and the corresponding results? It'll be much more informative if I could base my comments on the specific data.

Best, Sergei

JJohnSmith commented 1 month ago

Dear @spond,

I apologize if I was not clear enough and it seemed a little bit confusing.

I am not comparing mammals with archosaurs. I am comparing mammals with mammals in one analysis, and within archosaurs I am comparing crocodylians with birds.

I am analyzing around 50 genes and unfortunately I have neither the time nor the computational power to perform the analysis on a combined dataset of mammals and archosaurs.

MammalianTree

ArchosaurTree

Above are the trees, and in red are the branches I labeled as foreground to be tested.

The alignments of the mammalian sequences are quite complete. I used Guidance, which automatically removed columns with a confidence score below 0.93, and I then manually refined the alignments to remove columns that contained less than 80% information.

For the archosaur alignments, I lowered the Guidance threshold slightly, to 0.85, because of the higher divergence between crocodylians and birds.
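The manual filtering step can be sketched roughly as follows. This is an illustration rather than the exact procedure used: it assumes "less than 80% information" means that fewer than 80% of the sequences have a non-gap character in a column, and the function name `filter_columns` and the toy alignment are hypothetical. Note that dropping single nucleotide columns from a codon alignment can break the reading frame, so for codon-based analyses the same test would need to be applied to whole codons.

```python
def filter_columns(seqs, min_occupancy=0.8, gap_chars="-?"):
    """Keep columns where at least min_occupancy of sequences are non-gap.

    seqs maps sequence names to aligned strings of equal length.
    """
    names = list(seqs)
    length = len(seqs[names[0]])
    if any(len(seqs[n]) != length for n in names):
        raise ValueError("sequences must be aligned to the same length")
    keep = []
    for i in range(length):
        filled = sum(1 for n in names if seqs[n][i] not in gap_chars)
        if filled / len(names) >= min_occupancy:
            keep.append(i)
    return {n: "".join(seqs[n][i] for i in keep) for n in names}


# Toy alignment: the last column is filled in only 2 of 5 sequences.
toy = {"a": "ATG-", "b": "ATGC", "c": "A-G-", "d": "ATG-", "e": "ATGC"}
print(filter_columns(toy))
```

In the toy example the last column is dropped, while the second column (4 of 5 sequences filled) is kept.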

Results_HyPhy.xlsx

Above is a summary of the results. I also conducted tests with aBSREL and BUSTED.

It seems reasonable that I got many more positively selected sites in archosaurs, considering the phylogenetic divergence between crocodylians and birds, right? Also, since I lowered the confidence score threshold in Guidance for those alignments, some lower-confidence regions could be producing false positives. I intend to look closely at each positively selected site in the alignments to check whether alignment problems could explain it.

My main concern is that there isn't enough statistical power in my analyses, since I am testing only 2 branches in mammals and only 1 branch in archosaurs, and that this may lead to unreliable results with high rates of false positives and false negatives.

Considering that this is for a master's degree dissertation with constraints on time and computational power, do you think that these particular subsets of foreground and background branches could make the results completely unreliable due to the lack of statistical power?

I also think that it is important to mention that it is not known yet if the genes I am analyzing are directly regulating the trait of interest. I am just exploring that hypothesis because the genes were identified in a transcriptomic analysis in a mammalian species and the trait is also found in crocodylians.

Thank you so much for taking the time to respond.