mossmatters / MJPythonNotebooks

Visualizing gene tree conflict using Phyparts, and ETE3
MIT License

Question about collapsing gene trees based on bootstrap values #3

Open FrancisNge opened 1 year ago

FrancisNge commented 1 year ago

Hi there,

I'm curious about your thoughts (and those of others) on the threshold used to collapse poorly supported branches in each gene tree, based on bootstrap support, prior to the PhyParts visualisation.

In your text, you suggested a 33% threshold cut-off. Many papers I see use a cut-off of 10 bootstrap or lower. I tried the latter and also a more conservative cut-off of 70 bootstrap or lower, and this evidently affected the PhyParts results. With the first cut-off (BS 10), most of the genes were in conflict with the species tree, whereas with the second (BS 70), most of the genes were 'uncertain' (i.e., grey colours).

Would it be more reasonable to go for the second option and collapse nodes that are poorly supported anyway (i.e., less than 70 bootstrap support)?

Best, Francis

mossmatters commented 1 year ago

Francis,

This is a fantastic question and not one that has been explored with real data as much as I would like. The 10% threshold seems to come from the ASTRAL-III paper (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2129-y), but intuitively this seems far too low. I had suggested 33% because that is the frequency a conflicting quartet topology would be expected to have under deep coalescence.
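To make the one-third figure concrete, here is a minimal sketch of the standard multispecies coalescent result for a four-taxon species tree: with an internal branch of length t (in coalescent units), the matching quartet topology has probability 1 - (2/3)e^(-t) and each of the two conflicting topologies has probability (1/3)e^(-t). The function name and the example branch lengths are assumptions made for illustration, not anything taken from PhyParts or ASTRAL.

```python
import math


def quartet_probabilities(internal_branch_length):
    """Return (concordant, discordant_1, discordant_2) quartet probabilities.

    Assumes a four-taxon species tree under the multispecies coalescent,
    with the internal branch length given in coalescent units.
    """
    if internal_branch_length < 0:
        raise ValueError("branch length must be non-negative")
    # Each conflicting topology gets one third of the "failed to coalesce" mass.
    discordant = math.exp(-internal_branch_length) / 3.0
    concordant = 1.0 - 2.0 * discordant
    return concordant, discordant, discordant


# Illustrative branch lengths only: shorter branches mean deeper coalescence.
for t in (0.0, 0.5, 2.0):
    probs = quartet_probabilities(t)
    print(t, [round(p, 3) for p in probs])
```

As the internal branch shrinks toward zero, all three topologies approach 1/3, so a conflicting topology never exceeds one third under this model. That is the intuition behind treating roughly 33% as a floor below which support is indistinguishable from coalescent noise.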

As you move the threshold up, you end up collapsing more gene tree branches, which show up as gray in the pie charts. DiscoVista, created by the ASTRAL developers, uses 75% as an example of "strong support" when summarizing gene trees: https://github.com/esayyari/DiscoVista#2-discordance-analysis-on-gene-trees
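The effect of raising the threshold can be sketched with a toy tree. This is a standalone illustration, not the code used by PhyParts or by the notebook: the `Node` class, the function names, and the example tree are all hypothetical, and the sketch assumes a branch is collapsed when its support is strictly below the threshold.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical minimal tree node; support is the bootstrap of the branch above it."""
    name: str = ""
    support: Optional[int] = None
    children: List["Node"] = field(default_factory=list)


def collapse_low_support(node, threshold):
    """Return a copy of the tree with internal branches below threshold collapsed."""
    new_children = []
    for child in node.children:
        collapsed = collapse_low_support(child, threshold)
        weak = collapsed.support is not None and collapsed.support < threshold
        if collapsed.children and weak:
            # Drop the weak branch: its children join the parent as a polytomy.
            new_children.extend(collapsed.children)
        else:
            new_children.append(collapsed)
    return Node(node.name, node.support, new_children)


def to_newick(node):
    """Serialize to Newick without the trailing semicolon."""
    if not node.children:
        return node.name
    inner = ",".join(to_newick(c) for c in node.children)
    label = "" if node.support is None else str(node.support)
    return f"({inner}){label}"


# Toy gene tree: ((A,B)95,((C,D)20,E)60);
gene_tree = Node(children=[
    Node(support=95, children=[Node("A"), Node("B")]),
    Node(support=60, children=[
        Node(support=20, children=[Node("C"), Node("D")]),
        Node("E"),
    ]),
])

for threshold in (10, 33, 70):
    print(threshold, to_newick(collapse_low_support(gene_tree, threshold)) + ";")
```

In this toy case nothing is collapsed at 10, the (C,D) branch is lost at 33, and at 70 only the (A,B) branch survives. Every collapsed branch is one that can no longer register as either concordant or conflicting, which is why higher thresholds shift the pie charts toward gray.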

Of course, quartets are different from bipartitions. In many cases the conflicting bipartitions are a result of a single taxon "jumping" into different clades with low support. One approach I developed to look for this dives deeper into the minority bipartitions. It's called "minority report" and is available here: github.com/mossmatters/phyloscripts (that repo also has a more streamlined version of the phypartspiecharts script). Running minorityreport.py on your phyparts output for a given node will tell you which sample(s) is/are causing the conflicting bipartitions and in how many genes. I've used this to decide which "rogue taxa" to delete from analyses.
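The idea of tallying which samples drive the conflict can be illustrated with a small heuristic. To be clear, this is not the algorithm in minorityreport.py and it does not read PhyParts output: the function names, the input format (one conflicting clade per gene, given as a set of taxon names), and the example data are all assumptions made for the sketch. The heuristic blames the smallest set of taxa whose removal would make the conflicting clade compatible (nested or disjoint) with the species tree clade.

```python
from collections import Counter


def taxa_causing_conflict(reference_clade, conflicting_clade):
    """Return the smallest taxon set whose removal resolves the conflict.

    Two clades are compatible when they are nested or disjoint, so removing
    the overlap or either one-sided difference is enough. Ties are broken
    arbitrarily by order; compatible clades yield an empty set.
    """
    ref, alt = set(reference_clade), set(conflicting_clade)
    candidates = [ref & alt, ref - alt, alt - ref]
    return min(candidates, key=len)


def tally_rogue_taxa(reference_clade, conflicting_clades):
    """Count, across genes, how often each taxon is blamed for a conflict."""
    counts = Counter()
    for clade in conflicting_clades:
        counts.update(taxa_causing_conflict(reference_clade, clade))
    return counts


# Hypothetical species tree clade and one conflicting clade per gene.
species_clade = {"A", "B", "C", "D", "E"}
gene_conflicts = [
    {"E", "X", "Y"},                  # E groups with outside taxa
    {"E", "X", "Y", "Z"},             # E again
    {"A", "B", "C", "D", "X", "Y"},   # E excluded from the rest of its clade
    {"D", "X", "Y"},                  # D jumps out once
]

counts = tally_rogue_taxa(species_clade, gene_conflicts)
for taxon, n_genes in counts.most_common():
    print(taxon, n_genes)
```

A taxon that is blamed in many genes while the rest of the clade stays intact is a candidate rogue. A tally like this only suggests where to look; the actual script in the phyloscripts repo is the tool to run on real PhyParts output.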

Ultimately, the gene tree bootstrap threshold is a matter of personal preference: pick one you feel comfortable with for your study. For example, if you're working within a species complex or have very short protein-coding genes as your loci, you may not find a lot of bootstrap support within each gene. But if you have longer loci and high divergence in your study, going with a higher threshold helps reduce the chance you're drawing conclusions from poorly resolved gene trees.

Hope that helps some!

~Matt