lskatz / mashtree

:deciduous_tree: Create a tree using Mash distances
GNU General Public License v3.0

Help understanding bootstrap results #53

Closed by brettChapman 4 years ago

brettChapman commented 4 years ago

Hi

I've run mashtree_bootstrap.pl with 100 replicates and with 1000 replicates (see attached images). Would I be correct in assuming the 100 is a percentage? When I first ran it with 100 replicates, I assumed the value was the number of replicates that agreed with that branch, not a percentage of replicates. I haven't found anywhere in the documentation that says this.

Given that some branches have low confidence (e.g. 21%), could I interpret this to mean that genomes are missing from these branches, and that including them later (in a much larger comparison) would boost the confidence in this region of the tree?

Thanks for your help.

Image: bootstrap tree, 100 replicates (pangenome_tree_bootstrap)

Image: bootstrap tree, 1000 replicates (pangenome_tree_bootstrap_1000rep)

lskatz commented 4 years ago

Hi! Sorry for the confusion. Check out Figure 2 of the paper here: https://joss.theoj.org/papers/10.21105/joss.01762

The bootstrap value on a node is the percentage of bootstrap replicate trees that support that node.
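To make the percentage concrete, here is a minimal sketch in plain Python. It is not mashtree's implementation: the function name `clade_support` and the toy data are hypothetical, and each tree is simplified to the set of clades (groups of leaf names) it contains. The point is only that the value is a fraction of replicates scaled to 100, so it reads the same way whether 100 or 1000 replicates were run.

```python
def clade_support(reference_clades, replicate_trees):
    """Return {clade: percent of replicate trees that contain that clade}."""
    n = len(replicate_trees)
    return {
        clade: 100.0 * sum(clade in tree for tree in replicate_trees) / n
        for clade in reference_clades
    }


# Hypothetical toy data: a clade is a frozenset of leaf names,
# and a tree is the set of clades it contains.
AB = frozenset({"A", "B"})
CD = frozenset({"C", "D"})
AC = frozenset({"A", "C"})

reference = {AB, CD}
replicates = [{AB, CD}, {AB, CD}, {AB}, {AC}]  # 4 replicate trees

support = clade_support(reference, replicates)
print(support[AB])  # 75.0 (found in 3 of 4 replicates)
print(support[CD])  # 50.0 (found in 2 of 4 replicates)
```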

A low bootstrap value can have many causes, so fixing it is not always easy. Sometimes it's easier to collapse some nodes, because there might simply not be enough support for them. Other options are to remove low-quality genomes, to filter or trim low-quality reads, or, specifically for mashtree, to use --mindepth 0.
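Collapsing a weakly supported node means removing it and attaching its children directly to its parent, which leaves a polytomy. The sketch below illustrates the idea in plain Python under several assumptions: the dict-based tree, the helper names, and the threshold of 50 are all hypothetical, and branch lengths are ignored. It is not part of mashtree; in practice a tree library or viewer would usually do this step.

```python
def collapse_low_support(node, threshold):
    """Return a copy of `node` with weakly supported internal nodes removed.

    A node is a dict: {"name": str or None, "support": number or None,
    "children": [...]}. The children of a collapsed node are attached
    directly to its parent, which turns the weak split into a polytomy.
    """
    new_children = []
    for child in node["children"]:
        child = collapse_low_support(child, threshold)
        is_internal = bool(child["children"])
        if is_internal and child["support"] is not None and child["support"] < threshold:
            new_children.extend(child["children"])  # splice grandchildren in
        else:
            new_children.append(child)
    return {"name": node["name"], "support": node["support"], "children": new_children}


def leaf(name):
    return {"name": name, "support": None, "children": []}


def to_newick(node):
    """Serialize to Newick without branch lengths; support is the node label."""
    if not node["children"]:
        return node["name"]
    inner = ",".join(to_newick(c) for c in node["children"])
    label = "" if node["support"] is None else str(node["support"])
    return f"({inner}){label}"


# Hypothetical tree: ((A,B)21,(C,D)98);
tree = {"name": None, "support": None, "children": [
    {"name": None, "support": 21, "children": [leaf("A"), leaf("B")]},
    {"name": None, "support": 98, "children": [leaf("C"), leaf("D")]},
]}

collapsed = collapse_low_support(tree, threshold=50)
print(to_newick(collapsed) + ";")  # (A,B,(C,D)98);
```

The (A,B) grouping, supported by only 21% of replicates, is dissolved, while the well-supported (C,D) node is kept. The input tree is left unchanged.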

brettChapman commented 4 years ago

Thanks for the explanation and pointing me to that figure. It all makes sense to me now.

I did use --mindepth 0 with the mashtree_bootstrap.pl script. I guess those particular genomes were more difficult to assemble.

The genomes I'm using are part of a bigger study on barley. We didn't do the assembly; international collaborators of ours did.

I'm using the tree primarily as a guide tree for the Cactus genome alignment tool. However, mashtree has also provided some insight into differences between the genomes, grouping varieties from similar geographic locations together, which could help tell a good story. It'll be worth noting the low bootstrap values on some branches as we continue with the study.

Thanks.

lskatz commented 4 years ago

Good luck @brettChapman ! Sounds interesting!

Please open a new issue or reopen this one if you run into anything else.