matsengrp / ecgtheow

Ancestral lineage reconstruction using BEAST or RevBayes

Rogue taxon identification #8

Open matsen opened 6 years ago

matsen commented 6 years ago

In Bayesian phylogenetics, "rogue taxa" are taxa that move around the tree. I definitely think that this is what we were seeing with the long branches in the 125 set. See, e.g. https://scholar.google.com/scholar?q=%22rogue+taxa%22&btnG=&hl=en&as_sdt=0%2C48

How shall we proceed? We could try some existing solution, or directly look for this effect in terms of things moving on and off the lineage. If there's too much uncertainty and they have long branches, perhaps they get pruned?

lauradoepker commented 6 years ago

I love the term rogue taxa. 😝

dunleavy005 commented 6 years ago

Yeah, this seems like an alternative to cft pruning for sure. I'm not quite sure from the Stamatakis paper whether it'll improve things (he uses consensus tree improvement as validation), but I guess we just try different options and see if one leads to better ASR resolution empirically?

matsen commented 6 years ago

Er, I don't think of it as an alternative to pruning. I still like the pruning strategy, but I want to have some guidelines about when to stop adding taxa.

lauradoepker commented 6 years ago

Shouldn't we run our seq list through their rogue taxa clipping service and then prune as usual? Their input = a file containing bootstrap trees. I think we'd need to tinker and do more than our original FastTree generation, right?

matsen commented 6 years ago

Perhaps! My feeling is that we care most about things wandering around in the Bayesian posterior, so we could start by running things with a generous taxon set then do some trimming based on that.

lauradoepker commented 6 years ago

How do we generate a file of bootstrap trees? Here is the link to their program RogueNaRok

dunleavy005 commented 6 years ago

So maybe we do something to the effect of wide net prune -> beast -> rogue flagging -> edge plotting? Seeing how many taxa were flagged by the post-beast rogue trimming would tell us the quality of the prune step.

lauradoepker commented 6 years ago

Our *.trees files are NEXUS format. Perhaps we could just reformat to Newick and submit to rogue trimming.
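A minimal sketch of that reformatting step, using only the Python standard library. It assumes a BEAST-style NEXUS file with a `Translate` block, one tree per line, and no nested square brackets inside annotations; the function name `nexus_to_newick` is hypothetical and not part of this repo. Whether RogueNaRok accepts posterior samples in place of bootstrap trees would still need checking.

```python
import re


def nexus_to_newick(nexus_text):
    """Return one Newick string per tree line in a BEAST-style NEXUS text."""
    # Parse the optional Translate block that maps numeric ids to taxon names.
    translate = {}
    block = re.search(r"translate\s+(.*?);", nexus_text,
                      flags=re.IGNORECASE | re.DOTALL)
    if block:
        for entry in block.group(1).split(","):
            parts = entry.split()
            if len(parts) >= 2:
                translate[parts[0]] = parts[1].strip("'")

    newicks = []
    for line in nexus_text.splitlines():
        stripped = line.strip()
        if not stripped.lower().startswith("tree "):
            continue
        # Drop bracketed annotations such as [&R] or [&rate=0.1] first,
        # because they can contain "=" themselves.
        stripped = re.sub(r"\[[^\]]*\]", "", stripped)
        newick = stripped.split("=", 1)[1].strip()
        # Replace numeric ids (those directly after "(" or ",") with names.
        newick = re.sub(
            r"(?<=[(,])(\d+)(?=[:,)])",
            lambda m: translate.get(m.group(1), m.group(1)),
            newick,
        )
        newicks.append(newick)
    return newicks


example = """#NEXUS
Begin trees;
  Translate
    1 BF520.1,
    2 seqA,
    3 seqB
    ;
tree STATE_0 [&lnP=-10.5] = [&R] ((1:0.1,2:0.2)[&rate=1.0]:0.05,3:0.3);
tree STATE_1000 [&lnP=-9.5] = [&R] ((1:0.1,3:0.2):0.05,2:0.3);
End;
"""

for tree in nexus_to_newick(example):
    print(tree)
```

Writing the returned strings one per line gives a plain Newick tree file.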

matsen commented 6 years ago

This "rogue taxon" idea seems like a good hypothesis for what happened with 125, but has it actually been confirmed?

Seems like we should make sure that these sequences with long branches are actually invading the ancestral lineage.

matsen commented 6 years ago

Here's an idea for easily checking whether the long branches could be invading the ancestral lineage.

Use our runs to build a consensus tree, with posterior support. If the long branches are moving around too much, then the smallest clades containing them will have low support.

Ascii art:

      /--- long branch sister
-----|
      \--------------- long branch

should have low support.
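A rough sketch of this check done directly on a sample of posterior trees, without building the consensus tree first. For a given taxon it finds the smallest clade containing it in each tree, takes the most common such clade, and reports the fraction of sampled trees that contain that clade. It assumes Newick strings with annotations already stripped and no whitespace inside the tree; the names `clades` and `smallest_clade_support`, and the toy trees, are made up for illustration.

```python
from collections import Counter


def clades(newick):
    """Return the leaf set of every internal node of a Newick tree."""
    stack, found = [], []
    token, in_length, after_close = "", False, False
    for ch in newick.strip():
        if ch == "(":
            stack.append(set())
            after_close = False
        elif ch in ",);":
            # A pending token is a leaf name unless it labels an internal node.
            if token and not after_close and stack:
                stack[-1].add(token)
            token, in_length, after_close = "", False, False
            if ch == ")":
                done = stack.pop()
                found.append(frozenset(done))
                if stack:
                    stack[-1].update(done)
                after_close = True
        elif ch == ":":
            in_length = True  # skip the branch length that follows
        elif not in_length:
            token += ch
    return found


def smallest_clade_support(newicks, target):
    """Most common smallest clade containing `target`, and its support."""
    per_tree = [set(clades(nwk)) for nwk in newicks]
    smallest = Counter(
        min((c for c in tree if target in c), key=len) for tree in per_tree
    )
    modal, _ = smallest.most_common(1)[0]
    support = sum(modal in tree for tree in per_tree) / len(per_tree)
    return modal, support


# Toy posterior sample: "L" is a long branch that changes position.
sample = [
    "((A:1,B:1):1,(C:1,L:9):1);",
    "((A:1,B:1):1,(C:1,L:9):1);",
    "((A:1,L:9):1,(B:1,C:1):1);",
    "(((A:1,B:1):1,C:1):1,L:9);",
]
print(smallest_clade_support(sample, "L"))
print(smallest_clade_support(sample, "A"))
```

In the toy sample the clade around `L` is recovered in fewer trees than the clade around `A`, which is the low-support signature described above.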

dunleavy005 commented 6 years ago

Forgot to respond, I like this idea.

matsen commented 6 years ago

Copying from Slack, this shows in our case that the extra 25 taxa are rogue and are invading the seed lineage.

With just 100 sequences, we have high sister confidence:

image

But with 125 sequences:

image

lauradoepker commented 6 years ago

^why are the invaders clustered around BF520.1, specifically @matsen ? Are the rogue taxa attracted to our seed? Can we easily prune these rogues out by editing our current prune.py script or will we need to use RogueNaRok?

matsen commented 6 years ago

No, I don't think they are especially.

My thought was that for this particular data set we will do fine taking the 100 set, which we can justify by saying that the branches get too long for the additional taxa.

I think the main problem may actually be the clock model in BEAST. I'm getting started working on that.