glennhickey / progressiveCactus

Distribution package for the Progressive Cactus multiple genome aligner. Dependencies are linked as submodules.

Sensitivity to tree topology and branch lengths? #109

Closed jasonsydes closed 5 years ago

jasonsydes commented 6 years ago

Hello there!

The summary: How sensitive is Progressive Cactus to input tree topology and/or branch lengths? (Mainly in terms of correctness, but we're also curious about runtime...)

We're trying to build a tree so that we can begin using Progressive Cactus. At the moment, we're stuck on the fact that we don't understand how sensitive Cactus is to the tree. We have certainly seen that runtime suffers dramatically if you give Cactus a star guide tree (Cactus does much better if you give it a binary guide tree).

Fair enough, we can provide Cactus a binary guide tree. How carefully do we need to build that tree? What if we get the topology a little wrong? A lot wrong? We've seen at least some other people are using a tree with no branch lengths. Is that best practice? Or is it better to use a tree with branch lengths? How sensitive is Cactus to branch lengths? What if we get the branch lengths a little wrong? A lot wrong?

We're just trying to get a sense of how much effort we should dedicate to this aspect of running Cactus.

Thank you so much for your time! Jason Sydes

diekhans commented 6 years ago

We recently ran some experiments with several reasonable trees, and the choice of tree had very little impact. The person with the timing results is away, but the choice of guide tree didn't affect the alignment results.

joelarmstrong commented 6 years ago

For the topology:

Yes, runtime will suffer badly with a star tree. Having a few polytomies where there is uncertainty isn't a bad thing though, as long as they are kept small. The runtime grows roughly quadratically with the size of the polytomy, and roughly linearly otherwise. We have some evidence that getting the topology "a little" wrong doesn't hurt too much. Like Mark said, we ran a test aligning the same large-ish set of 48 birds with 4 different trees, representing the various species tree guesses within the literature, as well as one with a few random edit operations applied. This covers the "a little" wrong case pretty well. The alignments weren't perfectly identical, but the effect was pretty similar to mere alignment noise. For example, we ran RAxML on all four, and got the exact same topology (or within a Robinson-Foulds distance of 0.04, depending on the regions used for inference).
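
To see how large the polytomies in a guide tree are before starting a run, something like the following could help. This is a minimal sketch and not part of Cactus: `polytomy_sizes` and `largest_polytomy` are hypothetical helper names, the species labels are placeholders, and it assumes a plain Newick string with no quoted labels or bracketed comments.

```python
def polytomy_sizes(newick: str) -> list[int]:
    """Return the number of children of every internal node.

    Assumes plain Newick: no quoted labels, no [comments].
    Sizes are listed in the order the nodes' closing parentheses appear.
    """
    sizes = []
    stack = []  # one child counter per currently open parenthesis
    for ch in newick:
        if ch == "(":
            stack.append(1)      # an open group has at least one child
        elif ch == "," and stack:
            stack[-1] += 1       # a comma adds a sibling at this depth
        elif ch == ")":
            sizes.append(stack.pop())
    return sizes


def largest_polytomy(newick: str) -> int:
    """Largest child count in the tree (2 means fully binary, 0 means no internal nodes)."""
    return max(polytomy_sizes(newick), default=0)


if __name__ == "__main__":
    # Placeholder trees: a star tree and a mostly resolved tree.
    star = "(A,B,C,D,E);"
    resolved = "((A,B),(C,D,E),F);"
    print(largest_polytomy(star))
    print(polytomy_sizes(resolved))
```

A tree whose largest value is 2 is fully binary; a small value such as 3 or 4 at a few uncertain nodes matches the "kept small" advice above, while a value equal to the number of genomes means a star tree.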

For the branch lengths: Cactus uses these mostly to speed up the alignment (it can use less sensitive alignment parameters at shorter evolutionary distances). It shouldn't be a major problem unless the branch lengths are way underestimated (or derived from conserved sequence rather than stuff evolving roughly neutrally). It's best to overestimate them by a bit: give it the high range of your best guess. (Leaving them blank sets them all to 1, which will almost always be an overestimate.)
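
If you already have estimated branch lengths and want to err toward the high side, one option is to scale them all up uniformly before passing the tree to Cactus. The following is a rough sketch, not Cactus functionality: `inflate_branch_lengths` and the scaling factor are assumptions made for illustration (the advice above only says to overestimate "by a bit"), and the code assumes plain Newick without quoted labels or comments.

```python
import re

# Matches ":<number>" where the number may have a decimal part and an exponent.
_BRANCH_LENGTH = re.compile(r":(\d*\.?\d+(?:[eE][-+]?\d+)?)")


def inflate_branch_lengths(newick: str, factor: float) -> str:
    """Multiply every branch length in a plain Newick string by `factor`.

    Nodes without a branch length are left untouched.
    """
    def scale(match: re.Match) -> str:
        scaled = float(match.group(1)) * factor
        return f":{scaled:g}"  # compact formatting, 6 significant digits

    return _BRANCH_LENGTH.sub(scale, newick)


if __name__ == "__main__":
    # Placeholder tree; the lengths are made-up illustration values.
    tree = "((A:0.1,B:0.2):0.05,C:0.3);"
    print(inflate_branch_lengths(tree, 2.0))
```

The factor of 2.0 in the example is arbitrary; choose whatever corresponds to the high end of your own estimate.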

iminkin commented 6 years ago

What are the units used for the branch lengths?

joelarmstrong commented 6 years ago

The branch lengths should be the neutral rate of nucleotide substitutions per site. (A rough estimate is OK, but estimating it solely from very conserved sequence like exons is probably a bad idea.)

iminkin commented 6 years ago

I think this information as well as #110 deserves adding to the documentation.

jasonsydes commented 5 years ago

Apologies, I realize I should probably close these issues as they receive very good answers. Thank you @joelarmstrong and @diekhans for your answers and insight, they've been very useful.