glennhickey / progressiveCactus

Distribution package for the Progressive Cactus multiple genome aligner. Dependencies are linked as submodules.

question about outgroups #63

Open cooketho opened 7 years ago

cooketho commented 7 years ago

Let's say I have four genomes: A, B, C and D with tree (((A, B), C), D), and there is a gene "X" I'm interested in that is present in A, B and D, but deleted in genome C. Am I correct that the default behavior of the aligner is that in the output HAL graph the gene X homologs in subtree (A, B) will be connected to each other, but not to the homolog in D? Or does that only happen if C is defined as reference quality (i.e. outgroup) with an asterisk in the newick file? If I want to avoid that behavior, can I just define C as being non-reference quality? The reason I ask is that if the sub-alignments are being done from the bottom up, I wonder if the problem is that the aligner doesn't know that C has a derived haplotype with a deletion in it. Please let me know if there is some argument I can pass (e.g. via the xml config file). Thanks!

glennhickey commented 7 years ago

Yeah, if C is your only outgroup for (A,B), I don't think you'll ever get the alignment to D. Support exists for using multiple outgroups to prevent this sort of thing (in your case it would be both D and C while aligning A,B). This is something that @joelarmstrong has done a lot of work on recently, so he may be the best person to comment on which options to use.


joelarmstrong commented 7 years ago

We currently use up to 3 outgroups per sub-alignment in the default configuration for exactly this reason. So you should get the alignment from X in A & B to X in D, even if X is deleted, or just missing data, in a few intervening genomes.
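To make this concrete for the tree in the question, here is a toy Python sketch of choosing several outgroups for the (A, B) sub-alignment. It is not Cactus's actual selection code: the nested-tuple tree, the function names, and the use of "number of steps up the tree" as the distance are all assumptions made for illustration.

```python
# Toy tree from the question: (((A, B), C), D), written as nested tuples.
TREE = ((("A", "B"), "C"), "D")


def leaves(node):
    """Return the leaf names under a node (a leaf is a plain string)."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def candidate_outgroups(tree, ingroup):
    """List (leaf, steps_up) for every leaf outside the ingroup's subtree.

    steps_up counts how many ancestors above the ingroup subtree we must
    climb before that leaf appears. This is a crude stand-in for real
    branch-length distance (assumption).
    """
    ingroup = set(ingroup)
    chain = []  # ancestors from the root down to the ingroup's parent
    node = tree
    while set(leaves(node)) != ingroup:
        chain.append(node)
        # Descend into the child that still contains the whole ingroup.
        node = next(c for c in node if ingroup <= set(leaves(c)))
    result = []
    for steps_up, ancestor in enumerate(reversed(chain), start=1):
        for child in ancestor:
            if not ingroup <= set(leaves(child)):
                result.extend((leaf, steps_up) for leaf in leaves(child))
    return result


def pick_outgroups(tree, ingroup, max_num):
    """Pick up to max_num of the closest candidate outgroups."""
    ranked = sorted(candidate_outgroups(tree, ingroup), key=lambda pair: pair[1])
    return [leaf for leaf, _ in ranked[:max_num]]


print(pick_outgroups(TREE, {"A", "B"}, max_num=1))
print(pick_outgroups(TREE, {"A", "B"}, max_num=3))
```

With a single outgroup only C is chosen, so a deletion of X in C leaves nothing to link (A, B) to D. Allowing up to 3 lets D be chosen as well.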

Our outgroup selection isn't very intelligent--it just picks the 3 closest outgroups--so it's definitely possible that the 3 outgroups may share the deletion, in which case increasing the number of outgroups ("max_num_outgroups" in the cactus_progressive_config.xml file) could help.
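As a rough sketch of changing that setting from a script rather than by hand: the Python below assumes "max_num_outgroups" is stored as an XML attribute somewhere in the config file, and it searches every element rather than assuming a particular parent tag. The function name and the example paths are hypothetical, so check your own copy of cactus_progressive_config.xml before relying on it.

```python
import xml.etree.ElementTree as ET


def set_max_num_outgroups(config_path, output_path, new_value):
    """Write a copy of the config with max_num_outgroups set to new_value.

    Assumption: the setting is an XML attribute named "max_num_outgroups".
    Returns the number of elements that were updated.
    """
    tree = ET.parse(config_path)
    changed = 0
    for element in tree.getroot().iter():
        if "max_num_outgroups" in element.attrib:
            element.set("max_num_outgroups", str(new_value))
            changed += 1
    if changed == 0:
        raise ValueError("no max_num_outgroups attribute found in " + config_path)
    tree.write(output_path)
    return changed


# Hypothetical usage (paths are placeholders):
# set_max_num_outgroups("cactus_progressive_config.xml", "my_config.xml", 5)
```

Note that ElementTree drops XML comments when parsing by default, so the written copy will lose any comments from the original file. Editing the value by hand works just as well.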

That said, there are other reasons we could miss the alignment from A to D. Is this happening to you currently? The multiple-outgroups code has been used by default for years, but we recently (a month or two ago) made some other changes that improved this type of problem (which is usually caused by deletions or missing data) quite a bit.