sanger-pathogens / Roary

Rapid large-scale prokaryote pan genome analysis
http://sanger-pathogens.github.io/Roary

Adding an outgroup to core gene alignment #285

Open swuyts opened 8 years ago

swuyts commented 8 years ago

Dear,

We've been using Roary with pleasure for the last few months, so thanks for this awesome piece of software!

However, we're missing one feature: adding an outgroup to the core gene alignment file to make a well-rooted tree.

At first we just added a closely related species to the whole process, but as expected, this led to a lot of N's in the final alignment file. That's when we came up with the following approach:

The slowest step here is our alignment against the core genome. We would like to speed this up in a similar way to Roary, namely by aligning per gene/group and then concatenating at the end.

Our question basically is: is there a way to invoke MergeMultifastaAlignments.pm given only the different '.fna.aln' files per group?
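For illustration, the "align per gene, then concatenate" idea can be sketched in a few lines of Python. This is not how MergeMultifastaAlignments.pm works internally; it is a minimal sketch that assumes each per-gene alignment has already been parsed and that its FASTA headers were renamed to sample names. The function names and the `missing` parameter are hypothetical.

```python
from typing import Dict, List


def parse_fasta(text: str) -> Dict[str, str]:
    """Parse FASTA text into {header: sequence}; the header is cut at the first space."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def concatenate_alignments(
    alignments: List[Dict[str, str]],
    samples: List[str],
    missing: str = "-",
) -> Dict[str, str]:
    """Concatenate per-gene alignments sample by sample.

    Assumption: headers are sample names. A sample that is absent from a
    gene alignment is padded with `missing` over that gene's full width.
    """
    merged: Dict[str, List[str]] = {sample: [] for sample in samples}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment differ in length")
        width = lengths.pop()
        for sample in samples:
            merged[sample].append(aln.get(sample, missing * width))
    return {sample: "".join(parts) for sample, parts in merged.items()}
```

Called with the parsed '.fna.aln' files and the full sample list (outgroup included), this returns one concatenated sequence per sample that can be written out as a single multi-FASTA file.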

andrewjpage commented 8 years ago

Hi, sounds like a great feature. The merge step can be run independently: https://github.com/sanger-pathogens/Roary/blob/master/bin/pan_genome_core_alignment You would need a multifasta file for the genes in your outgroup, and you would need to add an extra column at the end of the gene presence and absence spreadsheet saying which of your outgroup genes are in which clusters. I can help with queries about integrating it, but I'm afraid I wouldn't be able to dedicate time to implementing this as part of Roary due to staff shortages. Regards, Andrew
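As a rough sketch of the "extra column" step, the snippet below appends an outgroup column to the spreadsheet text. It assumes the spreadsheet is a quoted CSV whose cluster names live in a column called `Gene`, and that you have already worked out which outgroup gene belongs to which cluster (the `cluster_to_gene` mapping is hypothetical input, e.g. from a BLAST search). It does not update any summary columns such as isolate counts.

```python
import csv
import io
from typing import Dict


def add_outgroup_column(
    csv_text: str,
    outgroup_name: str,
    cluster_to_gene: Dict[str, str],
) -> str:
    """Return the spreadsheet text with one extra column for the outgroup.

    Assumption: cluster names are in a column named "Gene". Clusters with
    no outgroup gene get an empty cell.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + [outgroup_name]
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fieldnames, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        row[outgroup_name] = cluster_to_gene.get(row["Gene"], "")
        writer.writerow(row)
    return out.getvalue()
```

The matching outgroup sequences would still have to be provided in the multifasta file mentioned above, using the same gene IDs as in the new column.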

swuyts commented 8 years ago

Hey Andrew,

Thanks! We will try to figure something out with your feedback!

Sander

thkuo commented 7 years ago

Hi @swuyts and @andrewjpage, may I ask a related question: why were closely related species expected to cause a large number of N's in the core gene alignment, and do they result from the MAFFT settings?

swuyts commented 7 years ago

By "closely related species" I was still talking about an outgroup. So by definition it will have fewer core genes in common with the rest of the genomes you're analyzing.

One way to still incorporate an outgroup in your analysis is to adjust your definition of a core gene. For example, in our case we changed the 'cd' parameter to a lower number (e.g. 95). The disadvantage is that some genes will then be defined as core genes even though they are not actually found in all genomes, only in more than 95% of them.
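To make the effect of a lower core definition concrete, here is a small arithmetic sketch. It assumes the threshold is a simple inclusive percentage of genomes; Roary's exact rounding is not confirmed here, and the function name is hypothetical.

```python
import math


def min_genomes_for_core(n_genomes: int, core_definition_pct: float) -> int:
    """Smallest number of genomes a gene must occur in to count as core.

    Assumption: a gene is core when it is present in at least
    core_definition_pct percent of the genomes.
    """
    return math.ceil(n_genomes * core_definition_pct / 100)
```

With 20 ingroup genomes plus 1 outgroup, `min_genomes_for_core(21, 95)` gives 20, so under this assumption a gene that is missing only from the outgroup still counts as core, whereas a 100% definition would require all 21.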

As the outgroup will not share many core genes with the rest of your genomes, Roary will put N's in every core gene for which it did not find a match in the outgroup genome (correct me if I'm wrong). This leads to long strings of N's in your core alignment, which is what I meant when I said that a large number of N's was expected for such a species.

Kind regards, Sander

thkuo commented 7 years ago

Many thanks, @swuyts! If I understand correctly, the N's result from gene gain/loss events. It also means that Roary presents a single recombination event (e.g. deletion of a gene) as multiple mutations in the alignment. I am not sure whether the alignment needs some processing before it is used to infer phylogeny.

swuyts commented 7 years ago

No problem!

Interesting point that you raised at the end. To me it feels that instead of representing this event as N's, it might be better to use dashes to avoid problems in later processing steps.
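Until something like that exists in the tool, the swap could be done as a post-processing step on each per-gene alignment. The sketch below is only an illustration under one assumption: a sample whose sequence for a gene consists entirely of N's is treated as "gene missing". The function name is hypothetical.

```python
from typing import Dict


def mask_missing_as_gaps(alignment: Dict[str, str]) -> Dict[str, str]:
    """Replace sequences made up only of N's with dashes of the same length.

    Assumption: an all-N sequence marks a gene that is absent from that
    sample. Sequences with real bases (and scattered N's) are left untouched.
    """
    fixed: Dict[str, str] = {}
    for sample, seq in alignment.items():
        if seq and set(seq.upper()) == {"N"}:
            fixed[sample] = "-" * len(seq)
        else:
            fixed[sample] = seq
    return fixed
```

Working per gene rather than on the concatenated alignment avoids touching genuine ambiguous bases inside genes that are present.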

thkuo commented 7 years ago

May I ask one more short question: are the N's explained on the website or described in the tutorial? I just want to understand the details of the tool carefully. For example, if gene loss is presented as continuous N's, I am also curious how duplicated genes are shown in the alignment (in case a core gene identified by Roary is not single-copy).

andrewjpage commented 7 years ago

If a core gene contains a paralog (2 or more sequences from the same sample), it is excluded from the core alignment by default. There is a new option --allow_paralogs which includes these genes in the final core.
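The rule described here (two or more sequences from the same sample means the cluster is dropped) can be sketched as follows. This is an illustration of the rule as stated in this thread, not Roary's actual code; the data layout and function names are assumptions, and the sketch does not check whether a cluster is core in the first place.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical layout: each cluster is a list of (sample, gene_id) pairs.
Cluster = List[Tuple[str, str]]


def has_paralogs(cluster: Cluster) -> bool:
    """True if any sample contributes two or more sequences to the cluster."""
    per_sample = Counter(sample for sample, _gene_id in cluster)
    return any(count > 1 for count in per_sample.values())


def clusters_for_core_alignment(
    clusters: Dict[str, Cluster],
    allow_paralogs: bool = False,
) -> List[str]:
    """Names of clusters kept for the alignment under the rule above."""
    return [
        name
        for name, members in clusters.items()
        if allow_paralogs or not has_paralogs(members)
    ]
```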

andrewjpage commented 7 years ago

And @swuyts, I agree that dashes would be better than N's to denote missing data.