nickjcroucher / gubbins

Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins
http://nickjcroucher.github.io/gubbins/
GNU General Public License v2.0

New ska version and masking #392

Closed pgcudahy closed 8 months ago

pgcudahy commented 9 months ago

Hello, thanks for the great tool. I had run an analysis a while ago and now a peer reviewer has asked me to alter it. In the meantime my university's compute cluster changed so I had to rebuild my pipeline. With this, I have seen a large change in gubbins' output that I would like to understand. I have a collection of 217 M. kansasii samples along with the FDAARGOS_1615 type strain. Under gubbins v3.2.1 generate_ska_alignment.py produces an alignment that starts with

>FDAARGOS_1615
---------------CTTCTCGTAGATGATGCGTTCGCGAGTTTCCTTGAGGTCCATGCCTTCTCCTGAGTTATGAGGTGATGACGCCGATCTGAACGCCGATGTCGTAGATGCCACCGGCGGCCGCGGTCCACTGCACGGTGCCGGACGCACCGGCCTCGATTTCCTGTTCGGCCTTTTCGGTTGCGATGACGTAGATCGGGGTGCCCTCCTGGACATGCTGGCCGTTTTCGACGAGCAGTTCCACGAGTTCGGCCTCCGACACCGCGACCGACACACGCGGGATGCGAATGACGAAGTCAGCCATGAGATTCGGTCCTCGTCAAAGTCCGCTGTACCGCCGCGACGATCCGTGCCGGCGACGGATACACCTGCGCCTCCAGCGCGGCGGCGGCCGGGCTCGGGACGAACCGTGCACCGACCCGCTCCACGGGCGCGACGAGTTCGCCGAACAATTCCGATTGCAGTATCGCGGCGATCTCGGCCCCGGGCCCGCCGAATTGCACGGCGTCGTGCACCACGACGGCTCTTGTTGTGCGACGGACGGATTCGACGACGGTCTCGACATCCAGCGGCACCAGGGTGCGCAGGTCGACGACCTCGGCGCTGACCCCCTGCTCCTGCAGGGTGGCCGCCGCCGCCAGCGCGTCGTGCACACTGCGCCCGTAGCTGATCAGGCTCACGTCGGTGCCGGGCCGCTTGATCTCGGCCTGCCCCAACGGGATCGAGAAACCGGGGTCGACGGGGACGGGTCCGCGTTTGCCTTGCAGCCGGATGGTTTCGACGAACAGGCACGGGTCTTCGTCGAAGATCGCGGCGGTCAGTAGGCCCTTGCCGTCGCGCGGGGTGGACGGGACGATCACCTTCATCCCGGGAATGTGCATGAACCACGCCTCCAACGTCTGCGAATGCGTGGCCCCGGTGGCCAATCCGGCGTACACCTGGGTCCGCACGGTGATCGGCGCGGTGGTGCGTCCACCCGTCATGAAGCGCAGCTTGGCGGCATGATTGATCAACTGGTCGGCGGCGATGCCGATAAAATCCATGATCATGATCTCGGCCACCGGCAGCATCCCGTCTATCGCCGCGCCGATCGCCGCACCCACGATCGCGGCCTCCGAGATCGGGGTGTCCATGACCCGGTCGTTGCCGTACTTCGTCGACAGGCCCGTGGTGGGTCCGGACGCGCCGGGATCGGCGATGTCCTCCCCGAGCAGGAACACCCGGTCATCGGCTTGTAGCGCCTGATCGAGTGCGAGGTTGAGCGCCTCGCGCATCGTCATCTCTTGTTCGGCCATCTGCCCGGCCTCACACCGGGAATCCGATCGGCGCTGCGTATACGTCACGTTCGAGTTCGTCGGCGGACGGTGAATCAGCGTTCAGCACAGCACTCAAAGCGGTTTCCACGATATGCAGCGCGTCGTCGTCGATGCGGCTGAGTTCGTCCTCGCCGCAGATTCCGGCTTCGAGGAGGTGGTTGCGGAACCGTGGCACCGGGTCGGTCGCCATCGCCGCCGCCAGTTGATCTTTTGGTATATAGGCCATCCGGTCACCGAAGTAGTGGCCACGGAAGCGAAACGTCACGCACTCGATGAACGTGGGACCGCTACCGGCGCGGGCGGCGTCCACCGCTTGGTCGAGGGCCGACACGACCGCCAGCGGGTCGTTGCCATCGACCGCGACACCGGGCATACCGTAACCGGCGGCCCGGTCCGCGACATGCTCGAGCTTCATGGTGGCGGACGTGGGCGTCATCTCCGCGTACCGGTTGTTCTGGCACACGAACACCAACGGCAGGTCCCACAGCGCGGCCATATTGGCCGCCTCATGGAAGGAGCCGGTGTTGGTGGCGCCGTCGCCGAAGCTGACCACCGTGACCCGATCCAGGCCCTTGCGCTTGCCGGCCAGTGCCAGCCCGACGGCGACCGGCGGGCCGGCGCCGACGATGCCGGTCGAAAGCATCACGCCCACCTCAGGATTGGCGA

With gubbins v3.3.3 and the same list of samples it gives

>FDAARGOS_1615
CGCTATCGCGCCGGTCTTCTCGTAGATGATGCGTTCGCGAGTTTCCTTGAGGTCCATGCCTTCTCCTGAGTTATGAGGTGATGACGCNNNNNNNNNNNNNNNNNTCGTAGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTAGATCGGGGTGCCCTCCTGGACANNNNNNNNNNNNNNNNNNAGCAGTTCCACGAGTTCGGCCTCCGACACCGCGACCGACACACGCGGGATGCGAATGACGAAGTCAGCCATGAGATTCGGTCCTCGTCAAAGTCCGCTGTACCGCCGCGACGATNNNNNNNNNNNNNNNNNACACCTGCGCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCGCTCCACGGNNNNNNNNNNNNNNNNNNNNNATTCCGATTGCAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGGCTCTTGTTGTGCGNNNNNNNNNNNNNNNNNCGGTCTCGACATCCANNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNTGCTCCTGCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACTGCGCCCGTAGCTGATCAGGCTCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCAACGGGATCGAGANNNNNNNNNNNNNNNNNACGGGTCCGCGTTTGCCTTGCAGCCGGATGGTTTCGACGAACAGGCACGGGTCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCGTCGCGCGGGGNNNNNNNNNNNNNNNNNNNNNTCCCGGGAATGTGCATGAACCACGCCTCCAACGTCTGCGAATGCGTGGCCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNATCGGCGCGGTGGTGCGTCCACCCGTCATGAANNNNNNNNNNNNNNNNNGATTGATCAACTGNNNNNNNNNNNNNNNNNTAAAATCCATGATCATGATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGATCGCGGCCTCCGAGATCGGGGTGTCCATGACCCGGTCGTTGCCGTACTTCGTCGACAGGCCCGTGGTGGGTCCGGACGCGCCGGGATCGNNNNNNNNNNNNNNNNNCAGGAACACCCGGTCATNNNNNNNNNNNNNNNNNTCGAGTGCGAGGNNNNNNNNNNNNNNNNNCGTCNNNNNNNNNNNNNNNNNNNNNNNNNNNTCACACCGGNNNNNNNNNNNNNNNNNCGTATACGTCACGTTCGAGTTCGTCGGCGGACGGTGAATCAGCGTTCAGCACAGCACTCAAAGCGGTTTCCACGATATNNNNNNNNNNNNNNNNNNNNNNNNNNNGTTCGTCCTCGCCGCAGATTCCGGCTTCGAGGAGGTGGTTGCGGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGTTGATCTTTTGGTATATAGGCCATCCGGTCACCGAAGTAGTGGCCACGGAAGCGAAACGTCACGCACTCGATGAACGTGGGACCGCTANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGGCNNNNNNNNNNNNNNNNNGGTCGTTGNNNNNNNNNNNNNNNNNGGGCATACCGTAACNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTCATGGTGGCGGACGTGGGCGTCATCTCCGCGTACCGGTTGTTCTGGCACACGAACACCAANNNNNNNNNNNNNNNNNCGGCCATATTGGCCGCCTCATGGAAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTGACCACCGTGACCCNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNCCGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGTCNNNNNNNNNNNNNNNNNCTCAGGATTGGCGA

I believe this change is because the flag --repeat-mask was added to ska map. However, gubbins now fails with "Excluded sequence FDAARGOS_1615 because it had 43.94050504976698 percentage missing data while a maximum of 25.0 is allowed", and it reports the same error for every sample in my collection. Do you have any advice on how to move forward?

Thanks, Patrick

nickjcroucher commented 9 months ago

You can change the percentage missing data used by Gubbins to filter the alignment - worth removing any low quality ingroup sequences before relaxing that criterion though.

pgcudahy commented 9 months ago

Thanks for your reply. How would you recommend removing low quality ingroup sequences?

nickjcroucher commented 9 months ago

You can use the gubbins_alignment_checker.py script that is included in the package (https://github.com/nickjcroucher/gubbins/blob/master/python/scripts/gubbins_alignment_checker.py). It will identify isolates with high missing base counts, but won't filter them at present.
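
Since the checker reports isolates with many missing bases but does not remove them, the filtering step could be done by hand. The following is a minimal sketch, not the checker's own code: it assumes that both gaps (-) and Ns count as missing data, and the names read_fasta, percent_missing and split_by_missing are hypothetical helpers introduced only for this illustration. The 25.0 default mirrors the maximum quoted in the error message above.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: gaps and Ns are both treated as missing data.
MISSING = set("-Nn")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) pairs from the lines of a FASTA alignment."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def percent_missing(seq: str) -> float:
    """Percentage of alignment columns that are missing in this sequence."""
    if not seq:
        return 100.0
    return 100.0 * sum(base in MISSING for base in seq) / len(seq)


def split_by_missing(
    records: Iterable[Tuple[str, str]], max_percent: float = 25.0
) -> Tuple[List[Tuple[str, str]], Dict[str, float]]:
    """Separate records into those kept and those above the threshold."""
    kept, dropped = [], {}
    for name, seq in records:
        pct = percent_missing(seq)
        if pct > max_percent:
            dropped[name] = pct
        else:
            kept.append((name, seq))
    return kept, dropped


if __name__ == "__main__":
    # Toy alignment; a real one would be read with open("alignment.aln").
    demo = [">a", "ACGTACGT", ">b", "ACNN----", ">c", "ACGTNNAC"]
    kept, dropped = split_by_missing(read_fasta(demo), max_percent=25.0)
    print("kept:", [name for name, _ in kept])
    print("dropped:", dropped)
```

Writing the kept records back out as FASTA gives an alignment from which the worst isolates have been removed before the threshold is relaxed. If every sample sits at a similar level of missing data, as in this thread, this step will not separate good from bad samples and the cause is more likely the masking itself.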

pgcudahy commented 9 months ago

Thanks, I tried the alignment checker and it shows the same info as the error messages which is that all of my samples have gone from < 10% to now > 40% Ns. Is it still valid to run gubbins with a 50% filter percentage? I was a bit confused by the responses to issues 275 and 359 and whether they apply now that ska masks repeats.

nickjcroucher commented 9 months ago

You can try increasing the k-mer size to improve the mapping to smaller repeated sequences. Do you expect this much of the genome to be repeated sequence? If you think there is a problem with repeat detection in this case, you can raise an issue at https://github.com/bacpop/ska.rust.
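
One rough way to judge whether this much of the genome could plausibly be repeated is to count how much of the reference is covered by k-mers that occur more than once, and to see how that changes with k. The sketch below is only a proxy under stated assumptions: it uses plain contiguous k-mers with reverse-complement canonicalisation, which is not necessarily how ska decides what to mask, and repeated_fraction is a hypothetical helper. The values of k shown are illustrative.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def repeated_fraction(seq: str, k: int) -> float:
    """Fraction of positions covered by at least one k-mer seen more than once.

    k-mers containing N are ignored (an assumption of this sketch).
    """
    seq = seq.upper()
    n_kmers = len(seq) - k + 1
    if n_kmers <= 0:
        return 0.0
    kmers = [canonical(seq[i:i + k]) for i in range(n_kmers)]
    counts = Counter(kmer for kmer in kmers if "N" not in kmer)
    covered = 0
    covered_end = 0  # first position not yet counted as covered
    for i, kmer in enumerate(kmers):
        if counts.get(kmer, 0) > 1:
            # Count only the positions not already covered by an earlier k-mer.
            covered += i + k - max(i, covered_end)
            covered_end = i + k
    return covered / len(seq)


if __name__ == "__main__":
    toy = "AAAACCCCAAAA"
    for k in (4, 5):
        print(k, round(repeated_fraction(toy, k), 3))
```

On a real genome the reference sequence would be read from its FASTA file and passed in as a single string per contig. If the repeated fraction falls sharply as k grows, short repeats are driving the masking and a larger k-mer size should recover much of the alignment.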