tseemann / mlst

Scan contig files against PubMLST typing schemes
GNU General Public License v2.0

--novel outputting "odd" loci #84

Closed cimendes closed 5 years ago

cimendes commented 5 years ago

Hello!

First of all thank you for creating such a useful tool! We've implemented it in our routine pipeline and so far it's been working great for us. :)

We recently added the --novel option to our mlst command, which runs in autodetect mode, to save the novel alleles. Since then we've noticed that alleles belonging to a species that is not present in the sample are being reported. For example, BMC1445c, a WGS of a Streptococcus pneumoniae sample, has the following mlst result:

#scheme      #ST  #profile
spneumoniae  -    aroE(7) gdh(15) gki(2) recP(~10) spi(6) xpt(1) ddl(22)

I expected the novel alleles file to contain the sequence for the recP gene in the spneumoniae scheme, but I get the following:

>soralis.gki~32 BMC1445c.contigs.length_GCcontent_kmerCov.mappingCov.polished.fasta
ACCCTTCAACCAATCAAACAAAAGATTGAAAAAGCTTTGGGCATTCCATTTTTCATCGATAATGATGCCAACGTAGCAGCTCTTGGTGAGCGCTGGATGGGTGCTGGAGATAACCAACCAGACGTTGTCTTTATGACACTCGGTACTGGTGTTGGTGGCGGTATCGTCGCAGAAGGCAAATTGCTTCACGGTGTTGCTGGTGCAGCAGGTGAGCTTGGTCACATCACTGTTGACTTTGACCAGCCAATCTCATGTACTTGCGGTAAGAAAGGCTGCCTTGAGACAGTTGCTTCAGCAACAGGGATTGTCAACTTGACTCGTCGCTATGCCGATGAATACGAAGGCGATGCAGCCTTGAAACGCTTGATTGATAATGGAGAAGAAGTAACTGCTAAGACTGTCTTTGATCTCGCAAAAGAAGGAGACGACCTT
>soralis.recP~11 BMC1445c.contigs.length_GCcontent_kmerCov.mappingCov.polished.fasta
CTGTAAAGGAATCCTTTGTCTCACCATCCAAGTTGATATCATTTGAATCATAAAGAACAACCAACTTATCAAGTTTTTGCAAGCCTGCGTATGAAGCTGCCTCGCTTGAGACACCTTCCATCAAGTCTCCGTCTCCACAGATAACGTAAGTATAGTGGTCAAAGATATTGTAGCCTTCACGGTTATATTTGGCTGCCAAGAAACGTTCTGCTTGGGCAAAACCAGTAGCAGTTGAAATCCCTTGCCCTAGAGGACCTGTCGTAGCATCAATCCCTGCCGTATGACCAAATTCTGGGTGACCTGGTGTTTTTGAACCCCATTGACGGAAGCTCTTAATCTCATCCATGCTGACATCTTCAAAACCAGAAAGGTGAAGAAGAACATAAAGGAGCATTGAACCATGACCTGCTGAAAGAATAAAGCGGTCGCGGTTAATCCAGTTTGGTTG
>soralis.xpt~1 BMC1445c.contigs.length_GCcontent_kmerCov.mappingCov.polished.fasta
ATCCTCAAGGTAGATTCCTTTTTAACCCACCAAGTTGACTTTAGCTTGATGCGAGAGATTGGTAAGGTTTTTGCGGAAAAATTTGCTGCTACTGGCATTACCAAGGTCGTAACCATTGAAGCGTCGGGTATTGCCCCAGCCGTTTTTACAGCTGAAGCCTTAAACGTTCCCATGATTTTCGCCAAAAAAGCTAAGAACATCACCATGAACGAAGGCATCTTAACTGCTCAAGTCTACTCCTTTACCAAGCAGGTGACCAGCACTGTTTCTATCGCTGGAAAATTCCTCTCACCAGAGGACAAGGTTTTGATTATCGACGATTTCCTTGCTAATGGCCAAGCTGCTAAAGGCTTGATTCAAATCATCGAACAGGCCGGTGCCACAGTCCAAGCTATCGGTATCGTGATTGAGAAATCCTTCCAAGATGGTCGTGATTTGCTTGAAAAAGCAGGCTACCCTGTCCTATCACTTGCTCGCTTGGATCGTTT
>soralis.aroE~1 BMC1445c.contigs.length_GCcontent_kmerCov.mappingCov.polished.fasta
GCTTGGGAGATTGAAGCGAGTGACTTGGTAGAAACAGTGGCCAATATTCGTCGCTACCAGATGTTTGGCATCAATCTGTCCATGCCCTATAAGGAGCAGGTGATTCCTTATTTGGATAAGCTGAGCGATGAAGCGCGCTTGATTGGTGCGGTTAATACGGTTGTCAATGAGAATGGCAATTTAATTGGATATAATACAGATGGCAAGGGATTTTTTAAGTGCTTGCCTTCTTTTACAATTTCAGGTAAAAAGATGACCCTGCTGGGTGCAGGTGGTGCGGCTAAATCAATCTTGGCACAGGCTATTTTGGATGGCGTCAGTCAGATTTCGGTCTTTGTTCGTTCCGTTTCTATGGAAAAAACAAGACCTTACCTAGACAAGTTACAGGAGCAGACAGG
>spneumoniae.recP~10 BMC1445c.contigs.length_GCcontent_kmerCov.mappingCov.polished.fasta
CTGTAAAGGAATCCTTTGTCTCACCATCCAAGTTGATATCATTTGAATCATAAAGAACAACCAACTTATCAAGTTTTTGCAAGCCTGCGTATGAAGCTGCCTCGCTTGAGACACCTTCCATCAAGTCTCCGTCTCCACAGATAACGTAAGTATAGTGGTCAAAGATATTGTAGCCTTCACGGTTATATTTGGCTGCCAAGAAACGTTCTGCTTGGGCAAAACCAGTAGCAGTTGAAATCCCTTGCCCTAGAGGACCTGTCGTAGCATCAATCCCTGCCGTATGACCAAATTCTGGGTGACCTGGTGTTTTTGAACCCCATTGACGGAAGCTCTTAATCTCATCCATGCTGACATCTTCAAAACCAGAAAGGTGAAGAAGAACATAAAGGAGCATTGAACCATGACCTGCTGAAAGAATAAAGCGGTCGCGGTTAATCCAGTTTGGTTGAG
>soralis.gdh~11 BMC1445c.contigs.length_GCcontent_kmerCov.mappingCov.polished.fasta
GAACACTTTATCCGTGGACAATACCGCTCTGGTAAGATTGATGGCATGAAATACATCTCTTATCGTAGCGAACCAAATGTGAATCCAGAATCAACAACTGAAACCTTTACATCTGGTGCCTTCTTTGTAGACAGCGATCGATTCCGTGGTGTTCCTTTCTTTTTCCGTACAGGTAAACGACTGACTGAAAAAGGAACTCATGTCAACATCGTCTTTAAACAAATGGATTCTATATTTGGAGAACCACTTGCTCCAAATATTTTGACCATCTATATTCAACCAACAGAAGGCTTCTCTCTTAGCCTAAATGGGAAGCAAGTAGGAGAAGAATTTAACTTGGCTCCTAACTCACTTGATTATCGTACAGACGCGACTGCAACTGGTGCTTCTCCAGAACCATACGAGAAATTGATTTATGATGTCCTAAATAACAACTCAACTAACTTTAGCCACTGGGAT
>spneumoniae.gdh~497 BMC1445c.contigs.length_GCcontent_kmerCov.mappingCov.polished.fasta
AGAACTCAAAGAACACTTTATCCGTGGACAATACCGCTCTGGTAAGATTGATGGCATGAAATACATCTCTTATCGTAGCGAACCAAATGTGAATCCAGAATCAACAACTGAAACCTTTACATCTGGTGCCTTCTTTGTAGACAGCGATCGATTCCGTGGTGTTCCTTTCTTTTTCCGTACAGGTAAACGACTGACTGAAAAAGGAACTCATGTCAACATCGTCTTTAAACAAATGGATTCTATATTTGGAGAACCACTTGCTCCAAATATTTTGACCATCTATATTCAACCAACAGAAGGCTTCTCTCTTAGCCTAAATGGGAAGCAAGTAGGAGAAGAATTTAACTTGGCTCCTAACTCACTTGATTATCGTACAGACGCGACTGCAACTGGTGCTTCTCCAGAACCATACGAGAAATTGATTTATGATGTCCTAAATAACAACTCAACTAACTTTAGC

The recP(~10) allele is reported, as expected, but so are a bunch of "novel" alleles for the soralis scheme. At first we thought it might be contamination with soralis, but after running Kraken2 and ReMatCh in mlst mode, we're fairly certain that there is none: there are only pneumo sequences (with little left unclassified) and no multiple mlst alleles are present.
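
For pipelines running an affected version, the unexpected records can be dropped downstream by keeping only the scheme that mlst reported. This is a minimal sketch, assuming the header layout shown above (`scheme.gene~N filename`). The function name `filter_by_scheme` is hypothetical and not part of mlst:

```python
def filter_by_scheme(fasta_text, scheme):
    """Keep only FASTA records whose ID starts with '<scheme>.'.

    Hypothetical helper, not part of mlst. Assumes headers look like
    '>scheme.gene~N filename', as in the output shown above.
    """
    kept = []
    keep = False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # The record ID is the first whitespace-delimited token after '>'.
            parts = line[1:].split(None, 1)
            record_id = parts[0] if parts else ""
            keep = record_id.startswith(scheme + ".")
        if keep:
            kept.append(line)
    # Return FASTA text with a trailing newline, or '' if nothing matched.
    return "\n".join(kept) + ("\n" if kept else "")


# Tiny made-up example (sequences shortened for illustration).
demo = (
    ">soralis.gki~32 sample.fasta\nACCCTTCAACC\n"
    ">spneumoniae.recP~10 sample.fasta\nCTGTAAAGGAA\n"
)
print(filter_by_scheme(demo, "spneumoniae"), end="")
```

This only hides the symptom: it does not explain why alleles from another scheme were written in the first place.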

I've attached the assembly for this particular sample to this issue. Thank you very much for your help!

BMC1445c.contigs.length_GCcontent_kmerCov.mappingCov.polished.fasta.zip

tseemann commented 5 years ago

I have commenced examining this, and let me tell you, my --novel code is a total shambles :-)

tseemann commented 5 years ago

I have fixed this now! Thanks for letting me know. I have updated docs too:

You can also save the "novel" alleles for submission to PubMLST:

% mlst -q --novel nouveau.fa s_myces.fasta

% cat nouveau.fa

>streptomyces.recA-e562a2cd93e701e3b58ba0670bcbba0c s_myces.fasta
GACGTGGCCCTCGGCGTCGGCGGTCTGCCGCGCGGCCGCGTCGTCGAGATCTACGGACCGGAGTCCTCC...

The format of the sequence IDs is scheme.allele-hash filename, where hash is the hexadecimal MD5 digest of the allele DNA sequence.
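
The ID format described above can be reproduced in a few lines, which may be handy for checking a record before submission. This is a minimal sketch, assuming the digest is taken over the sequence string exactly as written in the FASTA. Whether mlst normalises case or strips anything first is not stated here. `novel_allele_id` and `split_novel_id` are hypothetical helper names, not part of mlst:

```python
import hashlib


def novel_allele_id(scheme, allele, sequence):
    """Build a 'scheme.allele-hash' ID from an allele DNA sequence."""
    # Assumption: MD5 over the sequence exactly as given (no case folding).
    digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
    return f"{scheme}.{allele}-{digest}"


def split_novel_id(record_id):
    """Split 'scheme.allele-hash' into (scheme, allele, hash)."""
    # The hash is hexadecimal and contains no '-', so split from the right.
    name, digest = record_id.rsplit("-", 1)
    scheme, allele = name.split(".", 1)
    return scheme, allele, digest


seq = "GACGTGGCCCTCGGC"  # made-up fragment for illustration, not a real allele
rid = novel_allele_id("streptomyces", "recA", seq)
print(f">{rid} s_myces.fasta")
print(split_novel_id(rid)[:2])
```

Because the hash depends only on the sequence, identical novel alleles found in different assemblies get the same ID, unlike the earlier `~N` closest-match labels.
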

cimendes commented 5 years ago

:fireworks: Thank you so much Torsten! :fireworks: