tseemann / mlst

:id: Scan contig files against PubMLST typing schemes
GNU General Public License v2.0
201 stars 47 forks source link

Incorrect output for schemes with number of alleles < 7 when no hits are found by blastn #125

Closed stroehleina closed 2 years ago

stroehleina commented 2 years ago

When running mlst against a sequence file that does not produce any blastn hits and against a scheme that has less than 7 alleles, mlst outputs 7 allele columns (instead of as many allele columns as are defined in the scheme). This occurs both for --legacy mode and for normal mode.

Expected behavior

mlst should output as many allele columns as there are alleles present in the scheme, not 7.

Minimal example to reproduce the error

head debugmlst.fa

>NODE_1_length_260_cov_0.808743
GTTTCATACTTCATCAATCATTCAAACTCCGGTTTGTGTCGGCATCGGATTACGGGTAGC
CAACCGTTGCCTGTCCGCTCCGACATATTCCGACAGTGTGAGATAAGGATTTATTCGATG
AAATCACTCAAAACCTTCCTCATTTGGGGCATAGTGGTACTGGTCGGCTTAGCATCCTTT
ACCACTCTGGCCCTCAGCCGAGGCGAACAGGTCAGCGCGGTATGGATGGTTACCGCCGCC
ATATCCGTTTACTGTATCGC
>NODE_2_length_250_cov_0.855491
GTTCCGCAATACCGATGTTTTCCGCCGCCGCGGTGGGAAAGAAAGCGACGGCGTCAAAGC
CCAGCTCGCCCAGGGTCTGCTCGACCTCAGCCTGAACCTCCGCCACCCTGGCGGCCGCCA
CGCGGTCCGTTTTGGTCAGCGCCACGGTCAGCGTCGGCTTTCCGGTCAACTGCAGGATCG

(gzipped example file debugmlst.fa attached)

conda create -n debugmlst mlst conda activate debugmlst

Make sure mlst is running correctly

mlst --version

mlst 2.22.1

mlst --check

[08:05:29] This is mlst 2.22.1 running on linux with Perl 5.032001
[08:05:29] Checking mlst dependencies:
[08:05:29] Found 'blastn' => /home/astroehlein/.conda/envs/debugmlst/bin/blastn
[08:05:29] Found 'any2fasta' => /home/astroehlein/.conda/envs/debugmlst/bin/any2fasta
[08:05:29] Found blastn: 2.13.0+ (002013)
[08:05:29] OK.

1. Example 1 (--scheme vibrio, 4 alleles)

mlst --debug --scheme vibrio debugmlst.fa

[08:01:44] This is mlst 2.22.1 running on linux with Perl 5.032001
[08:01:44] Checking mlst dependencies:
[08:01:44] Found 'blastn' => /home/astroehlein/.conda/envs/debugmlst/bin/blastn
[08:01:44] Found 'any2fasta' => /home/astroehlein/.conda/envs/debugmlst/bin/any2fasta
[08:01:44] Found blastn: 2.13.0+ (002013)
[08:01:44] Setting --minscore=0 because user chose --scheme
[08:01:44] Running:  any2fasta -q debugmlst\.fa | blastn -db \/home\/astroehlein\/\.conda\/envs\/debugmlst\/db\/blast\/mlst\.fa -num_threads 1 -ungapped -dust no -word_size 32 -max_target_seqs 10000 -perc_identity 95 -evalue 1e-20 -outfmt '6 sseqid slen length nident qseqid qstart qend qseq sstrand'
$VAR1 = [
          'vibrio',
          '-',
          '-/-/-/-/-/-/-',
          0
        ];
debugmlst.fa    vibrio  -   gyrB(-) pyrH(-) recA(-) atpA(-) (-) (-) (-)
[08:01:44] You can follow me on Twitter at @torstenseemann
[08:01:44] Done.

2. Example 2 (--scheme mhyopneumoniae, 3 alleles)

mlst --debug --scheme mhyopneumoniae debugmlst.fa

[08:03:43] This is mlst 2.22.1 running on linux with Perl 5.032001
[08:03:43] Checking mlst dependencies:
[08:03:43] Found 'blastn' => /home/astroehlein/.conda/envs/debugmlst/bin/blastn
[08:03:43] Found 'any2fasta' => /home/astroehlein/.conda/envs/debugmlst/bin/any2fasta
[08:03:43] Found blastn: 2.13.0+ (002013)
[08:03:43] Setting --minscore=0 because user chose --scheme
[08:03:43] Running:  any2fasta -q debugmlst\.fa | blastn -db \/home\/astroehlein\/\.conda\/envs\/debugmlst\/db\/blast\/mlst\.fa -num_threads 1 -ungapped -dust no -word_size 32 -max_target_seqs 10000 -perc_identity 95 -evalue 1e-20 -outfmt '6 sseqid slen length nident qseqid qstart qend qseq sstrand'
$VAR1 = [
          'mhyopneumoniae',
          '-',
          '-/-/-/-/-/-/-',
          0
        ];
debugmlst.fa    mhyopneumoniae  -   adk(-)  rpoB(-) tpiA(-) (-) (-) (-) (-)
[08:03:43] Somehow this tool escaped my normal Australiana based naming system.
[08:03:43] Done.

3. Example 3 (--scheme mflocculare, --legacy mode, 3 alleles)

mlst --debug --legacy --scheme mflocculare debugmlst.fa

[08:23:09] This is mlst 2.22.1 running on linux with Perl 5.032001
[08:23:09] Checking mlst dependencies:
[08:23:09] Found 'blastn' => /home/astroehlein/.conda/envs/debugmlst/bin/blastn
[08:23:09] Found 'any2fasta' => /home/astroehlein/.conda/envs/debugmlst/bin/any2fasta
[08:23:10] Found blastn: 2.13.0+ (002013)
[08:23:10] Setting --minscore=0 because user chose --scheme
FILE    SCHEME  ST  adk rpoB    tpiA
[08:23:10] Running:  any2fasta -q debugmlst\.fa | blastn -db \/home\/astroehlein\/\.conda\/envs\/debugmlst\/db\/blast\/mlst\.fa -num_threads 1 -ungapped -dust no -word_size 32 -max_target_seqs 10000 -perc_identity 95 -evalue 1e-20 -outfmt '6 sseqid slen length nident qseqid qstart qend qseq sstrand'
$VAR1 = [
          'mflocculare',
          '-',
          '-/-/-/-/-/-/-',
          0
        ];
debugmlst.fa    mflocculare -   -   -   -   -   -   -   -
[08:23:10] Thanks for using mlst, I hope you found it useful.
[08:23:10] Done.

debugmlst.fa.zip

stroehleina commented 2 years ago

NB this only occurs when no blastn hits are found. Otherwise mlst creates the expected output columns, even with schemes with < 7 alleles. Two examples for --scheme vibrio (4 alleles) when all and one allele are found, respectively:

mlst --scheme vibrio vibrio_test.fa

[09:38:27] This is mlst 2.22.1 running on linux with Perl 5.032001
[09:38:27] Checking mlst dependencies:
[09:38:27] Found 'blastn' => /home/astroehlein/.conda/envs/debugmlst/bin/blastn
[09:38:27] Found 'any2fasta' => /home/astroehlein/.conda/envs/debugmlst/bin/any2fasta
[09:38:27] Found blastn: 2.13.0+ (002013)
[09:38:27] Setting --minscore=0 because user chose --scheme
[09:38:27] Found exact allele match vibrio.atpA-1
[09:38:27] Found exact allele match vibrio.gyrB-1
[09:38:27] Found exact allele match vibrio.pyrH-1
[09:38:27] Found exact allele match vibrio.recA-1
vibrio_test.fa  vibrio  1   gyrB(1) pyrH(1) recA(1) atpA(1)
[09:38:27] I am quite confident this won't be the last time you run mlst.
[09:38:27] Done.

mlst --scheme vibrio single_allele_vibrio_test.fa

[09:42:02] This is mlst 2.22.1 running on linux with Perl 5.032001
[09:42:02] Checking mlst dependencies:
[09:42:02] Found 'blastn' => /home/astroehlein/.conda/envs/debugmlst/bin/blastn
[09:42:02] Found 'any2fasta' => /home/astroehlein/.conda/envs/debugmlst/bin/any2fasta
[09:42:02] Found blastn: 2.13.0+ (002013)
[09:42:02] Setting --minscore=0 because user chose --scheme
[09:42:03] Found exact allele match vibrio.atpA-1
single_allele_vibrio_test.fa    vibrio  -   gyrB(-) pyrH(-) recA(-) atpA(1)
[09:42:03] Somehow this tool escaped my normal Australiana based naming system.
[09:42:03] Done.

vibrio_test.fa.zip single_allele_vibrio_test.fa.zip