tseemann / mlst

🆔 Scan contig files against PubMLST typing schemes
GNU General Public License v2.0

Handling of alleles that are subsequences of other alleles #62

Closed kriskiil closed 6 years ago

kriskiil commented 6 years ago

In the Senterica scheme the presence of allele aroC-5 results in aroC(5,807,819), since aroC-5 is a supersequence of aroC-807 and aroC-819. Warnings are issued, but no ST is assigned. I'm at a loss as to why subsequences are included as new alleles in the MLST database, but that is of course not your fault. When that is the state of the database, however, I think it should be handled in a transparent manner.

A proposed solution would be to report only the longest exact match when the other matching alleles are fully contained in it.

Example output:

[09:58:16] Found 'blastn' => /tools/miniconda3/envs/env_serumqc/bin/blastn
[09:58:16] Found 'gzip' => /usr/bin/gzip
[09:58:16] Found 'file' => /tools/miniconda3/bin/file
[09:58:16] Excluding 2 schemes: abaumannii ecoli_2
[09:58:16] Scanning: contigs.fasta [1805H3669_contigs.fasta]
[09:58:18] Found exact allele match senterica.purE-5
[09:58:18] Found exact allele match senterica.thrA-58
[09:58:18] Found exact allele match senterica.hisD-12
[09:58:18] Found exact allele match senterica.dnaN-14
[09:58:18] Found exact allele match senterica.aroC-5
[09:58:18] WARNING: found addtional exact allele match senterica.aroC-807
[09:58:18] WARNING: found addtional exact allele match senterica.aroC-819
[09:58:18] Found exact allele match senterica.hemD-6
[09:58:18] Found exact allele match ecoli.recA-152
[09:58:18] Found exact allele match senterica.sucA-14
contigs.fasta senterica - aroC(5,807,819) dnaN(14) hemD(6) hisD(12) purE(5) sucA(14) thrA(58)

Expected output of last line:

contigs.fasta senterica 166 aroC(5) dnaN(14) hemD(6) hisD(12) purE(5) sucA(14) thrA(58)
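
To make the proposal concrete, here is a minimal Python sketch of the filtering step. It is not how mlst is implemented (mlst is a Perl script that calls blastn). The `Hit` record, the `cull_enveloped` function and the contig coordinates are all hypothetical and only illustrate the idea: an exact match is dropped when a strictly longer exact match for the same locus covers it on the same contig.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One exact allele match on a contig (hypothetical record layout)."""
    locus: str
    allele: int
    contig: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


def cull_enveloped(hits):
    """Keep only hits not covered by a strictly longer hit of the same locus."""
    kept = []
    # Visit the longest hits first, so a covering hit is always seen
    # before any of the shorter hits it envelops.
    for h in sorted(hits, key=lambda h: (-(h.end - h.start), h.allele)):
        covered = any(
            k.locus == h.locus
            and k.contig == h.contig
            and k.start <= h.start
            and h.end <= k.end
            and (k.end - k.start) > (h.end - h.start)
            for k in kept
        )
        if not covered:
            kept.append(h)
    return kept


# Made-up coordinates that mimic the aroC case from this report.
hits = [
    Hit("aroC", 5, "contig_1", 1000, 1500),
    Hit("aroC", 807, "contig_1", 1000, 1450),  # lies inside aroC-5
    Hit("aroC", 819, "contig_1", 1020, 1500),  # lies inside aroC-5
    Hit("dnaN", 14, "contig_2", 7000, 7500),
]
kept = cull_enveloped(hits)
for h in kept:
    print(f"{h.locus}({h.allele})")
```

With this filter only aroC-5 survives for the aroC locus, so a single allele per locus is left for the ST lookup. Hits of equal length are deliberately kept, so genuine ambiguity would still be reported.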

tseemann commented 6 years ago

Yes we discovered this too. We contacted the curator @happykhan and he says it is a mistake in the scheme which will be fixed.

But it did highlight that I need to add -culling_limit 1 back to my blastn call!

happykhan commented 5 years ago

Please email enterobase@warwick.ac.uk if pain persists.