tseemann / mlst

:id: Scan contig files against PubMLST typing schemes
GNU General Public License v2.0
196 stars 46 forks source link

--minscore 100 reports no scheme even if all alleles match #39

Closed VGalata closed 6 years ago

VGalata commented 7 years ago

Dear Torsten,

I would like to report schemes only if all alleles have a match and, thus, I used --minscore 100. However, the tool reports no scheme though all alleles of the best scheme could be matched. Does the parameter mean that the result should be > the given minimal threshold and not >=?

Here is a reproducible example using an E. coli sample from NCBI RefSeq:

mlst --nopath --debug --minscore 100 GCF_001901025.1_ASM190102v1_genomic.fna.gz
...
[13:30:56] SCORE=45 kpneumoniae -   ~144/~115/-/~166/-/~146/-   (7 genes)
[13:30:56] SCORE=100    ecoli   1081    6/4/5/18/11/8/2 (7 genes)
[13:30:56] SCORE=7  koxytoca    -   -/-/-/-/-/24?/- (7 genes)
[13:30:56] SCORE=91 senterica   -   ~622/559/552/764/~686/~580/695  (7 genes)
[13:30:56] SCORE=11 paeruginosa -   -/-/~104/-/-/-/-    (7 genes)
[13:30:56] SCORE=22 cronobacter -   ~104/~92/-/-/-/-/-  (7 genes)
[13:30:56] SCORE=11 ecloacae    -   -/-/-/-/-/~106/-    (7 genes)
[13:30:56] SCORE=48 cfreundii   -   ~69/-/79?/~62/~43/61?/- (7 genes)
$VAR1 = [
          '-',
          '-',
          '-/-/-/-/-/-/-',
          0
        ];
[13:30:56] Deleting temporary files: /tmp/cXwRDDH70T
GCF_001901025.1_ASM190102v1_genomic.fna.gz  -   -
mlst --nopath --debug --minscore 99 GCF_001901025.1_ASM190102v1_genomic.fna.gz
...
[13:26:25] SCORE=11 ecloacae    -   -/-/-/-/-/~106/-    (7 genes)
[13:26:25] SCORE=48 cfreundii   -   ~69/-/79?/~62/~43/61?/- (7 genes)
[13:26:25] SCORE=11 paeruginosa -   -/-/~104/-/-/-/-    (7 genes)
[13:26:25] SCORE=91 senterica   -   ~622/559/552/764/~686/~580/695  (7 genes)
[13:26:25] SCORE=45 kpneumoniae -   ~144/~115/-/~166/-/~146/-   (7 genes)
[13:26:25] SCORE=100    ecoli   1081    6/4/5/18/11/8/2 (7 genes)
[13:26:25] SCORE=7  koxytoca    -   -/-/-/-/-/24?/- (7 genes)
[13:26:25] SCORE=22 cronobacter -   ~104/~92/-/-/-/-/-  (7 genes)
$VAR1 = [
          'ecoli',
          '1081',
          '6/4/5/18/11/8/2',
          100
        ];
$VAR2 = [
          '-',
          '-',
          '-/-/-/-/-/-/-',
          0
        ];
...
GCF_001901025.1_ASM190102v1_genomic.fna.gz      ecoli   1081    adk(6)  fumC(4) gyrB(5) icd(18) mdh(11) purA(8) recA(2)

Used tool versions: mlst-2.9, blastn: 2.2.31+, perl 5, version 22, subversion 1 (v5.22.1)

tseemann commented 7 years ago

@VGalata sorry for missing this issue when you submitted it 3 weeks ago.

You have indeed found a bug. I was using S > minscore instead of S >= minscore.

That said, there are some things you should know, and I should improve in the docs.

The scoring used to be "100/N" per allele for a scheme with N alleles. However it was changed to be "90/N" and there is 10 points for also having a known ST type. This allows us to differentiate between having N alleles (but novel scheme) and N alleles (existing scheme).

So if you are on ecoli with 7 genes you will need --minscore = 90 for all intact allelles, and with my bug fix, --minscore 100.

I will create an issue to improve the docs.

Thank you so much for your report! The --minscore is quite new.

VGalata commented 7 years ago

@tseemann,

Thank you very much for your reply!

Is the scoring scheme with "90/N" vs. "100/N" for known/unknown ST in the last release (version 2.9)? I assume not because I get known and unknown STs with --minscore 99. Or did I miss something?

Thank you in advance!

tseemann commented 7 years ago

@VGalata Hmmm. Can you send me the FASTA file that is causing this and I will see what is happening?

VGalata commented 7 years ago

@tseemann

I cannot send an example for an unknown ST (because of some data policy issues). But I would use the example in my first post for the other case of a known ST:

I used an E. coli genome from NCBI: GCF_001901025.1_ASM190102v1. With mlst version 2.9 I get SCORE=100 ecoli 1081 6/4/5/18/11/8/2 (7 genes) (see the output in my first post). As the score is 100 and the ST is also known I assume that the scoring scheme "90/N for known STs" is not used in this version. Is that correct?

tseemann commented 6 years ago

@VGalata sorry for the long delay here, i am prepping for a 2.10 release and want to resolve this. i have checked the code and minscore=90 will allow all matching alleles without a known ST, and 100 requires an ST as well. I will document this better.

% mlst -q --minscore GCA_001901025.1_ASM190102v1_genomic.fna.gz    
ecoli   1081 adk(6)     fumC(4) gyrB(5) icd(18) mdh(11)   purA(8) recA(2)
VGalata commented 6 years ago

I just saw that my last reply did not make really sense. However, regarding the issue with 90/N and 100/N, I get a 100 score for complete profiles with an unknown ST, here is an example output:

SCORE=100   kpneumoniae -   16/24/21/27/153/22/67   (7 genes)

This is not a big problem in my case, I just would like to be sure I understand correctly which scoring scheme is used and how the results are filtered. I will try to find an example to reproduce such results. (update: see my next comment)

I look forward to the new release!

VGalata commented 6 years ago

Update

Here is an example file - ERS1018585.fna.gz - of a P. aeruginosa I downloaded some time ago from here.

I get the following result

[07:41:02] SCORE=100    paeruginosa -   6/101/4/11/1/4/2    (7 genes)

using

mlst --debug --minid 99 --mincov 75 --minscore 99
tseemann commented 6 years ago

Are you using 2.10 version? https://github.com/tseemann/mlst/releases I think i released the same day you wrote the above comment!

% mlst ERS1018585.fna.gz  --debug

[16:44:38] SCORE=6      achromobacter   -       -/-/-/-/155?/-/-        (7 genes)
[16:44:38] SCORE=6      cronobacter     -       -/-/-/-/-/-/167?        (7 genes)
[16:44:38] SCORE=6      pfluorescens    -       56?/-/-/-/-/-/- (7 genes)
[16:44:38] SCORE=90     paeruginosa     -       6/101/4/11/1/4/2        (7 genes)

% mlst --version
mlst 2.10