marbl / Mash

Fast genome and metagenome distance estimation using MinHash
mash.readthedocs.org

Can Mash accurately classify subspecies? #172

Open rpalcab opened 2 years ago

rpalcab commented 2 years ago

Hello,

I'm currently working on Mycobacterium caprae and Mycobacterium bovis. These subspecies of the M. tuberculosis complex are phylogenetically very similar, so the task of identifying them is not always trivial.

In one of my analyses, I expected all the samples to be M. caprae, but in the Mash screen results many of them could be assigned to either subspecies: the top hits have the same shared-hashes score and p-value, or differ by only 1 in the shared-hashes score.

Sample A

0.99957 991/1000    77  0   GCF_001941665.1_ASM194166v1_genomic.fna.gz  NZ_CP016401.1 Mycobacterium caprae strain Allgaeu genome
0.99957 991/1000    77  0   GCF_001483905.1_ASM148390v1_genomic.fna.gz  NZ_CP013741.1 Mycobacterium bovis strain BCG-1 (Russia), complete genome
0.99957 991/1000    77  0   GCF_001274555.1_ASM127455v1_genomic.fna.gz  NZ_CP009243.1 Mycobacterium bovis BCG strain Russia 368, complete genome

Sample B

0.999377    987/1000    193 0   GCF_000195835.1_ASM19583v1_genomic.fna.gz   NC_002945.3 Mycobacterium bovis AF2122/97 chromosome, complete genome
0.999329    986/1000    193 0   GCF_001941665.1_ASM194166v1_genomic.fna.gz  NZ_CP016401.1 Mycobacterium caprae strain Allgaeu genome
0.999329    986/1000    193 0   GCF_001580385.1_ASM158038v1_genomic.fna.gz  NZ_CP014566.1 Mycobacterium bovis BCG str. Tokyo 172 substrain TRCS, complete genome

This makes me wonder whether Mash screen can identify organisms at the subspecies level. Also, is a difference of 1 in the shared-hashes score robust enough to determine the taxonomy of an organism?
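To put a number on that one-hash gap, here is a rough Python sketch. It assumes the identity column is computed as (shared / sketch size)^(1/k) with k = 21, which reproduces the values listed above, and it assumes a simple binomial model for sampling noise in the shared-hash count. The function names are hypothetical helpers for illustration, not part of Mash.

```python
import math


def screen_identity(shared, sketch_size, k=21):
    """Identity estimate from a containment fraction: (shared / sketch_size) ** (1 / k).

    k=21 is assumed here because it reproduces the identities in the output above.
    """
    return (shared / sketch_size) ** (1.0 / k)


def shared_hash_std(shared, sketch_size):
    """Standard deviation of the shared-hash count under an assumed binomial model."""
    p = shared / sketch_size
    return math.sqrt(sketch_size * p * (1.0 - p))


if __name__ == "__main__":
    # Sample B: top hit has 987/1000 shared hashes, the next two have 986/1000.
    for shared in (987, 986):
        print(shared, round(screen_identity(shared, 1000), 6))

    gap = screen_identity(987, 1000) - screen_identity(986, 1000)
    print("identity gap for one hash:", gap)
    print("approx. std of shared count:", shared_hash_std(987, 1000))
```

Under these assumptions, the expected fluctuation in the shared-hash count is a few hashes, so a gap of a single hash falls within the noise and probably should not be read as a taxonomic signal on its own.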

Thanks in advance

sheikki commented 8 months ago

In my experience, a k-mer size of 17 (-k 17) and a sketch size of 50000 (-s 50000) are enough to differentiate Salmonella serovars. The default sketch size of just 1000 certainly doesn't provide enough resolution for subspecies-level distinctions.
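A back-of-the-envelope way to see why the larger sketch helps is sketched below in Python. It assumes the identity is derived from the shared-hash fraction as fraction^(1/k), models the shared-hash count as binomial, and propagates that noise to the identity with a first-order (delta-method) approximation. The helper name and the example fraction of 0.987 are illustrative assumptions, and the effect of k on k-mer specificity is not modeled.

```python
import math


def identity_std_error(sketch_size, k, containment):
    """Approximate standard error of the identity estimate containment ** (1 / k).

    Assumes a binomial shared-hash count and a first-order (delta-method)
    propagation of that noise to the identity.
    """
    se_containment = math.sqrt(containment * (1.0 - containment) / sketch_size)
    slope = (1.0 / k) * containment ** (1.0 / k - 1.0)
    return slope * se_containment


if __name__ == "__main__":
    c = 0.987  # illustrative shared-hash fraction, similar to the output above
    default = identity_std_error(sketch_size=1000, k=21, containment=c)
    larger = identity_std_error(sketch_size=50000, k=17, containment=c)
    print("default sketch (-s 1000, k=21):", default)
    print("larger sketch (-s 50000 -k 17):", larger)
    print("ratio:", default / larger)
```

The noise on the shared fraction shrinks roughly with the square root of the sketch size, so under this model the larger sketch resolves identity differences several times smaller than the default can.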