phac-nml / sistr_cmd

SISTR (Salmonella In Silico Typing Resource) command-line tool
Apache License 2.0

which should I select? mash_serovar or serovar? #10

Closed JeanGuy777 closed 8 years ago

JeanGuy777 commented 8 years ago

Hi, first of all, your bioinformatics tool is working very well! However, I am wondering which column I should use as the final serovar result. In rare cases, the serovar in the "mash_serovar" column didn't match the serovar in the "serovar" column. Can I add a filter, perhaps based on the cgmlst_distance value, to flag isolates with a low level of confidence?

Thank you very much, A+

peterk87 commented 8 years ago

Hi Jean,

It's great to hear that the tool is working well for you!

I would recommend using the serovar column value, as it is an aggregate of the O/H-antigen prediction and the cgMLST (or Mash) prediction. If you run both the cgMLST and Mash predictions, only the cgMLST prediction is used to refine the antigen-based prediction.

The Mash prediction should be considered more experimental and is best suited to a very quick analysis (BLAST searching all cgMLST alleles against a Salmonella genome can take a while). I've also noticed that Mash is more sensitive to contaminant contigs in assemblies, which can result in fewer matching k-mer "sketches". But if your Mash distance is very close to 0 (>900 matching sketches), you should get results similar to those from cgMLST.
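To make the link between matching sketches and Mash distance concrete, here is a rough sketch of the standard Mash distance estimate. It assumes Mash's default sketch size of 1000 and k-mer size of 21; the settings sistr_cmd actually uses are not stated in this thread, and `mash_distance` is just an illustrative helper name.

```python
import math


def mash_distance(shared_hashes: int, sketch_size: int = 1000, k: int = 21) -> float:
    """Estimate the Mash distance from the number of shared sketch hashes.

    Uses the published Mash formula D = -(1/k) * ln(2j / (1 + j)),
    where j is the Jaccard estimate shared_hashes / sketch_size.
    The default sketch_size and k are Mash defaults (an assumption here).
    """
    if shared_hashes <= 0:
        return 1.0  # nothing shared: maximum distance
    if shared_hashes >= sketch_size:
        return 0.0  # identical sketches
    j = shared_hashes / sketch_size
    return min(1.0, -math.log(2 * j / (1 + j)) / k)


# Distance grows as fewer hashes are shared.
for shared in (1000, 950, 900, 500):
    print(shared, round(mash_distance(shared), 5))
```

Under those assumed defaults, 900 of 1000 shared hashes corresponds to a distance of roughly 0.0026, which is why ">900 matching sketches" and "very close to 0" describe the same situation.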

Yes, cgmlst_distance can definitely be used to gauge the level of confidence in a prediction. Internally, sistr_cmd uses a cgMLST distance threshold of 0.1 (90% similarity) to a set of reference genomes (currently 7511 curated public genomes) for confidently calling a serovar from cgMLST data. Although you could get an accurate serovar prediction with a higher distance threshold (e.g. 0.3, or 70% similarity), that threshold may not hold for all serovars, especially closely related ones (e.g. Dublin vs Enteritidis).

Also, a high cgMLST distance to the closest reference genome is acceptable if the antigen data is complete and the antigen-based serovar call is unambiguous.
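A filter along the lines you describe might look like the following minimal sketch. It assumes the results were written as CSV with the serovar, mash_serovar and cgmlst_distance columns discussed here; the `genome` column, the example rows and the `flag_low_confidence` helper are made up for illustration, and treating the 0.1 threshold as a strict "greater than" is also an assumption.

```python
import csv
import io

# Threshold mentioned above for confident cgMLST-based calls.
CGMLST_DISTANCE_THRESHOLD = 0.1


def flag_low_confidence(rows, max_distance=CGMLST_DISTANCE_THRESHOLD):
    """Return (genome, serovar, reasons) for rows that deserve a manual look."""
    flagged = []
    for row in rows:
        reasons = []
        if float(row["cgmlst_distance"]) > max_distance:
            reasons.append("cgmlst_distance above threshold")
        if row["serovar"] != row["mash_serovar"]:
            reasons.append("serovar and mash_serovar disagree")
        if reasons:
            flagged.append((row["genome"], row["serovar"], reasons))
    return flagged


# Made-up example rows, not real sistr_cmd output.
example_csv = """genome,serovar,mash_serovar,cgmlst_distance
isolate_A,Enteritidis,Enteritidis,0.003
isolate_B,Dublin,Enteritidis,0.021
isolate_C,Typhimurium,Typhimurium,0.350
"""

rows = list(csv.DictReader(io.StringIO(example_csv)))
for genome, serovar, reasons in flag_low_confidence(rows):
    print(genome, serovar, "; ".join(reasons))
```

Flagged isolates are candidates for review rather than rejection: a high distance can still be fine when the antigen-based call is complete and unambiguous.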

JeanGuy777 commented 8 years ago

Hi Peter,

Thank you for taking the time to respond so quickly.

A+

Jean-Guillaume Emond-R, M.Sc. Research Professional, Institut de Biologie Intégrative et des Systèmes, Université Laval