marbl / Mash

Fast genome and metagenome distance estimation using MinHash
mash.readthedocs.org

Question about ambiguous k-mers with ambiguous nucleotides #46

Open swamidass opened 7 years ago

swamidass commented 7 years ago

Are k-mers with ambiguous nucleotides (e.g. N) included in the sketch or are they thrown out?

I would imagine the best strategy is to have Mash filter these k-mers out. I suppose it could instead be handled during input processing, by breaking FASTA sequences into multiple sequences at every ambiguous nucleotide, but that does not seem ideal.

Thanks.

ondovb commented 7 years ago

They are indeed thrown out; by default only k-mers with ACGT are used.
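To illustrate the behaviour described above, here is a minimal Python sketch of dropping k-mers that contain anything other than A, C, G, or T. It is not Mash's actual code: the function name `filter_kmers`, the upper-casing step, and the dropped-k-mer counter are assumptions made for illustration only.

```python
VALID = frozenset("ACGT")

def filter_kmers(seq, k):
    """Return (kept, dropped): the k-mers made only of A/C/G/T, and the
    number of k-mers skipped because they contain another character."""
    seq = seq.upper()  # assumption: treat lower-case bases like upper-case ones
    kept = []
    dropped = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if VALID.issuperset(kmer):
            kept.append(kmer)
        else:
            dropped += 1  # any window overlapping an ambiguous base is skipped
    return kept, dropped

kept, dropped = filter_kmers("ACGTNACGTA", 4)
print(kept, dropped)  # ['ACGT', 'ACGT', 'CGTA'] 4
```

Note that a single N removes up to k k-mers (every window overlapping it), and the k-mers that remain are the same ones you would get by splitting the sequence at each ambiguous character first.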

swamidass commented 7 years ago

Thanks for the quick reply. It sounds like this is handled correctly. My only complaint is that it is not documented clearly here or in the paper. Perhaps it could be noted in the help text or documentation. Even more obvious to the user would be to report the number of dropped k-mers along with the info output.

MKLau commented 6 years ago

A quick note on this. I also had this question upon reading the paper. I found http://mash.readthedocs.io/en/latest/sketches.html#strand-and-alphabet, though it still left me with the question of how gaps and ambiguous characters are handled. My recommendation would be to add a http://mash.readthedocs.io/en/latest/sketches.html#ambiguous-characters section directly after #strand-and-alphabet.

Thanks for all your work on this by the way! This is a great tool.