marbl / Mash

Fast genome and metagenome distance estimation using MinHash
mash.readthedocs.org

How would you increase the maximum value for k? #69

Open ManoshiDatta opened 6 years ago

ManoshiDatta commented 6 years ago

Hello! I'm enjoying using Mash to characterize population structure for my set of bacterial genomes. Thank you for making this software!

As it turns out, I have some very closely related strains for which higher values of k (e.g., k = 50) might be helpful. However, it seems like the current maximum is k = 32. How would you modify the source code to allow for larger k values?

Thanks!

ondovb commented 6 years ago

The limit is imposed only by argument checking; specifically, see line 168 of Command.cpp. But note that the current limit of 32 comes from the fact that at most 64 bits of the hash will be used: a nucleotide 32-mer has 4^32 = 2^64 possible values, which is exactly as many as 64 bits can distinguish, so longer k-mers could have hash collisions. This isn't necessarily a problem, but just be aware that we haven't really tested with longer k-mers.
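To put rough numbers on the collision concern, here is a minimal Python sketch of the arithmetic. It is not Mash code: the function names are hypothetical, and it assumes a 4-letter nucleotide alphabet and a hash that behaves like a uniform random 64-bit function, which is an idealization.

```python
# Back-of-the-envelope arithmetic only; none of this is taken from Mash.
# Assumptions: 4-letter alphabet, idealized uniform 64-bit hash.

def kmer_space_fits(k: int, hash_bits: int = 64, alphabet_size: int = 4) -> bool:
    """True if the number of possible k-mers does not exceed the number of
    distinct hash values, so a collision-free mapping is at least possible."""
    return alphabet_size ** k <= 2 ** hash_bits


def expected_collisions(n_kmers: int, hash_bits: int = 64) -> float:
    """Birthday-style estimate of colliding pairs among n distinct k-mers:
    n*(n-1)/2 pairs, each colliding with probability 2**-hash_bits."""
    return n_kmers * (n_kmers - 1) / 2 / 2 ** hash_bits


if __name__ == "__main__":
    for k in (21, 32, 33, 50):
        print(f"k={k}: k-mer space fits in 64 bits? {kmer_space_fits(k)}")
    # Hypothetical genome with about 5 million distinct k-mers.
    print(f"expected colliding pairs: {expected_collisions(5_000_000):.2e}")
```

Under these assumptions, k = 32 is the largest value for which the k-mer space still fits in 64 bits. For a bacterial-sized genome the expected number of colliding pairs at larger k stays far below one, which is consistent with the remark that collisions are not necessarily a problem.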