marbl / Mash

Fast genome and metagenome distance estimation using MinHash
mash.readthedocs.org

Ignore over-occurring kmers? #97

Open tseemann opened 5 years ago

tseemann commented 5 years ago

Would an option to ignore over-occurring kmers make mash more robust against large repeat families and multi-copy plasmids?

mash already estimates coverage in -r mode and uses -m as a minimum kmer frequency, so maybe 2*est_cov would be a good maximum frequency?

e.g. a new option such as -M 2 would ignore kmers with freq > 2*est_cov
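A minimal Python sketch of the proposed filter, to make the idea concrete. This is not Mash's code: `count_kmers`, `filter_kmers`, `min_freq` and `max_mult` are hypothetical names, and `est_cov` is approximated here by the median count of the kmers that pass the minimum-frequency filter, which may differ from how mash estimates coverage. Canonical (reverse-complement) kmers are ignored for brevity.

```python
from collections import Counter
from statistics import median


def count_kmers(reads, k):
    """Count every k-mer (forward strand only) across a list of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_kmers(counts, min_freq=2, max_mult=2.0):
    """Keep k-mers with min_freq <= count <= max_mult * est_cov.

    min_freq plays the role of -m; max_mult plays the role of the
    proposed (hypothetical) -M option.
    """
    # Drop likely sequencing errors first, as a minimum-frequency filter would.
    kept = {kmer: c for kmer, c in counts.items() if c >= min_freq}
    if not kept:
        return {}
    # ASSUMPTION: coverage is approximated by the median surviving count.
    est_cov = median(kept.values())
    max_freq = max_mult * est_cov
    # Drop over-occurring k-mers (repeat families, multi-copy plasmids).
    return {kmer: c for kmer, c in kept.items() if c <= max_freq}
```

With counts of 1, 10, 11, 9 and 50, the kmer seen once is dropped by `min_freq`, the estimated coverage is 10.5, and the kmer seen 50 times is dropped because it exceeds 2 * 10.5 = 21.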

tseemann commented 5 years ago

I've just realised Finch does something like this already https://github.com/onecodex/finch-rs