marbl / Mash

Fast genome and metagenome distance estimation using MinHash
mash.readthedocs.org

Question: Custom parameters for long and erroneous ONT reads #95

Open RxLoutre opened 6 years ago

RxLoutre commented 6 years ago

Hello there,

I've been trying to understand the principle of the MinHash algorithm. Although it is not the field I am used to, I understand that Mash uses MinHash to compress a sequence into a set of representative k-mers (these are called shingles, right?). That set of representative k-mers is then encoded as 32- or 64-bit values, in a format that doesn't take much memory but still allows comparison. Am I right so far?
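To make the idea in the paragraph above concrete, here is a minimal Python sketch of a bottom-s MinHash comparison. It is an illustration under stated assumptions, not Mash's implementation: the function names (`kmers`, `hash64`, `sketch`, `jaccard_estimate`) are hypothetical, `blake2b` stands in for the hash function Mash actually uses, and canonical (strand-independent) k-mers are left out for brevity.

```python
import hashlib

def kmers(seq, k):
    """Yield every k-mer (substring of length k) of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash64(kmer):
    """Map a k-mer to a 64-bit integer (assumption: any well-mixed hash will do)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k=21, s=1000):
    """Bottom-s sketch: the s smallest distinct k-mer hashes, in sorted order."""
    return sorted({hash64(km) for km in kmers(seq, k)})[:s]

def jaccard_estimate(sketch_a, sketch_b, s=1000):
    """Estimate the Jaccard index of two k-mer sets from their sketches.

    Take the s smallest hashes of the merged sketches and count how many
    of them occur in both sketches.
    """
    merged = sorted(set(sketch_a) | set(sketch_b))[:s]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

Calling `jaccard_estimate(sketch(x), sketch(y))` on two sequences returns a value between 0 and 1. Only the stored hashes are compared, never the k-mers themselves, which is why a sketch stays small regardless of the input size.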

I'm working with Oxford Nanopore reads, which can have a high but evenly distributed error rate (~15%), mostly insertions/deletions in homopolymers. I'm trying to use Mash to check the content of my sketched raw reads, for now against the sketched RefSeq database I found here: https://mash.readthedocs.io/en/latest/data.html

I ran several tests, each time with species that are present in RefSeq, but Mash doesn't seem to work well and doesn't even show the right species among the first hits. For example, for data from a fish species, my first 150 results were mammal species.

I suspect the high error rate in my reads makes it really hard to get a correct estimate, and I would like to solve that problem. I'm already trying a mini-assembly of my reads (using minimap2/miniasm) to produce a kind of consensus to give to Mash, but this adds a really long step to the analysis without greatly improving the accuracy of the results.

I think there might be a way to play with the k parameter, and maybe some other parameters I'm not thinking of, while sketching with Mash. I would like to know if anyone has suggestions for parameters I could try that suit long, error-prone ONT reads.
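As a rough illustration of why k matters here, the following Python sketch computes the chance that a k-mer drawn from a read contains no error at all. It assumes errors hit each base independently at the same rate, which matches the "evenly distributed" description above but is still a simplification; the function name `error_free_fraction` is hypothetical.

```python
def error_free_fraction(k, error_rate):
    """Probability that all k bases of a k-mer are correct,
    assuming independent per-base errors at a uniform rate."""
    return (1.0 - error_rate) ** k

# Compare a few k values at the ~15% error rate mentioned above.
for k in (12, 16, 21):
    print(k, round(error_free_fraction(k, 0.15), 4))
```

Under this model, only a small fraction of k-mers survive intact at Mash's default k of 21, and that fraction grows as k shrinks. The trade-off is that shorter k-mers are more likely to match unrelated genomes by chance, especially large ones, so lowering k is not a free improvement.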

Thanks in advance for your advice!

Cheers,

Rox

ondovb commented 6 years ago

I would suggest -m 2, if you're not already using it, to filter out some of the read errors.
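The idea behind a minimum-copies filter such as -m 2 can be sketched in Python as follows. This is only an illustration of the principle (keep k-mers seen at least twice across the reads, on the assumption that a sequencing error rarely produces the same wrong k-mer twice); the function name `solid_kmers` is hypothetical and Mash's internal filtering may work differently.

```python
from collections import Counter

def solid_kmers(reads, k, min_copies=2):
    """Return the set of k-mers that occur at least min_copies times
    across all reads; rarer k-mers are treated as likely errors."""
    counts = Counter(
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    )
    return {kmer for kmer, n in counts.items() if n >= min_copies}

# Two identical reads and one read with a single substitution (T -> A).
reads = ["ACGTAC", "ACGTAC", "ACGAAC"]
print(sorted(solid_kmers(reads, k=4)))
```

In this toy input, the k-mers that overlap the substituted base appear only once and are dropped, while the k-mers supported by two reads are kept. The filter can only help when coverage is deep enough for true k-mers to recur.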