marbl / Mash

Fast genome and metagenome distance estimation using MinHash
mash.readthedocs.org

Maximum copy number/high-pass filtering? #42

Closed by philippbayer 8 years ago

philippbayer commented 8 years ago

I've been using Mash with great success with plant read data, but I'm running into one weirdness. Some of my cultivars cluster in a way they shouldn't and I believe this is due to chloroplast differences.

Mash lets you specify a minimum copy number to get rid of sequencing errors (-m, which implies read input, -r), but it doesn't let you specify a maximum copy number. With plastids I'd remove every k-mer with more than 50 or 70 copies. I can do that beforehand with, for example, bbnorm from BBMap, but it would be easier to do it directly in Mash.

I'm unsure whether plastid copy numbers influence the -c cutoff so I'd rather filter first. Are there any plans to implement a maximum copy filter?

aphillippy commented 8 years ago

Hi Philipp, I can't think of a way to both build the sketch and filter high-copy k-mers in a single pass, the way Mash currently does for low-copy k-mers. I think it would require a separate pass to (approximately) count k-mers before building the sketch. If we did this in the future, we could implement a weighted-MinHash scheme like the one we use in the Canu overlapper to down-weight repetitive k-mers.
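To make the two-pass idea concrete, here is a minimal Python sketch of "count first, then sketch". It is not how Mash works and not a proposed patch. It assumes exact counting with a `Counter` (a real tool would count approximately), uses BLAKE2b as a stand-in hash, and skips canonical k-mers for brevity. The names `filtered_sketch`, `min_copies` and `max_copies` are hypothetical.

```python
import hashlib
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer of seq (no reverse-complement canonicalisation)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash64(kmer):
    """Stable 64-bit hash of a k-mer (assumption: a stand-in, not Mash's hash)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def filtered_sketch(reads, k=21, sketch_size=1000, min_copies=2, max_copies=50):
    """Two-pass bottom-k MinHash sketch restricted to a copy-number band."""
    # Pass 1: count every k-mer across all reads (exact counts, for clarity).
    counts = Counter(km for read in reads for km in kmers(read, k))
    # Pass 2: hash only k-mers whose count lies in [min_copies, max_copies],
    # then keep the sketch_size smallest hashes.
    kept = (hash64(km) for km, c in counts.items()
            if min_copies <= c <= max_copies)
    return sorted(kept)[:sketch_size]


def jaccard_estimate(a, b, sketch_size):
    """Estimate Jaccard similarity from two bottom-k sketches."""
    merged = sorted(set(a) | set(b))[:sketch_size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)
```

With `max_copies=50`, a k-mer seen hundreds of times (as plastid k-mers typically are in whole-genome read sets) never enters the sketch, while `min_copies=2` still drops singletons that are likely sequencing errors. The cost is the extra counting pass and the memory for the count table, which is the trade-off discussed above.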

Though I think the ultimate solution to your problem will be the implementation of a core-genome distance that only considers the "shared" sequences between two genomes. This would more closely approximate a core-genome SNP distance, and would not be affected by differences in plastids/plasmids/mobile elements/whatever. There are tentative plans to work on such an extension, but it's a ways off.

Best, -Adam

philippbayer commented 8 years ago

Hi Adam, thank you very much for this detailed answer! I agree that a second pass isn't ideal, since the speed of Mash is what makes it so useful. As a side note, I've also had some nice results with flow-cytometry-sorted wheat chromosome arms (https://www.ncbi.nlm.nih.gov/bioproject/PRJEB3955). The final distance matrix easily distinguishes libraries from different chromosome arms, and libraries from the same arm cluster together. Something to maybe think about.