Closed by philippbayer 8 years ago
Hi Philipp, I can't think of a way to both build the sketch and filter high-copy mers in a single pass like Mash currently does for low-copy mers. I think it would require a separate pass to (approximately) count mers before building the sketch. If we did this in the future, we could implement a weighted-minhash scheme like we do in the Canu overlapper to down-weight repetitive mers.
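To make the two-pass idea above concrete, here is a minimal Python sketch. It is not Mash's implementation: the function names, the `min_copies`/`max_copies` parameters, and the use of exact counting with BLAKE2b hashes are all assumptions chosen for illustration (a real tool would likely count approximately, use canonical k-mers, and use a faster hash). The first pass counts k-mers; the second pass keeps only k-mers whose count lies within the allowed range and retains the smallest hashes as a bottom-s sketch.

```python
import hashlib
import heapq
from collections import Counter


def kmers(seq, k):
    """Yield every k-mer in seq (no reverse-complement canonicalization)."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Stable 64-bit hash of a k-mer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def filtered_sketch(reads, k, sketch_size, min_copies=2, max_copies=50):
    """Hypothetical two-pass bottom-s sketch with min and max copy filters."""
    # Pass 1: count every k-mer exactly (an approximate counter would also work).
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))

    # Pass 2: re-read the input and keep hashes of k-mers within the copy range.
    kept = set()
    for read in reads:
        for mer in kmers(read, k):
            if min_copies <= counts[mer] <= max_copies:
                kept.add(hash_kmer(mer))

    # The sketch is the sketch_size smallest hashes, in ascending order.
    return heapq.nsmallest(sketch_size, kept)


if __name__ == "__main__":
    reads = ["AAAAAAAAAA"] * 10 + ["ACGTT"] * 3 + ["GGGCC"]
    # "AAAA" (70 copies) and the single-copy k-mers are filtered out.
    print(filtered_sketch(reads, k=4, sketch_size=10))
```

The extra pass over the input is exactly the cost discussed above: the counts must be complete before any k-mer can be accepted or rejected, so the filter cannot be folded into a single streaming pass the way a minimum-copy threshold can.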
Though I think the ultimate solution to your problem will be the implementation of a core-genome distance that only considers the "shared" sequences between two genomes. This would more closely approximate a core-genome SNP distance, and would not be affected by differences in plastids/plasmids/mobile elements/whatever. There are tentative plans to work on such an extension, but it's a ways off.
Best, -Adam
Hi Adam, thank you very much for this detailed answer! I agree that a second pass isn't ideal, since the speed of Mash is what makes it so useful. As a side note, I've also had some nice results with flow-cytometry-sorted wheat chromosome arms (https://www.ncbi.nlm.nih.gov/bioproject/PRJEB3955): the final distance matrix easily distinguishes libraries from different chromosome arms, and libraries from the same arm cluster together. Something to maybe think about.
I've been using Mash with great success with plant read data, but I'm running into one weirdness. Some of my cultivars cluster in a way they shouldn't and I believe this is due to chloroplast differences.
Mash lets you specify a minimum copy number to get rid of sequencing errors (-m, for read input with -r), but it doesn't let you specify a maximum copy number. With plastids I'd remove everything with more than 50 or 70 copies. I can do that beforehand using, for example, BBMap's BBNorm, but it would just be easier to do it directly in Mash.
I'm unsure whether plastid copy numbers influence the -c cutoff so I'd rather filter first. Are there any plans to implement a maximum copy filter?