mikemc opened 6 days ago
Thoughts:
My main sticking point is that I don't think it's trivial or super-convenient to run a robust deduping protocol (e.g. something equivalent to clumpify) on the output TSV. We do now keep the quality scores in the TSV, so it would be possible to regenerate the FASTQ, run Clumpify or some other command line tool on that, and then use the output to filter the TSV to produce a deduplicated version, but this is much less convenient than running Clumpify in-pipeline. I'm not excited about switching to a dumber dedup approach (e.g. exact match on the first N bases of the read).
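To make the out-of-pipeline workaround concrete, here's a minimal sketch of the TSV → FASTQ → filtered-TSV round trip. The column names (`seq_id`, `query_seq`, `query_qual`) are assumptions, not the actual HV table schema, and the Clumpify invocation between the two steps is only indicated in a comment:

```python
import csv

# Assumed column names; the real HV TSV schema may differ.
SEQ_ID, SEQ, QUAL = "seq_id", "query_seq", "query_qual"

def tsv_to_fastq(tsv_lines):
    """Regenerate FASTQ records from the HV TSV (possible because the
    TSV now retains per-read quality strings)."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    records = []
    for row in reader:
        records.append(f"@{row[SEQ_ID]}\n{row[SEQ]}\n+\n{row[QUAL]}\n")
    return "".join(records)

# Between these two steps, run the external dedup tool, e.g. something like:
#   clumpify.sh in=hv_reads.fastq out=hv_reads_deduped.fastq dedupe
# then collect the read IDs surviving in hv_reads_deduped.fastq as `keep_ids`.

def filter_tsv_by_ids(tsv_lines, keep_ids):
    """Keep only TSV rows whose read IDs survived deduplication."""
    lines = list(tsv_lines)
    header = lines[0]
    idx = header.rstrip("\n").split("\t").index(SEQ_ID)
    kept = [header]
    for line in lines[1:]:
        if line.rstrip("\n").split("\t")[idx] in keep_ids:
            kept.append(line)
    return kept
```

Even written out like this, it's three manual steps with an external tool in the middle, which is the inconvenience being described above.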
My suggestion for addressing this concern is to output all reads in the HV table, with one or more columns indicating whether each read is an 'original' or a 'duplicate' according to whatever the preferred dedup approach is. For example, the pipeline currently uses Clumpify to determine which reads are originals vs. duplicates and throws out the duplicates before Kraken2. Instead, we could run all reads through Kraken2, do the Bowtie2/Kraken2 comparison and final HV assignment on all reads, separately determine which reads are duplicates via the current Clumpify method, and then mark those reads as duplicates in the table. Downstream, you can then just filter on this column if you want deduplicated results.
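The mark-then-filter idea above can be sketched in a few lines. This is a hypothetical illustration, not the pipeline's implementation: the `seq_id` and `duplicate` column names are assumptions, and `original_ids` stands for the set of read IDs that Clumpify (or any other dedup method) kept:

```python
def mark_duplicates(rows, original_ids, id_col="seq_id"):
    """Annotate every row with duplicate status instead of dropping rows.
    `original_ids` is the set of read IDs the dedup method retained;
    everything else is flagged as a duplicate."""
    return [{**row, "duplicate": row[id_col] not in original_ids} for row in rows]

def drop_duplicates(marked_rows):
    """Downstream filtering step: keep only the 'original' reads."""
    return [row for row in marked_rows if not row["duplicate"]]
```

Since the full table keeps every read, a user who prefers a different dedup method (or no dedup at all) can simply ignore the column or recompute it, which is the point of the feature request.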
In other words, having one or more deduplication approaches built into the pipeline is compatible with my feature request, if we're ok running HV assignment on non-deduped reads.
I think we're unlikely to find a single "best" deduplication method any time soon. I would find it very helpful to run all the reads passing the initial bbduk screen through the full HV workflow and output all the reads in the HV table w/o deduplicating. The benefits include
I think the additional computational cost would be minimal: we're already running all reads (pre-dedup) through Bowtie2, so we'd just also need to run all reads through Kraken2 and the Bowtie2+Kraken2 combination step.