mikemc opened 6 days ago
Thoughts:
My main sticking point is that I don't think it's trivial or super-convenient to run a robust deduping protocol (e.g. something equivalent to clumpify) on the output TSV. We do now keep the quality scores in the TSV, so it would be possible to regenerate the FASTQ, run Clumpify or some other command line tool on that, and then use the output to filter the TSV to produce a deduplicated version, but this is much less convenient than running Clumpify in-pipeline. I'm not excited about switching to a dumber dedup approach (e.g. exact match on the first N bases of the read).
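To make the out-of-pipeline workaround concrete, here's a minimal sketch of the TSV → FASTQ → filtered-TSV round trip. The column names (`seq_id`, `query_seq`, `query_qual`) are assumptions, not the actual HV table schema, and the Clumpify invocation between the two steps is only indicated in a comment:

```python
import csv

# Assumed column names; the real HV TSV schema may differ.
SEQ_ID, SEQ, QUAL = "seq_id", "query_seq", "query_qual"

def tsv_to_fastq(tsv_lines):
    """Regenerate FASTQ records from the HV TSV (possible because the
    TSV now retains per-read quality strings)."""
    reader = csv.DictReader(tsv_lines, delimiter="\t")
    records = []
    for row in reader:
        records.append(f"@{row[SEQ_ID]}\n{row[SEQ]}\n+\n{row[QUAL]}\n")
    return "".join(records)

# Between these two steps, run the external dedup tool, e.g. something like:
#   clumpify.sh in=hv_reads.fastq out=hv_reads_deduped.fastq dedupe
# then collect the read IDs surviving in hv_reads_deduped.fastq as `keep_ids`.

def filter_tsv_by_ids(tsv_lines, keep_ids):
    """Keep only TSV rows whose read IDs survived deduplication."""
    lines = list(tsv_lines)
    header = lines[0]
    idx = header.rstrip("\n").split("\t").index(SEQ_ID)
    kept = [header]
    for line in lines[1:]:
        if line.rstrip("\n").split("\t")[idx] in keep_ids:
            kept.append(line)
    return kept
```

Even written out like this, it's three manual steps with an external tool in the middle, which is the inconvenience being described above.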
My suggestion for addressing this concern is to output all reads in the HV table, with one or more columns indicating whether each read is an 'original' or a 'duplicate' according to whatever the preferred dedup approach is. For example, the pipeline currently uses Clumpify to determine which reads are originals vs. duplicates and throws out the duplicates before Kraken2. Instead, we could run all reads through Kraken2, do the Bowtie2/Kraken2 comparison and final HV assignment on all reads, separately determine which reads are duplicates via the current Clumpify method, and then mark those reads as duplicates in the table. Downstream, you can then just filter on this column if you want deduplicated results.
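The mark-then-filter idea above can be sketched in a few lines. This is a hypothetical illustration, not the pipeline's implementation: the `seq_id` and `duplicate` column names are assumptions, and `original_ids` stands for the set of read IDs that Clumpify (or any other dedup method) kept:

```python
def mark_duplicates(rows, original_ids, id_col="seq_id"):
    """Annotate every row with duplicate status instead of dropping rows.
    `original_ids` is the set of read IDs the dedup method retained;
    everything else is flagged as a duplicate."""
    return [{**row, "duplicate": row[id_col] not in original_ids} for row in rows]

def drop_duplicates(marked_rows):
    """Downstream filtering step: keep only the 'original' reads."""
    return [row for row in marked_rows if not row["duplicate"]]
```

Since the full table keeps every read, a user who prefers a different dedup method (or no dedup at all) can simply ignore the column or recompute it, which is the point of the feature request.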
In other words, having one or more deduplication approaches built into the pipeline is compatible with my feature request, if we're ok running HV assignment on non-deduped reads.
I think we're unlikely to find a single "best" deduplication method any time soon. I would find it very helpful to run all the reads passing the initial bbduk screen through the full HV workflow and output all the reads in the HV table w/o deduplicating. The benefits include
I think the additional computational cost would be minimal: we're already running all reads (pre-dedup) through Bowtie2, so we'd just also need to run all reads through Kraken2 and the Bowtie2+Kraken2 combination step.