naobservatory / mgs-workflow

Outputting all pre-deduplicated reads in the HV workflow #76

Open

mikemc opened this issue 6 days ago

I think we're unlikely to find a single "best" deduplication method any time soon. I would find it very helpful to run all the reads passing the initial bbduk screen through the full HV workflow and to output all of those reads in the HV table without deduplicating. The benefits include:

  1. users can deduplicate (or not) however they want; this will be fast and can often be done locally, since it's a small number of reads
  2. it then becomes very easy to evaluate different dedup methods against each other on HV reads without needing to re-run the pipeline
  3. for use cases where it's useful to have deduplication directly in the HV workflow, we could just add the duplicate assignments as columns in the HV table, making it easy for downstream code to filter out the flagged duplicates, as sketched below.
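
For concreteness, the downstream filtering in point 3 could then be as simple as the following sketch (file and column names are made up for illustration; I'm assuming a boolean duplicate column):

```python
import pandas as pd

# Hypothetical file and column names, for illustration only; assumes the
# HV table carries a boolean column flagging duplicate reads.
hv = pd.read_csv("hv_hits.tsv", sep="\t")

# Drop the flagged duplicates, keeping only the "original" reads.
hv_dedup = hv[~hv["clumpify_duplicate"]]
hv_dedup.to_csv("hv_hits_dedup.tsv", sep="\t", index=False)
```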

I think the additional computational cost would be minimal: we're already running all reads (pre-dedup) through bowtie2, so I think we'd just need to also run all reads through kraken2 and the bowtie2+kraken2 combination step.

willbradshaw commented 3 days ago

Thoughts:

mikemc commented 3 days ago

My main sticking point is that I don't think it's trivial or super-convenient to run a robust deduping protocol (e.g. something equivalent to Clumpify) on the output TSV. We do now keep the quality scores in the TSV, so it would be possible to regenerate the FASTQ, run Clumpify or some other command-line tool on it, and then use the output to filter the TSV down to a deduplicated version; but this is much less convenient than running Clumpify in-pipeline. I'm also not excited about switching to a dumber dedup approach (e.g. an exact match on the first N bases of the read).
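
Concretely, the out-of-pipeline roundtrip would look something like the sketch below. The TSV column names (`seq_id`, `query_seq`, `query_qual`) and file names are assumptions, not the pipeline's actual schema; the `in=`/`out=`/`dedupe` arguments are standard `clumpify.sh` usage:

```python
import subprocess
import pandas as pd

# Assumed schema: one read per row, with ID, sequence, and quality columns.
hv = pd.read_csv("hv_hits.tsv", sep="\t")

# 1. Regenerate a FASTQ from the sequences and quality scores in the TSV.
with open("hv_reads.fastq", "w") as f:
    for row in hv.itertuples():
        f.write(f"@{row.seq_id}\n{row.query_seq}\n+\n{row.query_qual}\n")

# 2. Deduplicate with Clumpify (BBTools).
subprocess.run(
    ["clumpify.sh", "in=hv_reads.fastq", "out=hv_reads_dedup.fastq", "dedupe"],
    check=True,
)

# 3. Collect the read IDs that survived and filter the TSV down to them.
kept = set()
with open("hv_reads_dedup.fastq") as f:
    for i, line in enumerate(f):
        if i % 4 == 0:  # FASTQ header lines
            kept.add(line[1:].strip().split()[0])

hv[hv["seq_id"].isin(kept)].to_csv("hv_hits_dedup.tsv", sep="\t", index=False)
```

(For paired-end reads this gets longer still, since `clumpify.sh` would need `in1=`/`in2=`/`out1=`/`out2=` and both mates would have to be tracked.)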

My suggestion for addressing this concern is that we output all reads in the HV table, with columns indicating whether each read is 'original' or 'duplicate' according to whatever the preferred dedup approach is. For example, the pipeline currently uses Clumpify to determine which reads are originals vs. duplicates, and it throws out the duplicates before Kraken. Instead, you could run all reads through Kraken, do the Bowtie-Kraken comparison and final HV assignment on all reads, separately determine which reads are duplicates according to the current Clumpify method, and then mark those reads as duplicates in the table. Downstream, you can then just filter on this column if you want.
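
As a sketch, annotating the table might look something like this, assuming we already have the deduplicated FASTQ that Clumpify produced (e.g. from the roundtrip above); file and column names are again made up:

```python
import pandas as pd

# Illustrative names throughout; not the pipeline's actual schema.
hv = pd.read_csv("hv_hits_all_reads.tsv", sep="\t")

# Read IDs that Clumpify kept, parsed from its deduped FASTQ headers.
with open("hv_reads_dedup.fastq") as f:
    kept = {line[1:].strip().split()[0]
            for i, line in enumerate(f) if i % 4 == 0}

# Flag each read as a duplicate unless Clumpify retained it as the original.
hv["clumpify_duplicate"] = ~hv["seq_id"].isin(kept)
hv.to_csv("hv_hits_annotated.tsv", sep="\t", index=False)
```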

In other words, having one or more deduplication approaches built into the pipeline is compatible with my feature request, if we're ok running HV assignment on non-deduped reads.