naobservatory / mgs-workflow


Handling reverse-complement duplicates in paired reads #30

Open mikemc opened 1 week ago

mikemc commented 1 week ago

Starting an issue to keep track of this limitation in our current implementations of deduplication and duplicate statistics.

Currently the CLUMPIFY_PAIRED process carries the comment "NB: Will NOT handle reverse-complement duplicates". I also believe that duplication statistics are currently generated by FASTQC, which likewise does not handle reverse-complement duplicates.
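To make the limitation concrete, here is a minimal sketch of what a reverse-complement duplicate looks like in paired data. When the same fragment is sequenced from the opposite strand, the forward and reverse reads of the pair come out swapped, so a comparison that only checks (forward, reverse) in order misses it. The function names `pair_key` and `count_duplicate_pairs` are hypothetical, the sketch assumes exact matches and equal read lengths, and it is not how Clumpify or FASTQC work internally.

```python
def pair_key(fwd, rev):
    """Orientation-independent key for a read pair (hypothetical helper).

    A pair sequenced from the opposite strand of the same fragment shows up
    with its forward and reverse reads swapped, so both orderings are mapped
    to the same key by taking the lexicographically smaller tuple.
    """
    return min((fwd, rev), (rev, fwd))


def count_duplicate_pairs(pairs):
    """Count pairs whose key has already been seen (exact matches only)."""
    seen = set()
    duplicates = 0
    for fwd, rev in pairs:
        key = pair_key(fwd, rev)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates


example_pairs = [
    ("ACGT", "TTGA"),  # original orientation
    ("TTGA", "ACGT"),  # same fragment, opposite strand: reads swapped
    ("ACGT", "CCCC"),  # unrelated pair
]
n_dups = count_duplicate_pairs(example_pairs)
```

An order-sensitive comparison would report zero duplicates for `example_pairs`; the swapped-pair key reports one.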

mikemc commented 1 week ago

After reading the new README, I realize this may be effectively handled by the second pass of clumpify following merging (at least in the taxonomic profiling workflow).

willbradshaw commented 6 days ago

Yeah, this is currently done in the taxonomy subworkflow (which also then gets passed to the second half of the hv subworkflow), as this is where paired reads get merged into single reads.
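As a rough illustration of why merging helps: once a pair is merged into a single read, a reverse-complement duplicate is just the reverse complement of the whole sequence, so comparing each read by a strand-independent canonical form catches it. The sketch below is only an illustration under the assumption of exact matches; `reverse_complement`, `canonical` and `dedup_merged` are hypothetical names, and this is not a description of Clumpify's internals.

```python
# Translation table for complementing bases; N is left as N.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq):
    """Strand-independent form: the smaller of the read and its reverse complement."""
    return min(seq, reverse_complement(seq))


def dedup_merged(seqs):
    """Keep the first occurrence of each read, treating RC copies as duplicates."""
    seen = set()
    kept = []
    for seq in seqs:
        key = canonical(seq)
        if key not in seen:
            seen.add(key)
            kept.append(seq)
    return kept


merged_reads = ["AACG", "CGTT", "AACC"]  # "CGTT" is the reverse complement of "AACG"
unique_reads = dedup_merged(merged_reads)
```

A small test set like `merged_reads`, with a known reverse-complement copy planted in it, is one way to check that a deduplication step behaves as expected.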

I'd like to test the behaviour of this step more extensively, to make absolutely sure that it's handling RC duplicates as we expect. Once that's verified, I'm open to copying this process over to other parts of the pipeline as & when it makes sense.