Allow Genrich to use reads marked as duplicate

jsh58 / Genrich

Detecting sites of genomic enrichment

MIT License


Allow Genrich to use reads marked as duplicate #66

Closed dariober closed 3 years ago

dariober commented 3 years ago

Hi- As mentioned in this issue

Genrich does not analyze alignments already marked as duplicates. It also skips supplementary and low quality alignments ("not passing filters").

Would it be possible to let the user decide whether reads marked as duplicate should be discarded? I have libraries sequenced at high depth with duplicates marked which I would like to keep. Thanks!

jsh58 commented 3 years ago

Thanks for the question. I do not think it makes sense to alter the Genrich code for this. If you want to modify the bitwise FLAGs in your BAM, that can be accomplished pretty easily with samtools and awk (or bioawk).

dariober commented 3 years ago

Ok, thanks for replying. It's your call of course. If anyone lands here with the same question, here's what I've done. Starting with a coordinate-sorted BAM file, sort by read name and remove the duplicate read flag:

# Needs gawk: and() is a gawk extension. 1024 (0x400) is the SAM duplicate flag.
samtools sort -n -@ 8 {input.bam} \
| samtools view -h \
| awk -v FS='\t' -v OFS='\t' '{if($1 !~ "^@" && and($2, 1024) == 1024) {
                                $2 = $2 - 1024
                              }
                              print $0}' \
| samtools view -@ 4 -b > {output.bam}
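
If gawk is not available, the same flag edit can probably be done with plain awk arithmetic instead of and(). The following is only a sketch and has not been tested against Genrich itself. The function name clear_dup_flag is made up for illustration. It reads SAM text on stdin and writes SAM text on stdout, so it could take the place of the awk step in the pipeline above.

```shell
# clear_dup_flag: hypothetical helper, not part of samtools or Genrich.
# Reads SAM text on stdin and writes SAM text on stdout.
clear_dup_flag() {
    awk -v FS='\t' -v OFS='\t' '
        # Header lines start with @ and pass through unchanged.
        /^@/ { print; next }
        {
            # int(FLAG / 1024) % 2 is 1 exactly when bit 1024 (0x400, duplicate) is set.
            if (int($2 / 1024) % 2 == 1) {
                $2 = $2 - 1024
            }
            print
        }'
}
```

Usage would look something like this, where in.bam and out.bam are placeholder file names:

samtools sort -n in.bam | samtools view -h - | clear_dup_flag | samtools view -b - > out.bam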

SunScript0 commented 2 years ago

I don't think it is a good design choice to remove duplicates without the user setting the -r option. It's very confusing behaviour if you are not expecting it, and it is easy to miss. I do think you should change it so that Genrich only removes PCR duplicates if -r is set, but at the very least this should be made clear in the description of -r. Something like: "Note that if reads have been previously marked as duplicates, Genrich will remove them even if -r is not set."