tgen / phoenix

Jetstream compatible workflow template supporting comprehensive analysis of human sequencing data against GRCh38
MIT License
17 stars 6 forks source link

Accurate Duplicate Read Count #48

Open PedalheadPHX opened 5 years ago

PedalheadPHX commented 5 years ago

https://github.com/tgen/phoenix/blob/25049309063e72949bed70ad8b9f8e6b41681832/modules/bam_qc.jst#L23

It would be nice if we could get the exact count of duplicate reads as a metric. It seems like samtools always reports the TOTAL read count in the BAM not the UNIQUE read count, so multi-mappers and supplementary reads get counted multiple time.

Notes for NMTRC_0002_2_AG_Whole_T1_KHSTX:

Source Reads Notes
Input FASTQ reads 512975558 expected total read count
Picard CollectAlignmentSummaryMetrics xxxx
PAIR - TOTAL_READS 512975558 matches expectation
Samtools flagstat xxxx
in total 514772770 + 0 counts all reads in bam not just unique
secondary 0 + 0
supplementary 1797212 + 0
duplicates 104680859 + 0
mapped 513884652 + 0
Samtools stats xxxx
raw total sequences 512975558 matches expectation
filtered sequences 0
sequences 512975558
reads mapped 512087440
reads duplicated 104680859
Samtools stats --remove-dups
raw total sequences 513288289 inconsistent does not match flagstat or stats when --remove-dups is not enabled
filtered sequences 104680859 same as duplicate reads in flagstat and stats when --remove-dups is not enabled
sequences 408607430
reads mapped 407945229
reads duplicated 0
Samtools markdup xxxx
READ 514772770 matches flagstat in total count
EXCLUDED 2685330
EXAMINED 512087440
PAIRED 511797328
SINGLE 290112
DULPICATE PAIR 103916294
DUPLICATE SINGLE 225917
DUPLICATE TOTAL 104142211 doesn't match others, unique?

What is the true number of mapped reads flagged as duplicates?

IDXstats Sum = 514772770 . # Matches flagstat awk '{print $3+$4}' NMTRC_0002_2_AG_Whole_T1_KHSTX.bwa.bam.idxstats.txt | awk '{sum+=$1} END {print sum}'

samtools view -@ 4 -f 0x400 -F 0x80 NMTRC_0002_2_AG_Whole_T1_KHSTX.bwa.bam | cut -f1 | sort | uniq | wc -l 52184064

ryanrichholt commented 5 years ago

Would something as simple as piping the bam through a view command with filters setup to drop the reads we don’t want to count?

On Sat, Mar 23, 2019 at 12:09 PM Jonathan Keats notifications@github.com wrote:

https://github.com/tgen/phoenix/blob/25049309063e72949bed70ad8b9f8e6b41681832/modules/bam_qc.jst#L23

It would be nice if we could get the exact count of duplicate reads as a metric. It seems like samtools always reports the TOTAL read count in the BAM not the UNIQUE read count, so multi-mappers and supplementary reads get counted multiple time.

Notes for NMTRC_0002_2_AG_Whole_T1_KHSTX: Source Reads Notes Input FASTQ reads 512975558 expected total read count Picard CollectAlignmentSummaryMetrics xxxx PAIR - TOTAL_READS 512975558 matches expectation Samtools flagstat xxxx in total 514772770 + 0 counts all reads in bam not just unique secondary 0 + 0 supplementary 1797212 + 0 duplicates 104680859 + 0 mapped 513884652 + 0 Samtools stats xxxx raw total sequences 512975558 matches expectation filtered sequences 0 sequences 512975558 reads mapped 512087440 reads duplicated 104680859 Samtools stats --remove-dups raw total sequences 513288289 inconsistent does not match flagstat or stats when --remove-dups is not enabled filtered sequences 104680859 same as duplicate reads in flagstat and stats when --remove-dups is not enabled sequences 408607430 reads mapped 407945229 reads duplicated 0 Samtools markdup xxxx READ 514772770 matches flagstat in total count EXCLUDED 2685330 EXAMINED 512087440 PAIRED 511797328 SINGLE 290112 DULPICATE PAIR 103916294 DUPLICATE SINGLE 225917 DUPLICATE TOTAL 104142211 doesn't match others, unique?

What is the true number of mapped reads flagged as duplicates?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/tgen/phoenix/issues/48, or mute the thread https://github.com/notifications/unsubscribe-auth/ANsPiAYUT4lcKBrqB3QAWTkDzRj1b-91ks5vZnvegaJpZM4cFF5i .

--

This electronic message is intended to be for the use only of the named recipient, and may contain information that is confidential or privileged, including patient health information. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution or use of the contents of this message is strictly prohibited. If you have received this message in error or are not the named recipient, please notify us immediately by contacting the sender at the electronic mail address noted above, and delete and destroy all copies of this message. Thank you.

PedalheadPHX commented 4 years ago

@awchrist Lets resolve this one as you clean up the v0.6 work

PedalheadPHX commented 4 years ago

Can't turn off Picard tools till this is addressed and resolved