jsh58 / Genrich

Detecting sites of genomic enrichment
MIT License
185 stars 27 forks source link

Making sense of verbose output, especially re PCR duplicates #55

Closed levlitichev closed 4 years ago

levlitichev commented 4 years ago

Hello,

Despite reading the README and searching online, I'm still having difficulty making sense of the verbose output of Genrich. Take the example included in the README:

Processing experimental file #0: SRR5427886.bam
  BAM records analyzed:  146277509
    Unmapped:              2877550
    To skipped refs:       2560688
      (chrM,chrY)
    Paired alignments:   137462998
      secondary alns:     67724920
    Unpaired alignments:   3376273
      secondary alns:      2232056
  PCR duplicates --
    Paired aln sets:      35574220
      duplicates:          4660723 (13.1%)
    Discordant aln sets:     64679
      duplicates:             7736 (12.0%)
    Singleton aln sets:    1036778
      duplicates:           475201 (45.8%)
  Fragments analyzed:     31286234
    Full fragments:       30616993
      (avg. length: 226.4bp)
    Half fragments:         669241
      (from unpaired alns, extended to 226bp)

It makes sense that Unmapped + To skipped refs + Paired alignments + Unpaired alignments = BAM records analyzed. However, I expected Paired aln sets to be the same as Paired alignments, but there is a substantial drop-off. Same with Unpaired alignments and Singleton aln sets.

My best guess is that "set" here is meant in the mathematical sense, so Paired aln sets is also the number of unique paired alignments. But if that's the case, then I don't understand why Full fragments is less than Paired aln sets, unless alignments are thrown out for some reason besides PCR duplication. Is there some additional filtering happening under the hood?

If you could help me understand the relationships between some of these numerical outputs, I would greatly appreciate it. Thanks very much in advance!

Best regards, Lev

jsh58 commented 4 years ago

Thanks for the question. It can definitely be confusing. The term aln sets refers to all of the alignments of a read/fragment.

The discrepancy you noted is alluded to in the description of the -r option:

[...] all alignments to skipped chromosomes (-e) or genomic regions (-E) are still evaluated.

Therefore, the To skipped refs alignments are analyzed along with the Paired alignments and Unpaired alignments when determining PCR duplicates. The reason why this is done is for reads/fragments, such as those in this BAM file, that have secondary alignments.

levlitichev commented 4 years ago

Okay, that helps. Thanks very much for your quick reply.