phasegenomics / hic_qc

A (very) simple script to QC Hi-C data.
GNU Affero General Public License v3.0
25 stars 5 forks source link

Clear out read variables so we don't accidentally count reads twice i… #40

Closed shawnpg closed 4 years ago

shawnpg commented 5 years ago

Clear out read variables so we don't accidentally count reads twice if something unexpected happens regarding ordering of secondary and/or supplementary alignments, which could lead to overcounting zero-distance read pairs

bnelsj commented 5 years ago

Has this issue been observed? Supplementary and secondary reads are skipped, so this shouldn't happen. That said, clearing out the variables is probably not a bad idea.

shawnpg commented 5 years ago

No, we haven't specifically observed this, but Max and I were able to construct a theoretical case where it led to a spurious zero-distance read pair being counted if the BAM file orders the reads a certain way (page 11 of the spec has an example which shows that it can order things in unexpected ways too https://samtools.github.io/hts-specs/SAMv1.pdf). The test case is on my whiteboard if you want to talk through it. It seems like a very low risk change that can only make things better, and alleviates something that gave us the willies, so we sent it through.

shawnpg commented 4 years ago

This is old but still a good change to get in.