open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
104 stars 32 forks source link

Question about pairtools parse strand information #167

Closed jiangshan529 closed 1 year ago

jiangshan529 commented 1 year ago

Hello, I am trying to understand the details about how pairtools works. In my pair.gz file, I choose a read pair to do the test, the read pair looks like this:

SRR710074.10273713      chr1    1054204 chr1    1054648 +       -       UU 

Then I searched the line number in .bam file and found:

SRR710074.10273713      81      chr1    1054613 60      36M     =       1054204 -445    ACTCCACCCCCCAGCGCCCACCCTTGAGTCAGGGTG
SRR710074.10273713      161     chr1    1054204 60      36M     =       1054613 445     GTCGCTCCAGTCTGAGCCTGGCCGTCGCCTCCAGCA

I use the blat tool in IGV genome browser and found actually the two reads are both on + strand, not as shown in the pair.gz file that one is on + strand and another on - strand. So how should I understand this?

I also searched 'UU' in the pair.gz file and do blat for some 'UU' pairs, and some of them can be blasted to several sites on the hg38 genome. And it is not what defined by 'UU'. So how should I interpret this? Thanks for your help!

agalitsyna commented 1 year ago

hi, @jiangshan529

Firstly, Hi-C is a paired-end sequencing method; alignments in a pair can originate from the opposite sides of your DNA molecule. See explanations in our docs on bam parsing.

Next, the orientation of the alignments is recoded as SAM tags in the second field of your sam/bam file, and you can decode them, e.g., here: https://broadinstitute.github.io/picard/explain-flags.html

In your case:

So two alignments you report are already mapped to the opposite strands of DNA. IGV reports something else, but this might be some IGV convention of reporting for paired-end reads, which is unrelated to conventions of Hi-C data interpretation.

Finally, let's dig into what standard pairtools parse is doing (and I assume you did not change reporting orientation settings, right?)

  1. It reports two alignments as a pair, and it will be something like: SRR710074.10273713 chr1 1054648 chr1 1054204 - + UU

  2. However, the left alignment has a larger coordinate than the right one, and pairtools by default performs the flipping procedure so that the alignment with smaller coordinate is always first. So the pair will look like this: SRR710074.10273713 chr1 1054204 chr1 1054648 + - UU

Let me know if you have further questions. This is also a discussion rather than an issue. I will transfer it there for you, but I appreciate it if this kind of help request would go into discussion directly next time. Thanks.