open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
104 stars 32 forks source link

Coordinate Sorted vs Read Sorted BAM files give drastically different results #102

Closed aakashsur closed 2 years ago

aakashsur commented 3 years ago

So I've noticed that when using bam files as input to pairtools parse, it matters if it is straight from the aligner, coordinate sorted, or read name sorted. This is my workflow -

pairtools parse \
    --assembly assembly_name \
    --chroms-path sizes.txt \
    --drop-sam \
    coordinate_sorted.bam | \
pairtools dedup \
    --mark-dups | \
pairtools select \
    "(pair_type=='UU') or (pair_type=='UR') or (pair_type=='RU')" | \
cooler cload pairs \
    --assembly assembly_name \
    -c1 2 \
    -p1 3 \
    -c2 4 \
    -p2 5 \
    sizes.txt:1000 \
    - \
    output.cool 

If I use the coordinate sorted bam, i.e. the output of samtools sort, then I get the following matrix -

[[1 0 0 ... 0 0 0] [0 1 0 ... 0 0 0] [0 0 0 ... 0 0 0] ... [0 0 0 ... 0 0 0] [0 0 0 ... 0 1 0] [0 0 0 ... 0 0 3]]

If I use the read name sorted bam, i.e. the output of samtools sort -n, then I get the following matrix -

[[233 233 6 ... 2 0 0] [233 423 241 ... 1 0 1] [ 6 241 392 ... 2 2 1] ... [ 2 1 2 ... 642 282 19] [ 0 0 2 ... 282 448 217] [ 0 1 1 ... 19 217 406]]

I have the following questions -

1) Is this expected behavior? If so, perhaps there should be a warning.
2) I believe normally the output of aligners is read grouped, does this mean I can expect the same results from the regular bam output and the read sorted bam?

Phlya commented 2 years ago

Sorry it's been so long, I'll close this issue.

The pairs have to be coordinate sorted for deduplication to work correctly. So, sorting has to be done before deduping.