simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0
122 stars 77 forks source link

Mate not appeared for duplication reads #64

Closed 10-m closed 6 years ago

10-m commented 6 years ago

I'm checking the read mates using pair_SAM_alignments. When a read is duplicated, the other mate isn't always available. Do I misunderstand how to use?

sam_reader = HTSeq.SAM_Reader("duplication.sorted.sam")
for first, second in HTSeq.pair_SAM_alignments(sam_reader):
    if (first != None) and first.mate_aligned and first.pcr_or_optical_duplicate:
        print('second:', second) # Always None
simon-anders commented 6 years ago

Your code looks right.

Maybe it's an issue with your SAM file. Perhaps try the following: Take one of IDs of the alignments that enter the if clause (by putting there a print( first.read.name )), grep the SAM file for that ID and check the output manually (or post it here if you're unsure). Check whether there really is a mate.

10-m commented 6 years ago

Thanks a lot! I guess the sam file looks correct.

for first, second in HTSeq.pair_SAM_alignments(sam_reader):
    if (first != None) and first.mate_aligned and first.pcr_or_optical_duplicate:
        print('---')
        print('First  : ', first)
        print('Second : ', second)

then

---
First  :  <SAM_Alignment object: Paired-end read 'M00434:7:000000000-BPCKP:1:1105:24161:6680' aligned to chr1:[16696,16797)/+>
Second :  None
....

and sam file as follows

head -n 100 duplication.sorted.sam | awk '$1 ~ /M00434:7:000000000-BPCKP:1:1105:24161:6680/{print("line", NR, ":", $0)}' | cut -f1-9
line 45 : M00434:7:000000000-BPCKP:1:1105:24161:6680    1123    chr1    16697   23      101M    =       16772   176
line 48 : M00434:7:000000000-BPCKP:1:1105:24161:6680    1171    chr1    16772   23      101M    =       16697   -176
simon-anders commented 6 years ago

What's in lines 46 and 47?

Are they really not the same read ID? I assume that not; otherwise your awk command would have found them.

pair_SAM_alignment always takes a block of consecutive lines from the SAM file and sorts it into pairs. This is why you have to sort by read name (not: by position) first, to ensure at that all alignments for the same read pair are in adjacent lines.

If you don't want to sort, try pair_SAM_alignment_with_buffer, which keeps in a buffer in memory the alignments for which the mate alignment has not yet been encountered. The catch is that this buffer can fill up quickly if many mates are far apart from each other.

10-m commented 6 years ago

I misunderstood the sort method. *_with_buffer worked fine for me.