Closed 10-m closed 6 years ago
Your code looks right.
Maybe it's an issue with your SAM file. Perhaps try the following: Take one of IDs of the alignments that enter the if clause (by putting there a print( first.read.name )
), grep the SAM file for that ID and check the output manually (or post it here if you're unsure). Check whether there really is a mate.
Thanks a lot! I guess the sam file looks correct.
for first, second in HTSeq.pair_SAM_alignments(sam_reader):
if (first != None) and first.mate_aligned and first.pcr_or_optical_duplicate:
print('---')
print('First : ', first)
print('Second : ', second)
then
---
First : <SAM_Alignment object: Paired-end read 'M00434:7:000000000-BPCKP:1:1105:24161:6680' aligned to chr1:[16696,16797)/+>
Second : None
....
and sam file as follows
head -n 100 duplication.sorted.sam | awk '$1 ~ /M00434:7:000000000-BPCKP:1:1105:24161:6680/{print("line", NR, ":", $0)}' | cut -f1-9
line 45 : M00434:7:000000000-BPCKP:1:1105:24161:6680 1123 chr1 16697 23 101M = 16772 176
line 48 : M00434:7:000000000-BPCKP:1:1105:24161:6680 1171 chr1 16772 23 101M = 16697 -176
What's in lines 46 and 47?
Are they really not the same read ID? I assume that not; otherwise your awk command would have found them.
pair_SAM_alignment
always takes a block of consecutive lines from the SAM file and sorts it into pairs. This is why you have to sort by read name (not: by position) first, to ensure at that all alignments for the same read pair are in adjacent lines.
If you don't want to sort, try pair_SAM_alignment_with_buffer
, which keeps in a buffer in memory the alignments for which the mate alignment has not yet been encountered. The catch is that this buffer can fill up quickly if many mates are far apart from each other.
I misunderstood the sort method. *_with_buffer worked fine for me.
I'm checking the read mates using pair_SAM_alignments. When a read is duplicated, the other mate isn't always available. Do I misunderstand how to use?