simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0

Minor issue with position sorted bam missing mate #72

Closed christiananthon closed 4 years ago

christiananthon commented 5 years ago

When processing a position-sorted BAM file, I get the following warning:

Warning: Mate records missing for 2734 records; first such record: <SAM_Alignment object: Paired-end read 'D00635:270:CBBRUANXX:3:1102:1964:31010' aligned to chr1:[1424516,1424626)/+>.

But the read and its mate are actually right next to each other (lines 6691 and 6692 in the samtools output for chr1, shown below). My guess is that the algorithm expects the mates to have different start coordinates, but here they are identical (1424517) because the sequenced fragment is short.

6691:D00635:270:CBBRUANXX:3:1102:1964:31010 99 chr1 1424517 60 110M = 1424517 -110 GAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAACCTGGCTGACCACCATAGTCACGCAGGGCCCATC 3:>@BCGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBGGGGGG=GGBGEGGGGGGGGGGGBD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
6692:D00635:270:CBBRUANXX:3:1102:1964:31010 147 chr1 1424517 60 110M = 1424517 -110 GAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAACCTGGCTGACCACCATAGTCACGCAGGGCCCATC GGGGEGAGGGGDBGGGGGGGGGGGGC0GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCBCBC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
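As a sanity check on the two records above, here is a minimal Python sketch that decodes their FLAG values using the standard SAM flag bits and tests whether each record points at the other. The helper names (`decode_flag`, `parse_core_fields`, `is_mate_pair`) are hypothetical and not part of HTSeq, and the check does not reproduce HTSeq's own pairing logic. It only reads the first nine SAM columns, so the sequence, quality, and tag columns are left out of the example strings.

```python
# Standard SAM FLAG bits, as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def parse_core_fields(line):
    """Parse the first nine SAM columns of one record (hypothetical helper)."""
    f = line.split()
    return {
        "qname": f[0], "flag": int(f[1]), "rname": f[2], "pos": int(f[3]),
        "rnext": f[6], "pnext": int(f[7]), "tlen": int(f[8]),
    }


def is_mate_pair(a, b):
    """True if two records share a name, are first/second, and point at each other."""
    n_first = bool(a["flag"] & 0x40) + bool(b["flag"] & 0x40)
    n_second = bool(a["flag"] & 0x80) + bool(b["flag"] & 0x80)
    same_ref = (
        a["rname"] == b["rname"]
        and a["rnext"] in ("=", a["rname"])
        and b["rnext"] in ("=", b["rname"])
    )
    return (
        a["qname"] == b["qname"]
        and n_first == 1 and n_second == 1
        and same_ref
        and a["pnext"] == b["pos"] and b["pnext"] == a["pos"]
    )


# First nine columns of the two records quoted in this issue.
rec1 = parse_core_fields(
    "D00635:270:CBBRUANXX:3:1102:1964:31010 99 chr1 1424517 60 110M = 1424517 -110")
rec2 = parse_core_fields(
    "D00635:270:CBBRUANXX:3:1102:1964:31010 147 chr1 1424517 60 110M = 1424517 -110")

print(decode_flag(rec1["flag"]))
print(decode_flag(rec2["flag"]))
print(is_mate_pair(rec1, rec2), rec1["pos"] == rec2["pos"])
```

By the flag bits, 99 marks the first read of a properly paired fragment and 147 the second, reverse-strand read. The records therefore look like a well-formed pair that happens to share a start coordinate.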

simon-anders commented 5 years ago

Could you check whether the surrounding lines have different read ID? Sometimes, aligners report an uneven number of alignment lines for a read, and that messes up pairing of the lines.
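One way to run this check over a whole file is sketched below in Python. It counts the alignment lines per read name in SAM text (for example the output of `samtools view`) and lists the names that occur an odd number of times. The function name `reads_with_odd_record_count` is hypothetical and not part of HTSeq. The sketch assumes whitespace-separated SAM records and does not treat secondary or supplementary alignments specially.

```python
from collections import Counter


def reads_with_odd_record_count(sam_lines):
    """Return the sorted read names that occur an odd number of times."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        qname = line.split()[0]
        counts[qname] += 1
    return sorted(name for name, n in counts.items() if n % 2 == 1)


# Toy input: read "r1" has both mates, read "r2" has only one line.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t10M\t=\t100\t10\tACGTACGTAC\tFFFFFFFFFF",
    "r1\t147\tchr1\t100\t60\t10M\t=\t100\t-10\tACGTACGTAC\tFFFFFFFFFF",
    "r2\t99\tchr1\t120\t60\t10M\t=\t200\t90\tACGTACGTAC\tFFFFFFFFFF",
]
print(reads_with_odd_record_count(example))
```

To use it on real data, the lines could be read from a file produced by `samtools view`, passing the open file object as `sam_lines`.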

christiananthon commented 5 years ago

The context of the lines is shown below:

D00635:270:CBBRUANXX:2:2310:19291:56353 99 chr1 1424416 60 110M = 1424484 178 CCCCAACACGCATGGTGGCAGCAGCACACGTGTCCTGGGCTCCTGGTACTTCACAAACCAGGAAAGCTAGACTCTGAGTCACAGAATAAATACACTCAGCCGAGAGGGAC :30:CF=GG//FFGGGEG>BG>C>FGGGGGD@=FGD00DGDGE1=FEE:G11FGFGGGGGGG@F@BFGGGGGECGF@GG0FBFGGEDCFBCF@DFGG@C.9CB9CE AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:5:2215:8585:41857 147 chr1 1424462 60 110M = 1424390 -182 TACTTCACAAACCAGGAAAGCTAGACTCTGAGTCACAGAATAAATACACTCAGCCGAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCT GGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGFGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:3:2206:7043:73916 99 chr1 1424476 60 110M = 1424585 219 GGAAAGCTAGACTCTGAGTCACAGAATAAATACACTCAGCCGAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGA 3@BBCBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGDGDGGFGGGGGDGGGG@GGGGGGGGDEDGDD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:5:2309:4106:8108 99 chr1 1424482 60 110M = 1424573 201 CTAGACTCTGAGTCACAGAATAAATACACTCAGCCGAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGA @B>@FGEFFCF;EGGGGGGGGGGGGGGGCGGCG>>9/9//EFDGGGFGEGEGC@:1BBDGG==:F@F/CADGG<<@C<FGGDGG@FGGGGCGGF0;6.C.CB.6CC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:-6 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:2:2310:19291:56353 147 chr1 1424484 60 110M = 1424416 -178 AGACTCTGAGTCACAGAATAAATACACTCAGCCGAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAAC D/6=EBEGGGGGC/GGGGG=GGDE/CGGGGFF/GGGGGGGGGC>DGGEFF>BCGGGGGGGGGGGGAGGGGGEC:GGGFBCEFBGGGGGF=BFGGAGGGDF@CCBBAAAA< AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:2:2110:7470:28171 99 chr1 1424489 60 110M = 1424569 190 CTGAGTCACAGAATAAATACACTCAGCCGAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAACCTGGC :B@BBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDCCGGGGGGGGGGGB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:3:2306:12547:91173 147 chr1 1424489 60 110M = 1424370 -229 CTGAGTCACAGAATAAATACACTCAGCCGAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAACCTGGC GGGGGGGGGGGGGGGGEGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBCBBB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:2:2211:18272:41224 99 chr1 1424498 60 110M = 1424584 196 AGAATAAATACACTCAGCCGAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAACCTGGCTGACCACCA A3BABGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGEGGGDGGAGGGGGGGGGGGGGEGGGGEGGD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:2:2112:20307:2034 99 chr1 1424514 60 1S109M = 1424605 202 NGCCGAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAACCTGGCTGACCACCATAATCACGCAGGGCC !3<AGG@EGGGGGGGGGGGGGGGGGDGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGDGGGGGGGGEBGGGEGGGEG/DGGGGGDGGGGG AS:i:-4 ZS:i:-14 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:96G12 YS:i:-10 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:5:1107:5679:43464 99 chr1 1424514 60 110M = 1424615 211 GCCGAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGCCCTCGAACCTGGCTGACCACCATAGTCACGCAGGGCCC =@:@BEDGG>/0/9CGGGGEG1<FG><B/<C1@1<:11<EA<CAD/CFGGGGD0FBDGC0FF@@FG>////CE/CDGC>DECGGGGGG/.C6@@/C//C<BDA;..CG AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:71A38 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:3:1102:1964:31010 99 chr1 1424517 60 110M = 1424517 -110 GAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAACCTGGCTGACCACCATAGTCACGCAGGGCCCATC 3:>@BCGGGGGGGGGGGGGGGGGGGGDGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBGGGGGG=GGBGEGGGGGGGGGGGBD AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:3:1102:1964:31010 147 chr1 1424517 60 110M = 1424517 -110 GAGAGGGACCGCTGTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAACCTGGCTGACCACCATAGTCACGCAGGGCCCATC GGGGEGAGGGGDBGGGGGGGGGGGGC0GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCBCBC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:3:1316:14100:87578 99 chr1 1424530 60 110M = 1424616 196 GTGCTCCTGGAGGTTCTGTCCTCGCGGCTGGACACACCTGCTCCTCTCTGGGGGGACCTCGAACCTAGCTGACCACCATAGTCACGCAGGGCCCATCGGACGGAATGGGG :@@>AC@FGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGDGGGGEGGGCDGDCGBDEGG AS:i:-5 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:66G43 YS:i:-1 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:2:2110:7470:28171 147 chr1 1424569 60 110M = 1424489 -190 GCTCCTCTCTGGGGGGACCTCGAACCTGGCTGACCACCATAGTCACGCAGGGCCCATCGGACGGAATGGGGGACACAGAGGACACCCGAAGTCGGAAGCTCCAGGAGAAC G>G>C.<DGGF@CGGGGGEGGGGGGGGGGGGGGGGGGGGGGGBGGDGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGBCCCC AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:5:2309:4106:8108 147 chr1 1424573 60 110M = 1424482 -201 CTCTCTGGGGGGACCTCGAACCGGGATGACCACCATAGTCACGCAGGGCCCATCGGACGGAATGGGGGACACAGAGGACACCCGAAGTCGGAAGCTCCAGGAGAACAGCT DC8.8...0008C800908...0;;080/=<0>GC:09/GDGFC900@9=///90F:11FE/<DB:1CGF>1FF>EGEGFGF9B</F1GE;1B>1E>GGB>F?0A333 AS:i:-6 XN:i:0 XM:i:2 XO:i:0 XG:i:0 NM:i:2 MD:Z:22T2C84 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:2:2211:18272:41224 147 chr1 1424584 60 110M = 1424498 -196 GACCTCGAACCTGGCTGACCACCATAGTCACGCAGGGCCCATCGGACGGAATGGGGGACACAGAGGACACCCGAAGTCGGAAGCTCCAGGAGAACAGCTGTGCCCTCATT BDGGGGGGGGGGGF=GGGGGGGGCGGGBGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGFGGGGGFGGGGGGGGGGGGGGGD<GFEBBCCB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:3:2206:7043:73916 147 chr1 1424585 60 110M = 1424476 -219 ACCTCGAACCTGGCTGACCACCATAGTCACGCAGGGCCCATCGGACGGAATGGGGGACACAGAGGACACCCGAAGTCGGAAGCTCCAGGAGAACAGCTGTGCCCTCATTG GDFG@GEGGGFEGFEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFBBCBB AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:2:2112:20307:2034 147 chr1 1424605 60 98M4997N2M10S = 1424514 -202 CCATAGTCACGCAGGGCCCATCGGACGGAATGGGGGACACAGAGGACACCCGAAGTCGGAAGCTCCAGGAGAACAGCTGTGCCCTCATTGTTGGTGACCTGGTTGGGAAC EDF0=0F>.DGEGGGGGFGGGDGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGDGGGGFGGGGGGGGGGGGGGGGGGGGGGGGFGFFGGGGGGGGGBGGGGGGGBCBBB AS:i:-10 ZS:i:-10 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:100 YS:i:-4 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:5:1107:5679:43464 147 chr1 1424615 60 88M16191N22M = 1424514 -211 GCAGGGCCCATCGGACGGAATGGGGGACACAGAGGACACCCGAAGTCGGAAGCTCCAGGAGAACAGCTGTGCCCTCATTGTTGGTGACCTGGTTGGGAACCTGAGTTACG C0090=BGF>88BF8FF0;//::=0:>F@C00CGGCGBGGGGE/>C@C=1:11GGGD>CGGGDGGFDGBE1:C1<FCDEGEGGGF@/F/B@1FCGFC1G>B=BAA: AS:i:0 ZS:i:-20 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:-3 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:3:1316:14100:87578 147 chr1 1424616 60 87M16191N23M = 1424530 -196 CAGGGCCCATCGGACGGAATGGGGGACACAGAGGACACCCGAAGTCGGAAGCTCCAGGAGAACAGCTGTGCCCTCATTGTTGGTGACCTGGTTGGGAACCTGAGTTACGT =GGGGFFGGGGGGGGGGGGGGGGDGGGGGGGGBGGGGGGGGGGGGGGEGGGGGGCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGFGGGCGGGGGCBBBA AS:i:-1 ZS:i:-21 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:-5 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:4:1302:19297:21172 99 chr1 1424654 60 49M16191N61M = 1440932 197 CCGAAGTCGGAAGCTCCAGGAGAACAGCTGTGCCCTCATTGTTGGTGACCTGGTTGGGAACCTGAGTTACGTTTCAAACTGGGTTAAGAGCTCCCTTATTTCCGGGGCCA @@BBBGGGGAGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGDGEGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGG@GGG AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:110 YS:i:0 YT:Z:CP XS:A:- NH:i:1
D00635:270:CBBRUANXX:3:2211:18260:79879 99 chr1 1424655 60 48M4997N62M = 1429702 160 CGAAGTCGGAAGCTCCAGGAGAACAGCTGTGCCCTCATTGTTGGTGACCTCCCAGCCTGATGCA

BhavanaNayer commented 5 years ago

Could you figure out a solution to this problem? Please post it here if possible. Thanks!

iosonofabio commented 4 years ago

I currently have time to look into this if you share a BAM file and a GTF file that reproduce the issue, e.g. via Google Drive. Thank you.

iosonofabio commented 4 years ago

The official repo for htseq has been moved to: https://github.com/htseq/htseq. Please reopen the issue there and attach a BAM and a GTF file (e.g. share a Google Drive link), then I can take a look at the problem.

Closing this one.