simon-anders / htseq

HTSeq is a Python library to facilitate processing and analysis of data from high-throughput sequencing (HTS) experiments.
https://htseq.readthedocs.io/en/release_0.11.1/
GNU General Public License v3.0

Error fetching reads (bam_reader.fetch) with PACbio/minimap2 reads #89

Closed fgypas closed 4 years ago

fgypas commented 5 years ago

Hi

I want to use HTSeq (0.11.2) to fetch some PacBio reads that were mapped with minimap2. You can find an example read at the end of this post. When I try the following example:

import HTSeq

# bam: path to the indexed BAM file; region_string: e.g. "1:14362-15362"
bam_reader = HTSeq.BAM_Reader(bam)
for aln in bam_reader.fetch(region=region_string):
    print(aln)

I get the following error:

    for aln in bam_reader.fetch(region=region_string):
  File "miniconda3/envs/htseq_latest/lib/python3.6/site-packages/HTSeq/__init__.py", line 1111, in fetch
    yield SAM_Alignment.from_pysam_AlignedRead(pa, sf)
  File "python3/src/HTSeq/_HTSeq.pyx", line 1327, in HTSeq._HTSeq.SAM_Alignment.from_pysam_AlignedRead
AttributeError: 'NoneType' object has no attribute 'encode'

Do you know what the issue is? Should I be using HTSeq to fetch PacBio reads at all? The following works fine:

import itertools

for a in itertools.islice(bam_reader, 5):
    print(a)

Example of mapped read:

m54273_190430_131817_42271220_76742_83968 0 1 14356 1 146S5M2D13M1I8M1I2M1D8M1I9M1I18M1I15M2I30M1D3M1I10M1I8M2D23M1I7M1I5M1I30M1I25M1I31M1D11M1D40M1D5M2D19M1D6M1I14M1I66M2I51M140N18M1D42M1D7M272N7M1D18M1I2M1I2M1I6M1I23M1I6M1D2M1I8M1D37M1D22M1D3M1D4M1D31M1I4M2I28M275N17M1D4M23I6M1I4M1D26M1I25M2I5M1I19M1I23M1D12M1I31M1D4M2D4M1D14M1I6M1I9M4I41M1D10M1D26M2I1M1I11M1D8M1D18M2D4M1D69M5I52M3I12M1I4M1I31M1I5M296N11M3I2M1D8M1D17M1I5M1D13M1I8M1D15M1I24M1D11M1D8M1D30M88N16M1I3M1I35M1D8M1I14M1D13M1D13M1D18M5I17M1I20M1I25M1I16M177N6M1I7M1I7M1I32M1I10M1D13M2I10M1D17M1D2M2D27M237N6M1D7M1I34M1I11M1I26M1I37M1D6M1D7M172N17M1I21M1D34M1I8M1I43M1D17M1D4M206N10M4I15M1I12M1I18M1I11M1D9M2I9M2D12M546N42M1D13M1D7M1D1M1D4M1I26M2D19M1D47M1I4M1I2M1I12M1D9M1I7M1I20M1I55M1I10M1I25M1I11M1D4M1D17M1I9M1D3M1I2M1I14M1D21M1D24M1D11M1I4M1D37M1D6M1I13M1D8M1I8M1I4M1I16M1I16M1D37M1I6M2D2M2I8M1D52M1I24M1D40M1I25M1D11M1D31M1I1M1I14M1I12M1I6M2I36M1I34M1D3M1I12M5I8M1I4M1I9M1I33M1I4M1I18M1D9M1D5M1I3M1D63M4I10M2I1M1I19M1I2M1D16M2I4M1I31M1D15M1D4M1D38M1D2M1D7M1D25M1I11M1D1M1D45M1I1M1I17M1I2M1I26M1I39M1D9M1D12M1I31M1D44M1I15M1D12M1D34M1D44M1D31M1D19M1D28M1I6M1D9M1D16M1D18M1I33M1D81M1D4M1D5M1I9M1I38M1I11M1I4M1D12M1D19M1D36M1I4M1D24M1I40M1D5M1I4M1D2M1D68M1I2M1D6M1I18M1I17M1D11M2I2M1D20M1D13M1D5M1D3M1D10M1D13M1I25M1D13M2I31M1I27M1I41M1I10M1I86M1D16M1D25M1I51M1D31M1D8M1D18M1I1M1I8M1I49M1D10M2I24M2I14M1D25M1D65M1D7M1I28M1D5M1D85M1D3M1D18M1D7M1D14M1D15M1D4M1D10M2I16M1I10M1D10M1I15M1D39M1D20M1I27M1D7M1D16M1I16M1D13M1I8M1I3M1D5M1D20M1D4M1I36M1D2M2D1M1D4M1I10M1I7M1I25M1D10M1D10M1D12M1I29M2D19M1D57M1I14M1I4M1D11M1I12M1D6M1I30M2I33M1D8M1I18M1D16M1I7M1I6M1D11M1D21M1I23M1I4M1D5M1D20M1D21M1I12M1D7M1D14M1D19M1I8M1D14M1D10M1D4M1D8M1D16M1I9M1D13M1I1M1I16M2I5M1I83M1I14M1D14M1I4M1I45M1I17M1D33M1D8M1D11M1I5M1I4M1I15M1D12M1I16M2D14M2I1M1D55M1D15M1D14M1D10M1I5M1D10M1D2M13I4M1I2M1I5M1D40M1D15M1D7M1I4M1I7M1D27M1D53M1D16M1I35M1I10M1D20M1I5M1I19M1I7M1D36M1D22M1I6M1D8M1I10M1I21M1D30M1I4M1I8M31S 0 0 
AAGCATGGATCCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTGTGTTTTTTGTTTTTTTTTTTTTGGTGGGGGGTGTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGCCCTGCACAGCTAGAAGGTCCTTCTATAAAAGCATCACTGTTGGTTTTCTGCTCAGTTCTTTATTTGATTGGTGTGCCGTTTTTTCTCTGGAAGCCTCTTAAGAACACAGTGCGTCAGGCTGGGTGGGAGCCGTCCCCATGGAGCACAGGCAGACAGAAAGTCCCGCTCCCGCATCTGTGTGGCCTCAAGCCAGCCTTCCGCTTCCTTGAAGCTGGTCTCCACACAGTGGCTGGTTCCGTCACCTCCTCCCAGGGAAGCAGTCTGAGCAGCTGTCCTGGCTGTGTCCATGTCAGAGCAACGGACCAAGTCTGGTCTGGGGGAAGGTGTCATGGAGCCCCTAGGGATTCCCAGTCATCCCTTGTCCTCGTCTACCTGTGGCTGCTGCGTTGGCGGCAGAGGAGGGCTGGAGTCTGACACGCGGGGTCAAAGGCTCCTCCGGGCCCCTCACCAGCCCCAGGTCCGTTCCCAGAGATGCCCTTGTGCCTCAGGACCACTTGTTGAAGAGATCCGACATCAAGTGCCCACCTTGGCTCGTGCTCTCACTGGGGTACAGAGCAAGGCAAAAGCAAGCCGCCTGGGTAACAAGCTCAAAACCATAGTGCCCGAGGGCATGTCCGCTGCAGCGCCGGCATCGCATCACACCAGTGTCTGCGTTCACAAAGGCATCATCAGTAGCCTCCAAGGCTCAGTCCATTCTCTAAAAATATCTCAGGAGGCTGTCAGTGGGGCTGACCATTGCCTTGGACCGCTCTTGCTTGCTCCTGCTCCTTCGCTGTTTTTTTTTTTTTTTTTTTTTTTCTTCTTCCTCCGCTTTCGCTCCTTCATGCTGCGCAGCTTTGGCCTTGCCGATGCCCCCAGCTTTGGGCGGAATGGACTCTAGCAGAGTGGCCCAGCCACCGGAGGGGTCGACCATTCCCTGGGAGCTTCCCTGGACTGGAGCCGGGAGGTGTGGAACAGGCAGAGGAAGGCTGCTCAGGCAAGGGCTGGGGGCAGCTTAACTACTGTGTCCAAGAGCCTGCTGGGATTGAAGTCACCTCCTCCAAACGAGGACCCCGCGCTGGGGAGGCCGGACCTTTTGGAGAGACTGTGTGGGGGGCCTGGCACTGACTTCTGCAACCCTGGCGCGGGCATCCTGTGTGCAGATACTCCCTGCTTCCTCCTTAGCCCCCACCCTGCAGAGCTGGACCCCTGGGGGGAGCTAGCCATGCTCTGACAGTCTCAGTTGCACACATGAGCCAGCAGAGTGGTTTTTTTGTGCCACTGTCTGTGATGATAGTGTTACACTGGGAGATACAGCAGTTGAAGCTGAAGGAGACGTTGCCTTCTGCTCTGTCGTCCTGCTGGGCTGCCTTGCCTACAGGGGCCGGCGGTTGAGGTGGGAGTGGGGGTGGCACTGGCCAGCACCTCAGGAGCTGGGGTGGTGGTGGGGCGGTGGGGTGGTGTTAGTACCCCATCTTGTAGGTCTGCCTTGAGAGGCTCAGGCTTACCTCAGTGTGGAAGGTGGGCAGTTCTGGAATGGGCCAGGGGCCAGAGGGGGCAATGCGGGGCCCAGGTCGCAAGGTACATGAGTCGTTGGCAATGCTGGGAAAAAAAAGTCAGGCAGGTAGGAATGTAACATCAATCTCAGCCAACCGGGCCCAGGTCTGGCACATAGACAGTAGTTCTCTGGGACCTGCTGTTTCCAGCTTGCTCTCATCTTGCTGATGGACAAGGGGGCATCAAACAGCTTTCTCCTCTGCTCCGCCCCCAGCAAATCACATGGGCTTTGTTACAGCACCAGCAGGGTCCAGGAAGACCTACTTCTTGTACAGGTTCCGGTGGTGGTTGAAGAGCAGCAAGGAGCTGACAG
AGCTGATGTTTGCTGGGAAGTACCCCCAGGTCCCTCTTCTGCATCGTCCCTCAGGCTCCGGCTTGGTGCTCACGCACACAGGAAATTCTTCGCTTCTCCTGCAGGGCCGCCTCGTCCCAGGGGGCGGTGCTTGCTCTGATCCTGTGGCGGGAGCGTCTCTGCAGGCCAGGGTTCCTGGGCGGCCCGTGAAGATGGAGCCATATTCCTGCAGGTGCTCTGGAGCAGGTACTTGGCACTGGAGACACCTTGATGGCCTTCTTTCTTGCTGCCCTTGAATCTTCTCAATCTTTGGCCTGGGCCAAGGAGTACCTTCTCTCCATGGCCTGCCAACCTGGCTCGCTCTGCTCTACCTGCTTCCATCCTCCCTGGTGCGGGGTGGGCCCAGTGATATCGCTGCCTGCTGTTCCCAGATTCCAATATGCATTCTTGTGTCTTTGCGTCTCAGAACGCCATTTCCCCAGACTCCCTGTGGCTGGCTCATGATGCCCGAGGCCCAAGTGTCTGATGCTTGTAAGTGCAACATCACCCCACCATGCTTTTCCCATGTTCCTTTGGCCGCAGCAAGGCAGCCTCTCACTGCAAAGTTAACTCTGATGCGTGTGTAACACGACATCCTCCTCCCCGTCCGCCCCTGTAGGCTCCCCTACCTCCAAGAGCCCAGCACCATGCCCACAGGCCCACTCCACATGCACAGCAAGCCTCAGCCTCAACTCGGGCATGAGCGAGCTGTGTGGTGCGCAGGGATGATAGGCAGAGGCGCGACTGGGGTTCTGAGGAAGGGCAAGGAGAGGATGTGGGATGGTGAAGGGGTTTGAGAAGGCAGAGCACGACCTGGGGTTCATGAGAAAGGGAGGGGGGAGGAATGTGGGGATGGTGGAGGGGCTTGCAGACTCTGGGCTAGGAAAGCTGGGATGTCTCTAAAGGTTGGAATGAATGGCCCTAGACCGTGACCCAATAGCCACAGCACCTTCCACCAACGTTAGAAGGCCTTGGCCCCCAGAGAGCCAATTTTCACAATCCAGAAGTCCCCGTGCCTAATGGGCCTGCCCTGATTACTCCTGGCTCCTTGTGTGTCAGGGGGCTCAGGCATGGCAGGGCTGGAGTACCAGCGGCACTCAAGCGGCTTAAGTGTTCCATGACATGAACTGGTATGAAGGTGGGCTACAATTCATGAAAGAACAAAAAGACGCGCACCATCGCCTTCCATTGAGGAAGCGGGGGCCACCACCACGCGTGTGCTCCATCTTTTCTGCTCGGGGAGAGGCCTTCAAATCATCTGCGTGTAGAAGGGTCCTGCCAGCACAAGCTGTTTTAATTGACACTAGTTCCTCAGGCGCAGCCTCGTTCTGCCTTGGTGCTGACCGACCATTCGTAGGTGCATAAGCTCTGCATTCGAGGTCCACATGGGCAGTGGGAGGGAACTGAGACTGGGGAGTGGGGGACAAAAGTCCTTGCTCTGGCCTGGTGCTCCACAAAAGGAGAAGGGCTGAATTCACTTCCAAGTTGCGAACACCAAGCTCAACAATGACCCTGGAAAATTTCTGAATGATTATTAAACAGAGAGTCTGTAAGCACTTAGAAAAGGCGTGTGCGTCCATGGGCCAGCACTGCTCGAAATGTTACAGCATTTCCTTGTAACAGGATTATTAGCCTGCTGTGCCCGGTGAAAACATGCAGTCGACAGTGCATCTCAAGTCGAGTCAGGATTTTGACGGCTTCTAACAAAATTCTTGTAGACAAGATGGAGCTATGGGGGTTGGAGGAGAGACATATAGGAAAATCAGAGCCAAAATGAACCACAGCCCCCAAGGGCACAGTTGACAATGGACTGATTCCAGCCTTGCACGGAGGGATCTGGCAGAGTCCCATCCAGTTCATTCACACCTGTTTAGAAACTGGGTCCAGCACACAGGGGAAGGGTAAGCTGTTTCATGAGCGAATCAAGGCTCAGACAATTTTTTAAGGCCAGAGGTAGACTGCAATCACCAAGAGGAAATTTA
CAGGAACAAATGTGAAGCCCACATTTAGGTTTTAAAAATCAAGCGTCTAAAATACGAAGGTGGAGGAACTTGCTTTAGACCCGTTCAGGTGAAGAAAGAAACTGGAAACTTCTGTTAACTATAAGCTCAGTAGGGCTAAAAGCATGTTAATCGGCATAAAAAGGCAATGAGATCTTAGGGCACACAGCTCCCCGCCCCTCTTCTGCCCTACATCTTCTTCAATTCAGCAGGGAACCGTGCACTCTCTTGGAGCCACCACAGAAAACAGAGGTTGCATCCAGCATCCACGAAAACAGAGCCCCACAGAAAACAGAGGGTGCTGTCATCCCCTCCAGTCTCTGCACACTCCCAGCTGACAGCGAGCAGAAGGAGAGAGCACAGCCTGGCAATGCTAATTTGCCAGGAGCTCACCTGCCTGCGTCACTGGCACTAGACCCGTGAGGCCAGAGGCCGGGCTGTGCTGGGACCTGAGCTGGGTGGTGGGGAGAGAGTCTCTCCCCTGCCCACTTCTCTTCCCGTGCAGGAGGAGCGTGTTTTAAGGGGAAGGGTTCAAGCTGGTCACAAATCCCACCAAAAAAAGCCATGGCAACGAAAAGCCCCTAGCTGTCAGTGCCACAGAGGGGCAAGTGGTAGGAGTAGAGGTGGCGGTGCTCCCCCTCCACTGCCAGTCCCGTCACTGGCTCTCCCTTCCCTTCATCCTCGTTTCCCTATCTGTCACCATTTCCTGTCGTTTGTTTCCTCTGAATGTCTCACCCTGCCCTCCCTGCTTGCACAGTCCCCTGTCCTGTAGCCTCACCCCTGTCGCATTCCGACTACAATAACAGCTTCTGGGTGTCCCTGGCATCCACTCTCTCTCCCTTCTTATCCCTTCGTGACGGATGCCTGAGAACCTTCCCCCAACTCTTCTGTCCTCATCCCTGCCCTGCTCAAATTCCAATCACAGCTCCCTAACACTCCTGAATCACTGAAAGTCCTGTCTTGAGTAATCCGAGGGCCTACCTCCTCATCCCGACTCTTCACATCCACTGCCCTTGCCCCACACCCTGCCAGGGAGCCTCCCGTGGCACCGTGGGGACACAAAGAACCAGGGCCAAAAGCTCCCGCAGCCCCATTCAAAATGAGGCCTGGCCCACGGCTCACTGAAAGTCAGCCTCTCATCCCCGAGAGATGAGTGCAAGGGAGAGGCATCGCTGTCTGTGCTTCCCATGCAGAAGCTCCCCCCTCCACCCCTTTTGCAGGCCGGCCTTCGCGGCATACCACATACCCACGTTCCAAGCCACACTGAGGCCTCCCTCCAAGCCTGCAGCCCCCATTTCCAGACCCAACCAGGGCAACCTGCATATCCACCTCCTCCCTGCCCCCCTCTTCCAGAGTCTGCCTTTGTGGAGTAACACGTGGTTTTCCTCTCAGAACTATTCCCTTTTTTTACTCAAGCAATGGGCCCCATTTCCTTGGGGAATTCCATCTCTCTCGCAGCTTAGTCCCAGAGCTTCAGGTGGGGCTGCCCACAGAGCCCTCAGTCTAAGCCAAGTGGTTGTGTCATAGTCCCCTGGCCCAAGTAAGGATTCTGGATGAACATGAGGACGCAAGCCAGGTGGGATGGTGAGTGTGGCTTCCTGGAGGACAGTGGGACCAGGACAGCATTCTTTCCTGTGGACCCTAACCCTGTGTCATGTCACCTTGCTACCACGAGACAGCTGTCCTGGGAATGCAAGCCAGACCCCAAAGAAGCAAACTGACATGGAAGAAAGCAAAACAGGCCCTGAGACATCATTTTAGGCCCTTACTCCGAAGGCTGCTCTACTGATGTTAATTTTTGCTAGGCTTGTCTGGGGAGTTCTGACAGGCGTGCCACCAATTCTTACCGATTACTCTCCACTCTAGAACCCTGAGAAGCCCTACGCGCTCATGCTAGTCAATTAACAATCATCTCGCCCCTATGTGTTCCCATTCCAGCCTCTATGAACCCCAGTGGCAGCCACATAATTGGTATCTCTTAA
GTCCAGCAGCGAGGTGGAGCACATGGTGAGAGACAGATGCAGTGGACCTGGAAACCCAGAGTGAGGGAGCAGGCATCAGGCCCAAGGCTCGCTGAGAGGCATCTGGCCCTCCCTCGCGCGTGCCGCAGCTTGGATAACCCACACCAATGAACGCAGCACTCCACTGACCCAGGAAATGCTTCCTGCCTCTCCTCATCCCTCCCTGGGCAGGGGACATGCCAACTGTCACAAGGTGCCAAGTCCAGGACAGGAAGGAAGATGCCAAAATCCAGCGCTGCCTCTCAGAGAAGGCAACCACGCAGCTCCCCCATCTTGGCAAGGATAAACACCAATTTCCGAGGGAATGGTTTTGGCCTCCATTCTAAGTGATGGACCTGGGGTGGCCATAATCTGGAGCTGATTGCTCTTAAAGAACCTGCATCCTCTTCCTAGGCGTCCCTCGGGGCAACATTTAGCACAAAGCTAAGCACAAAAGGTGCATCCAGCACTTTGTTTCCTATTGGTGGCAGGTCCTGAATGGCAACCAAAGGCAGTGTACGGGTCAGATTATCACAGGGAAGAGAATAGCAATTTGCCTGAAGGCTTCCTAGTGCCAGGCACTGGTTTCATTCCTTTGCATTTTGATTAATTTATGAATTTAAAATAATTCTACCAGGAAGCTACCATGATTGACACAACTTCAAAAATGAGACACCGAGGCTTAGAGGGTTGGGTGGCCCAGGTTACAGAGGGAAGAAACAGGGGACTTTTTTGGGGGGGGGGATACTGGAACCAGGCATCAACTCCAAGGTAACCCCTCAGTCACTTCAGTGTTGTCCCCTGGTTACTGGACATTCCTTGAACAAGCTGGGGCAAGCCGGTGAGTCAGTGGGTGAGACTTTCCGGAAGAGGTGGTTTCCCAGTTGGTGACAGAAGCGGAGGCTGCAAATGGAAGAGCAGGGGCTAAAACGTCTGACGACAACCAGGGAATGGACAGGGCAGGGGATGACTTGACAACGAGAGGCACCCGAGTTCAGGCCAGTCACATACTTCCCGCTGGGGGTCTCCATGTGGGGCATGGTGTGGGATCCTGGGAAGGAGACAAGCCTCATTTCAGGTTGCTTCATGGCCAATACAGGAACCTGTGTACACCGACAACCCCTGGGACCTTTGCAAAAAACAAGCAAACACCATTCACTCACGTCATAGTTAGATACCCATGTACTCTGGCGTTGATACCACTGCTT NM:i:678 ms:i:4917 AS:i:4575 nn:i:0 ts:A:- tp:A:P cm:i:582 s1:i:3330 s2:i:3314 de:f:0.0821 rl:i:189

Thank you in advance,
Foivos

simon-anders commented 5 years ago

Maybe the first 5 reads are fine, but within the selected region, they are somehow messed up. Can you run the simple loop without the islice (and perhaps without the print), to see if you get through the whole file without error?

fgypas commented 5 years ago

Hi @simon-anders

Thanks for the quick reply. I looped over the file without a problem.

I uploaded an example file here:
bam: https://drive.google.com/open?id=1WPWGUCi9oD-hfbvq9CqVkKTi3OG4BCl1
bai: https://drive.google.com/open?id=1zx0z7esY00Lalvg1oTW0alM7AkTSIibX

You can get the error by running:

for aln in bam_reader.fetch(region="1:14362-15362"):
    print(aln)

Thank you in advance,
Foivos

iosonofabio commented 4 years ago

This was a bug in the parser. It has been fixed on the latest master, which will be released in a few days. Until then, you can install from the GitHub master to get the fix. Closing.
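
For anyone who hits the same AttributeError on an older release: the traceback shows `.encode` being called on a value that is `None`, so one way to narrow it down is to list which fields of a read are `None`. The sketch below is an assumption-laden illustration, not the fix that went into HTSeq. `none_fields` and `FIELDS` are hypothetical names, the attribute names are those of pysam's `AlignedSegment`, and which of them the HTSeq parser actually encodes is not stated in this thread. The demo uses a stand-in object instead of a real read.

```python
from types import SimpleNamespace

# Attribute names as exposed by pysam's AlignedSegment. Which of these the
# HTSeq parser encodes is an assumption, not something stated in the thread.
FIELDS = (
    "query_name",
    "query_sequence",
    "query_qualities",
    "cigarstring",
    "reference_name",
)


def none_fields(read, fields=FIELDS):
    """Return the names in `fields` whose value on `read` is None.

    Attributes that are missing altogether are reported as well.
    """
    return [name for name in fields if getattr(read, name, None) is None]


# Stand-in for a read that carries no base qualities.
demo_read = SimpleNamespace(
    query_name="read1",
    query_sequence="ACGT",
    query_qualities=None,
    cigarstring="4M",
    reference_name="1",
)
print(none_fields(demo_read))  # → ['query_qualities']
```

With pysam installed, the same function could be applied to each read returned by `pysam.AlignmentFile(bam, "rb").fetch(region="1:14362-15362")` to see which reads in the failing region have empty fields.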