pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

ValueError: could not create iterator for region 'himom' #1263

Open dbolser opened 7 months ago

dbolser commented 7 months ago

Not sure if this has been discussed before, but the following was unexpected for me:

tbx = pysam.TabixFile("whatever.tab.gz")

tbx.contigs
['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', ...]

cons_tbx.fetch('himom')
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pysam/libctabix.pyx", line 507, in pysam.libctabix.TabixFile.fetch
ValueError: could not create iterator for region 'himom'

This is a surprise because:

$ tabix whatever.tab.gz 1:1000000-1100000 > /dev/null && echo 'kk'
kk

$ tabix whatever.tab.gz himom:1000000-1100000 > /dev/null && echo 'kk'
kk

and:

tbx = pysam.TabixFile("whatever.tab.gz")

len(list(tbx.fetch("1:1000000-1100000")))
100001

len(list(tbx.fetch("1:99991000000-99991100000")))
0

len(list(tbx.fetch("himom:99991000000-99991100000")))
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "pysam/libctabix.pyx", line 507, in pysam.libctabix.TabixFile.fetch
ValueError: could not create iterator for region 'himom:99991000000-99991100000'

If my code handles zero results from a given range, then why should it fall over just because a contig isn't found? The cli version doesn't fall over...

I know this is minor, just thought I'd make sure it had been considered properly...

dbolser commented 7 months ago

I just found another example of this that's a bit more serious:

$ tabix genes-sorted.tsv.gz 1:100001-120001 && echo "OK"
1       89295   133566  ENSG00000238009 ENST00000453576 
1       89295   133566  ENSG00000238009 ENST00000466430 
1       89295   133566  ENSG00000238009 ENST00000471248 
1       89295   133566  ENSG00000238009 ENST00000477740 
OK

$ tabix genes-sorted.tsv.gz MT:-100001-120001 && echo "OK"
MT      577     647     ENSG00000210049 ENST00000387314         MT-TF
MT      648     1601    ENSG00000211459 ENST00000389680         MT-RNR1
...
MT      14747   15887   ENSG00000198727 ENST00000361789 ENSP00000354554 MT-CYB
MT      15888   15953   ENSG00000210195 ENST00000387460         MT-TT
MT      15956   16023   ENSG00000210196 ENST00000387461         MT-TP
OK

But:

gene_tbx = pysam.TabixFile("genes-sorted.tsv.gz")

for gene in gene_tbx.fetch("1", 100000, 120000):
    print(gene)

for gene in gene_tbx.fetch("MT", -10000, 10000):
    print(gene)

Gives:

1       89295   133566  ENSG00000238009 ENST00000453576 
1       89295   133566  ENSG00000238009 ENST00000466430 
1       89295   133566  ENSG00000238009 ENST00000471248 
1       89295   133566  ENSG00000238009 ENST00000477740 
Traceback (most recent call last):
  File "/home/danbolser/Work/IUK1/Integration/test_pysam.py", line 11, in <module>
    for gene in gene_tbx.fetch("MT", -10000, 10000):
  File "pysam/libctabix.pyx", line 456, in pysam.libctabix.TabixFile.fetch
ValueError: start out of range (10000)