pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

Reset the iterator fetch() returned. #902

Open yangyxt opened 4 years ago

yangyxt commented 4 years ago

Dear @pysam-developers, As introduced in FAQ, fetch() returned an iterator upon calling. Given the performance optimization, multiple_iterators is not recommended. Plus the BAM file I'm trying to iterate through contains a large number of reference sequences. Thus, I need to use until_eof=True to iterate through BAM files.

The issue is, I would like to reset the iterator after every thorough iteration. But I didn't find an available approach to realize this. I tried to use: iterator=bam.fetch(until_eof=True) for align in iterator: xxxx iterator.seek(0) I got the error instead:

AttributeError: 'pysam.libcalignmentfile.IteratorRowAll' object has no attribute 'seek'

I also tried to use: for align in bam.fetch(until_eof=True) In this case, I did not reset the iterator, and next time I would get 0 iterations when calling bam.fetch(until_eof=True).

If available, could you pls share an available approach in resetting the iterator returned with fetch(until_eof=True). Thanks!

arogozhnikov commented 4 years ago

double this issue. So far the only way I know is to open the file again

yangyxt commented 4 years ago

double this issue. So far the only way I know is to open the file again

I just found a way that might solve this issue. If we want to use the same one iterator every time, we can use bam.reset() command after each iteration to make the iterator point back to the beginning of the file.

Brunox13 commented 3 years ago

Not exactly sure about your specific needs here but could you simply iterate over the AlignmentFile() without using fetch()? As in:

bam = pysam.AlignmentFile('path/to/file.bam', 'rb')
for read in bam:
    ...

If you need to iterate over the entire bam file each time (without using specific coordinates or only looking at aligned reads), this should work without needing to open/close the file in between.

Brunox13 commented 3 years ago

Not exactly sure about your specific needs here but could you simply iterate over the AlignmentFile() without using fetch()? As in:

bam = pysam.AlignmentFile('path/to/file.bam', 'rb')
for read in bam:
    ...

If you need to iterate over the entire bam file each time (without using specific coordinates or only looking at aligned reads), this should work without needing to open/close the file in between.

^This is actually wrong - I just tested this and using simply

for read in bam:

after having used

for read in bam.fetch(until_eof = True):

somewhat surprisingly seems to yield similar results to just doing fetch() repeatedly. Therefore, I agree that currently, the best way to "reset" the iterator over the bam file is to close & re-open it again. (Which I also tested.)

EDIT: This is actually also mentioned in issue #66.

arogozhnikov commented 3 years ago

It's just somewhat wrong behavior that iterator is not instantiated each time. There could be some internal limitations to create two parallel iterators, in this case it would be fine jsut to raise an exception. But when someone calls fetch, there should be no implicit assumption that no iteration over the file was done with this description