quachtina96 / pysam

Automatically exported from code.google.com/p/pysam

Fetching alignments from BAM files with a large number of reference sequences is prohibitively slow #118

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?

1. Open a BAM file with many reference sequences and call fetch on the object:
import pysam

track = pysam.Samfile(fname, "rb")  # fname: path to the BAM file
for aln in track.fetch():
    pass  # Do something with each alignment

What is the expected output? What do you see instead?

I expect a quick response. Instead, it takes a very long time before the 
iterator object is created. This is likely a pysam-specific problem, as 
samtools works fine (i.e. without noticeable delay) on these BAM files.

What version of the product are you using? On what operating system?

pysam 0.7.4 on Gentoo Linux.

Please provide any additional information below.

The BAM files contain a header with ~450,000 reference sequences.

Original issue reported on code.google.com by simon.va...@gmail.com on 27 Mar 2013 at 4:43

GoogleCodeExporter commented 9 years ago
Thanks!

Have you tried track.fetch(until_eof=True)?

Without until_eof, pysam iterates over all aligned reads in the order of the 
reference sequences as they are defined in the BAM file. With many reference 
sequences this requires a lot of jumping around in the file.

With until_eof=True, pysam simply iterates from the current file position to 
the end of the file. I have added an FAQ entry on this.
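
A minimal sketch of the until_eof approach, assuming an already opened 
pysam.Samfile handle like the one in the report. count_alignments is a 
hypothetical helper, not part of pysam; it only illustrates streaming the file 
from the current position instead of visiting each reference in turn.

```python
def count_alignments(track):
    """Count the reads in an opened BAM handle by streaming it to the end.

    `track` is assumed to be an opened pysam.Samfile. This helper is
    hypothetical and exists only for illustration.
    """
    n = 0
    # until_eof=True iterates from the current file position to the end of
    # the file, so pysam does not have to jump to each reference sequence.
    for _aln in track.fetch(until_eof=True):
        n += 1
    return n


# Usage, assuming pysam is installed and fname is the path to the BAM file:
#   import pysam
#   track = pysam.Samfile(fname, "rb")
#   print(count_alignments(track))
#   track.close()
```

Note that until_eof=True also returns unmapped reads, so the count can differ 
from what a plain fetch() yields.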

Best wishes,
Andreas

Original comment by andreas....@gmail.com on 27 Jun 2013 at 1:55

GoogleCodeExporter commented 9 years ago
Thanks Andreas!

I actually didn't get the notice that there was a reply, so I forgot about this 
issue. I worked around it by removing a load of (very small) reference 
sequences. But I just did a quick test and the until_eof flag seems to work. 
Great.

Simon

Original comment by simon.va...@gmail.com on 28 Aug 2013 at 10:19