It has nothing to do with BlastDB building. In step 2 above, please build the BlastDB first and open it with seqdb.BlastDB.
Original comment by deepr...@gmail.com on 14 Oct 2008 at 11:28
bsddb.btopen('R1.seqlen', 'r')

bsddb.btopen has the problem: whenever I tried to open R1.seqlen, it preloaded all of the indexes into memory.
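For anyone who wants to reproduce this, here is a minimal sketch (my own, not from the original report) that makes the preloading visible by watching the process's peak RSS. It assumes Python 2, a Unix resource module, and an existing R1.seqlen index:

    import bsddb
    import resource

    def peak_rss_mb():
        # ru_maxrss is reported in kilobytes on Linux (bytes on OS X)
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

    print 'after import:     %.1f MB' % peak_rss_mb()
    db = bsddb.btopen('R1.seqlen', 'r')
    print 'after btopen:     %.1f MB' % peak_rss_mb()
    it = iter(db)
    seqID = it.next()  # the surge happens here, not at btopen() or iter()
    print 'after first item: %.1f MB' % peak_rss_mb()
    del it  # releasing the iterator frees the memory (ru_maxrss is a
            # high-water mark, so it will not show the drop)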
Original comment by deepr...@gmail.com on 21 Oct 2008 at 2:26
Hmm, in my initial test, the time for indexing a file is the same in the current version and the August 8 version (git commit 11e3814). In both cases, it took 30 sec (on my MacBook Pro) to index a file of 1 million sequences.

One difference that I see between the older version and the new version is that at the very end of the indexing process I see memory usage expanding rapidly (from around 5 MB to at least 35 MB), then quickly dropping back down to baseline (5 MB). In the older version I didn't see any such memory usage surge. If we extrapolate from 30 MB for 1 million sequences, your case of 50 million sequences might take 1.5 GB, which could easily send the machine into swap hell and make the process take much longer than it should. So this seems to fit with what you reported...
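(Spelling the extrapolation out: a ~30 MB surge per 1 million sequences, scaled by 50, gives 30 MB × 50 = 1,500 MB ≈ 1.5 GB of resident memory just to hold the key index.)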
OK, I now understand the problem. The bsddb module btree index is screwing us over: when you simply ask for an iterator, it apparently loads the entire index into memory. Anyway, just doing the following causes the 30 MB increase in memory usage I mentioned above:

>>> s2 = classutil.open_shelve('R1.seqlen', 'r')
>>> it = iter(s2)
>>> seqID = it.next()

The memory increase happens when you ask the iterator for the first item, and the memory isn't released until the iterator is garbage collected.
The reason this problem was NOT present in earlier versions of Pygr is that we used to have a function read_fasta_one_line() that just read the first sequence line of the FASTA file. BlastDB.set_seqtype() used that function to read a line of sequence, and then to infer whether the sequence is protein or nucleotide.
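For context, here is a rough sketch of what that approach could look like. The function names come from this thread, but the bodies below are my guess (hypothetical), not Pygr's actual code:

    def read_fasta_one_line(filepath):
        # Return the first sequence line of a FASTA file, skipping '>'
        # headers -- reads a constant amount of data regardless of file size.
        for line in open(filepath):
            line = line.strip()
            if line and not line.startswith('>'):
                return line
        return ''

    def guess_seqtype(seq):
        # Crude alphabet test (hypothetical): if nearly every character
        # is a nucleotide code, call it nucleotide, otherwise protein.
        seq = seq.upper()
        nuc = sum(seq.count(c) for c in 'ACGTUN')
        if len(seq) and nuc >= 0.9 * len(seq):
            return 'nucleotide'
        return 'protein'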
When we made seqdb more modular (created the SequenceDB class), I got rid of read_fasta_one_line() as being too limited (it only works on FASTA format), and switched to just getting the first sequence by getting an iterator on the sequence database. Now we discover that bsddb iterators act more like keys() (i.e. they read the entire index into memory) than like an iterator... They are NOT scalable!!!!
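(As an aside: if all you need is a single record, the bsddb cursor methods may sidestep the problem. A sketch, assuming .first() really does fetch one (key, value) pair via a cursor instead of materializing the key list:

    import bsddb
    db = bsddb.btopen('R1.seqlen', 'r')
    seqID, value = db.first()  # one record via a cursor, not the whole index
    db.close()
)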
You claim that the older version of Pygr can index a file of 50 million sequences in 1 sec. I guess that might be possible, but it seems much faster than I'd expect. Are you sure that you tested indexing of the file, as opposed to just opening an index that has already been constructed?
Original comment by cjlee...@gmail.com on 21 Oct 2008 at 2:40
I switched back to using read_fasta_one_line() to avoid using the bsddb iterator for the initial set_seqtype().
Original comment by cjlee...@gmail.com on 21 Oct 2008 at 3:01
Original comment by mare...@gmail.com on 21 Feb 2009 at 2:06
Hi Namshin,

please verify the fix to this bug that you reported, and then change its status to Closed. We are now requiring that each fix be verified by someone other than the developer who made the fix.

Thanks!
Chris
Original comment by cjlee...@gmail.com on 5 Mar 2009 at 12:05
Original comment by deepr...@gmail.com on 6 Mar 2009 at 1:55
Original issue reported on code.google.com by deepr...@gmail.com on 14 Oct 2008 at 11:22