Need for solexa seqdb with integer (64bit) ID which requires no hashing/indexing

GoogleCodeExporter commented 8 years ago

A few years ago, memory was expensive and most Linux machines did not have much
of it. To reduce memory usage, all shelve files are therefore opened with
"writeback=False". The problem is that generating the shelve files takes much
longer than building the NLMSA itself.

If we have a large-memory machine, we can save a lot of time just by changing
writeback=False to writeback=True. Everything is then done in memory and saved
once, when the shelve is closed.

I propose a unified option for changing the writeback setting from False (the
default) to True.
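
For reference, here is a minimal sketch of what such a unified option could look like and what the flag changes in the standard-library shelve module. The helper name open_index is hypothetical and not part of Pygr; only shelve.open and its writeback parameter are real. It shows only the caching semantics, not any timing difference.

```python
import os
import shelve
import tempfile


def open_index(path, writeback=False):
    """Hypothetical single entry point for opening a shelve index."""
    return shelve.open(path, flag='c', writeback=writeback)


def append_in_place(path, writeback):
    """Mutate a stored list in place and return what ended up on disk."""
    db = open_index(path, writeback=writeback)
    db['reads'] = ['r1']
    db['reads'].append('r2')  # in-place change to a fetched object
    db.close()                # with writeback=True the cache is flushed here
    db = open_index(path)
    try:
        return list(db['reads'])
    finally:
        db.close()


tmpdir = tempfile.mkdtemp()
without_cache = append_in_place(os.path.join(tmpdir, 'a'), writeback=False)
with_cache = append_in_place(os.path.join(tmpdir, 'b'), writeback=True)
```

With writeback=False every access unpickles a fresh copy, so the in-place append is lost; with writeback=True accessed entries stay cached in memory and are written back on close.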

Original issue reported on code.google.com by deepr...@gmail.com on 9 May 2009 at 4:48

GoogleCodeExporter commented 8 years ago

This issue relates to v0.9. I found that switching writeback between True and
False does not save time. It would be much better to just remove the bsddb
saving. As it stands, if somebody wants to use Pygr as an annotation tool for a
resequencing project on a big genome such as human, it would take several weeks
just to build the NLMSA, simply because saving to bsddb takes so long!

Original comment by deepr...@gmail.com on 13 May 2009 at 12:20

GoogleCodeExporter commented 8 years ago

Extremely crude idea: 1 character holding the size of sequence + score (as an
ordinal), followed by the sequence + score themselves.

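# Convert a 4-line-per-record FASTQ file into a flat binary file (test.ifa)
# plus a table of read name -> byte offset (test.pfa).
with open('test.fq', 'rb') as infile, \
     open('test.ifa', 'wb') as outfile1, \
     open('test.pfa', 'wb') as outfile2:
    iCount = 0  # byte offset of the next record in test.ifa
    while True:
        line1 = infile.readline()  # '@' + read name
        line2 = infile.readline()  # sequence
        line3 = infile.readline()  # '+' separator, ignored
        line4 = infile.readline()  # quality scores
        if not (line1 and line2 and line3 and line4):
            break
        myacc = line1[1:].strip()
        myseq = line2.strip() + line4.strip()
        if len(myseq) > 255:
            raise ValueError('record too long for a 1-byte size prefix')
        outfile1.write(bytearray([len(myseq)]) + myseq)
        outfile2.write(myacc + b'\t' + str(iCount).encode('ascii') + b'\n')
        iCount += 1 + len(myseq)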

Reading takes two short .seek and .read operations: one to read the size of the
record and the other to read the sequence + score. If we know the integer
sequence ID (the position in the file, which we can jump to with .seek), we can
read the sequence and score directly.

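# Look up every record by its integer ID (byte offset into test.ifa).
with open('test.ifa', 'rb') as infile, open('test.pfa', 'rb') as posfile:
    for line in posfile:
        oldacc, intacc = line.strip().split(b'\t')
        intacc = int(intacc)
        infile.seek(intacc)
        seqsize = ord(infile.read(1))  # 1-byte size prefix
        infile.seek(intacc + 1)        # start of sequence + score
        readseq = infile.read(seqsize)
        half = seqsize // 2            # sequence and score have equal length
        myseq, myscore = readseq[:half], readseq[half:]
        print('%s %d %s %s' % (oldacc.decode(), intacc,
                               myseq.decode(), myscore.decode()))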
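
The same layout can be tried without touching the disk. This is only a sketch of the idea above, using an in-memory buffer; the function names write_record and read_record are made up for illustration and are not Pygr API. The integer ID is simply the byte offset where the record starts.

```python
import io
import struct


def write_record(out, seq, qual):
    """Append one length-prefixed record; return its integer ID (offset)."""
    out.seek(0, io.SEEK_END)
    offset = out.tell()
    body = seq + qual
    # 1-byte size prefix; struct.pack raises struct.error above 255
    out.write(struct.pack('B', len(body)))
    out.write(body)
    return offset


def read_record(inp, offset):
    """Fetch (sequence, score) directly from the integer ID, no index needed."""
    inp.seek(offset)
    (size,) = struct.unpack('B', inp.read(1))
    body = inp.read(size)
    half = size // 2  # sequence and score have equal length
    return body[:half], body[half:]


store = io.BytesIO()
ids = [write_record(store, b'ACGT', b'IIII'),
       write_record(store, b'GGCCA', b'HHHHH')]
second = read_record(store, ids[1])
```

Because the ID is a file offset, it fits in a 64-bit integer for any realistic file size, and a lookup costs one seek and two reads. The 1-byte prefix caps sequence + score at 255 bytes, which is an assumption that holds only for short reads.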

Let me know what you think.

Original comment by deepr...@gmail.com on 14 May 2009 at 2:44

Changed title: Need for solexa seqdb with integer (64bit) ID which requires no hashing/indexing

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 20 May 2009 at 11:05

Added labels: Milestone-Release0.9

GoogleCodeExporter commented 8 years ago

See also the 'screed' package.

Original comment by the.good...@gmail.com on 2 Sep 2009 at 2:00
