Open GoogleCodeExporter opened 8 years ago
This issue relates to v0.9. I found that setting writeback to True or False
does not save any time. It would be much better to just remove the bsddb
saving. Right now, if somebody wants to use Pygr as an annotation tool for a
resequencing project on a big genome such as human, it would take several
weeks just to build the NLMSA, simply because saving to bsddb takes so long!
Original comment by deepr...@gmail.com
on 13 May 2009 at 12:20
Extremely crude idea: write 1 character holding the size of sequence + score
(as an ordinal), then write the sequence + score itself.
outfile1 = open('test.ifa', 'w')  # length-prefixed sequence + score records
outfile2 = open('test.pfa', 'w')  # accession -> offset into test.ifa
infile = open('test.fq', 'r')
iCount = 0  # offset of the next record in test.ifa
while 1:
    line1 = infile.readline()
    line2 = infile.readline()
    line3 = infile.readline()
    line4 = infile.readline()
    if line1 == '' or line2 == '' or line3 == '' or line4 == '': break
    myacc = line1[1:].strip()
    myseq = '%s%s' % (line2.strip(), line4.strip())
    seqsize = chr(len(myseq))  # one char, so sequence + score must be <= 255 chars
    outfile1.write('%s%s' % (seqsize, myseq))
    outfile2.write('%s\t%d\n' % (myacc, iCount))
    iCount += 1 + len(myseq)
infile.close()
outfile1.close()
outfile2.close()
Reading takes two short .seek and .read operations: one to read the size of
the sequence, and the other to read the sequence + score. If we know the
integer sequence ID (the position in the file, which we can jump to with
.seek), we can read the sequence and score.
infile = open('test.ifa', 'r')
for lines in open('test.pfa', 'r').xreadlines():
    oldacc, intacc = lines.strip().split('\t')
    intacc = int(intacc)
    infile.seek(intacc)
    seqsize = ord(infile.read(1))
    infile.seek(intacc + 1)
    readseq = infile.read(seqsize)
    myseq, myscore = readseq[:seqsize/2], readseq[seqsize/2:]
    print oldacc, intacc, myseq, myscore
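For anyone trying this on a current interpreter, here is a minimal Python 3 sketch of the same length-prefix idea. It is not part of Pygr. The names build_index, fetch and LEN_FMT are made up for illustration, and the 2-byte prefix written with struct is an assumption swapped in for the single character above, so that sequence + score can be longer than 255 characters. The data file is opened in binary mode so that offsets are plain byte positions.

```python
import struct

# Hypothetical 2-byte big-endian length prefix (records up to 65535 bytes).
LEN_FMT = ">H"
LEN_SIZE = struct.calcsize(LEN_FMT)


def build_index(fastq_path, data_path, index_path):
    """Write length-prefixed seq+score records plus a name -> offset index."""
    offset = 0  # byte offset of the next record in the data file
    with open(fastq_path, "r") as fq, \
            open(data_path, "wb") as data, \
            open(index_path, "w") as idx:
        while True:
            header = fq.readline()
            seq = fq.readline()
            fq.readline()  # the '+' separator line is not stored
            qual = fq.readline()
            if not qual:  # end of file, or an incomplete trailing record
                break
            name = header[1:].strip()
            record = (seq.strip() + qual.strip()).encode("ascii")
            data.write(struct.pack(LEN_FMT, len(record)))
            data.write(record)
            idx.write("%s\t%d\n" % (name, offset))
            offset += LEN_SIZE + len(record)


def fetch(data, offset):
    """Return (sequence, score) for the record starting at the given offset."""
    data.seek(offset)
    (size,) = struct.unpack(LEN_FMT, data.read(LEN_SIZE))
    record = data.read(size).decode("ascii")
    half = size // 2  # sequence and score strings have equal length
    return record[:half], record[half:]
```

Usage would be build_index('test.fq', 'test.ifa', 'test.pfa'), then opening test.ifa with mode 'rb' and calling fetch(f, offset) with an offset taken from test.pfa.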
Let me know what you think.
Original comment by deepr...@gmail.com
on 14 May 2009 at 2:44
Original comment by mare...@gmail.com
on 20 May 2009 at 11:05
See also the 'screed' package.
Original comment by the.good...@gmail.com
on 2 Sep 2009 at 2:00
Original issue reported on code.google.com by
deepr...@gmail.com
on 9 May 2009 at 4:48