downloader.py, tarball has unnecessary path characters in uncompressed files

GoogleCodeExporter commented 8 years ago

With a tar.gz file, I usually uncompress the files into a directory, move 
all of them to the parent directory, and then make one file for pygr.seqdb.

If we try to read those files with the Python zlib module, there is a problem. 
Because of the path stored in the tar archive, the first line of each extracted 
FASTA file contains binary characters from the path! Thus, pygr does not 
recognize those files as FASTA, and Python cannot read the files produced by 
downloader.py.

Is there a Python library for tar? Otherwise, the only solution may be 
to extract the files on the command line, which may raise another platform 
independence issue.

==> apiMel3 <==
Group1.fa0000664000462000024300014446262010701502220012075 0ustar  
angieprotein>Group1

==> caeRem2 <==
chrUn.fa0000664000431100024300116610013110615666653012031 0ustar  
hiramprotein>chrUn

==> canFam2 <==
1/chr1.fa0000664000462000024300075061311410344202301013154 0ustar  
angieprotein00000000000000>chr1

==> cb3 <==
chrI.fa0000664000431100024300005367547210607775707011664 0ustar  
hiramprotein>chrI

==> ce4 <==
chrI.fa0000664000431100024300007251306210605303060011620 0ustar  
hiramprotein>chrI

==> danRer3 <==
10/chr10.fa0000664000552100024300022532764310254136063013701 0ustar  
harteraprotein00000000000000>chr10

==> danRer4 <==
1/chr1.fa0000664000552100024300042252424310422467370012300 0ustar  
harteraprotein>chr1

==> dm3 <==
chr2L.fa0000664000462000024300013142324610636536751011716 0ustar  
angieprotein>chr2L

==> droSim1 <==
chr2L.fa0000664000462000024300012557376010226542767013167 0ustar  
angieprotein00000000000000>chr2L

==> droYak2 <==
4/chr4.fa0000664000462000024300000526216210336451111013164 0ustar  
angieprotein00000000000000>chr4

==> equCab1 <==
chr1.fa0000664000460600024300127305357310565133445012012 0ustar  
fanhsuprotein>chr1

==> fr2 <==
chrM.fa0000664000431100024300000004061610564660635011635 0ustar  
hiramprotein>chrM

==> galGal3 <==
chr1.fa0000664000462000024300141604161610464745460011603 0ustar  
angieprotein>chr1

==> gasAcu1 <==
chrI.fa0000664000462000024300015552750710470675544013104 0ustar  
angieprotein00000000000000>chrI

==> mm6 <==
1/chr1.fa0100664000431100024300136712674310215371015011751 0ustar  
hiramprotein>chr1

==> mm7 <==
1/chr1.fa0100664000431100024300136634417410304660023011747 0ustar  
hiramprotein>chr1

==> mm8 <==
1/chr1.fa0000664000431100024300137663025010375140642011750 0ustar  
hiramprotein>chr1

==> mm9 <==
chr1.fa0000664000431100024300137722222310651703673011614 0ustar  
hiramprotein>chr1

==> monDom4 <==
1/chr1.fa0000664000431100024300553653211710374717613011764 0ustar  
hiramprotein>chr1

==> oryLat1 <==
chr1.fa0000664000431100024300023342162410564147371011610 0ustar  
hiramprotein>chr1

==> panTro2 <==
1/chr1.fa0000664000522300024300157665055710404735056011617 0ustar  
kateprotein>chr1

==> ponAbe2 <==
chr1.fa0000664000462000024300157654750010700310073011572 0ustar  
angieprotein>chr1

==> priPac1 <==
chrUn.fa0000664000431100024300125026220510615674630012030 0ustar  
hiramprotein>chrUn

==> rheMac2 <==
softMask/chr1.fa0000664000552100024300157010116210443356564013730 0ustar  
harteraprotein>chr1

==> rn4 <==
1/chr1.fa0000664000462000024300202234056610406567020011731 0ustar  
angieprotein>chr1

==> strPur1 <==
urchin.hardMasked.fa0100664000441500024301026415226310231150023014111 
0ustar  aampprotein>Scaffold99932

apiMel3/chromFa.tar.gz
caePb1/chromFa.tar.gz
caeRem2/chromFa.tar.gz
canFam2/chromFa.tar.gz
cb3/chromFa.tar.gz
ce4/chromFa.tar.gz
danRer3/chromFa.tar.gz
danRer4/chromFa.tar.gz
dm3/chromFa.tar.gz
droSim1/chromFa.tar.gz
droYak2/chromFa.tar.gz
equCab1/chromFa.tar.gz
fr2/chromFa.tar.gz
galGal3/chromFa.tar.gz
gasAcu1/chromFa.tar.gz
mm6/chromFa.tar.gz
mm7/chromFa.tar.gz
mm8/chromFa.tar.gz
mm9/chromFa.tar.gz
monDom4/chromFa.tar.gz
oryLat1/chromFa.tar.gz
panTro2/chromFa.tar.gz
ponAbe2/chromFa.tar.gz
priPac1/chromFa.tar.gz
rheMac2/chromFa.tar.gz
rn4/chromFa.tar.gz
strPur1/allFa.tar.gz

Original issue reported on code.google.com by deepr...@gmail.com on 14 May 2008 at 12:03

GoogleCodeExporter commented 8 years ago

What about using the 'tarfile' library? 
<http://docs.python.org/lib/module-tarfile.html>

Then you can just use the extract() or extractall() member functions 
to de-archive the wanted files.
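
A minimal sketch of that suggestion, written for Python 3 and using only the standard tarfile module. The helper name extract_archive and its arguments are made up for illustration and are not part of downloader.py. Like any extractall() call, it should only be used on archives you trust.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Unpack every member of a tar archive into dest_dir.

    Hypothetical helper for illustration; returns the extracted paths.
    """
    # "r:*" lets tarfile detect the compression (gzip, bz2 or none) itself.
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
        tar.extractall(dest_dir)
    return [os.path.join(dest_dir, name) for name in names]
```

Files unpacked this way contain only the member data, so the first line of a FASTA member should start with '>' rather than with tar header bytes.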

Original comment by bad...@gmail.com on 21 May 2008 at 4:15

GoogleCodeExporter commented 8 years ago

Hi Namshin,
I'm not sure I understand exactly what you mean.  Is the problem 
1. how to get the contents out of a tar archive file?

or is the problem 

2. how to extract valid FASTA sequence from mis-formatted files AFTER they have
been successfully extracted from a tar archive?

Since downloader.py does use the Python tarfile module to extract tar archives,
I assumed that the problem must be #2, but after reading your comment above I'm
not so sure.  Those "extra characters" look like what you'd see in a tar archive
header...

If the problem is #1, it should be easy to fix -- we have the tools for
extracting a tar archive!  Currently downloader.py should automatically untar
any file that ends in .tar, .tgz, .tar.gz, .tar.bz2.  If you have a case where a
tar archive is not being untar'ed properly, please give us both
- URL for the download file that fails to untar properly
- stacktrace showing error message if any

Also, downloader.py does not use the zlib module, so I don't understand what
you mean by "If we want to read that files in python zlib module, there is a
problem".  Please explain.

Thanks!

Chris

Original comment by cjlee...@gmail.com on 21 May 2008 at 11:50

GoogleCodeExporter commented 8 years ago

Hi Chris,

It is #2. You can log in to biodb.bioinformatics.ucla.edu and 
check the /Users/deepreds/projects/test directory. Those are the output files 
generated by my downloader script in /Users/deepreds/projects/src. As you can 
see, some of the .zip files were not deleted. And if you look at the first line 
of mm8 and mm9, you can see what is going on in the output files produced by 
downloader.py.

Yours,
Namshin Kim

Original comment by deepr...@gmail.com on 22 May 2008 at 12:08

GoogleCodeExporter commented 8 years ago

Hi Namshin,
did you use the singleFile=True option, which instructs the downloader to
extract all the data to a single file, as would be required for a FASTA
database?

e.g.
s = SourceURL('ftp://hgdownload.cse.ucsc.edu/goldenPath/anoGam1/bigZips/chromFa.zip',
              filename='anoGam1.zip', singleFile=True)
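
For readers unfamiliar with the idea, here is a rough Python 3 sketch of what "extract all the data to a single file" can look like with the standard tarfile module. It is an assumption-laden illustration, not the code behind singleFile=True in downloader.py; the function name concat_tar_members is hypothetical.

```python
import shutil
import tarfile


def concat_tar_members(archive_path, out_path):
    """Append the data of every regular file in a tar archive to out_path.

    Hypothetical helper for illustration only.
    """
    with tarfile.open(archive_path, "r:*") as tar, open(out_path, "wb") as out:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            # extractfile() yields only the member's data, never the
            # 512-byte tar header that precedes it in the archive.
            shutil.copyfileobj(tar.extractfile(member), out)
    return out_path
```

Each member is copied as soon as it is reached, so the archive is read front to back in a single pass.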

-- Chris

Original comment by cjlee...@gmail.com on 22 May 2008 at 3:16

GoogleCodeExporter commented 8 years ago

Actually, this whole issue was due to tarfile.read() bombing due to using the
wrong mode ('r|gz' instead of 'r:gz').  I didn't realize tarfile.open() had two
different sets of modes, listed in two different tables in the documentation!
Specifically, tarfile.read() was crashing like this:
>>> filepath = downloader.uncompress_file('chromFa.tar.gz', singleFile=True)
untarring chromFa.tar.gz...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/leec/projects/pygr/build/lib.macosx-10.5-i386-2.5/pygr/downloader.py", line 94, in uncompress_file
    return do_untar(filepath,mode='r|gz',newpath=filepath[:-7],**kwargs)
  File "/Users/leec/projects/pygr/build/lib.macosx-10.5-i386-2.5/pygr/downloader.py", line 69, in do_untar
    copy_to_file(f,ifile)
  File "/Users/leec/projects/pygr/build/lib.macosx-10.5-i386-2.5/pygr/downloader.py", line 11, in copy_to_file
    s = f.read(blocksize)
  File "/sw/lib/python2.5/tarfile.py", line 748, in read
    buf += self.fileobj.read(size - len(buf))
  File "/sw/lib/python2.5/tarfile.py", line 666, in read
    return self.readnormal(size)
  File "/sw/lib/python2.5/tarfile.py", line 673, in readnormal
    self.fileobj.seek(self.offset + self.position)
  File "/sw/lib/python2.5/tarfile.py", line 487, in seek
    raise StreamError("seeking backwards is not allowed")
tarfile.StreamError: seeking backwards is not allowed
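
The difference between the two mode families can be reproduced in isolation. The sketch below is for Python 3, uses only the standard library, and works on a small in-memory archive; the helper names are hypothetical and it is not the code in downloader.py. With 'r|gz' the archive is a forward-only stream, so reading a member after the archive has already been scanned past it fails, while 'r:gz' allows seeking back.

```python
import io
import tarfile


def make_tar_gz(files):
    """Build a small .tar.gz in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_members_after_listing(archive_bytes, mode):
    """List all members first, then go back and read each one."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode=mode) as tar:
        members = tar.getmembers()  # scans the archive to its end
        # Reading the members now requires seeking backwards.
        return [tar.extractfile(m).read() for m in members]


archive = make_tar_gz({"a.fa": b">a\nACGT\n", "b.fa": b">b\nTTTT\n"})
seekable_result = read_members_after_listing(archive, "r:gz")  # works

try:
    read_members_after_listing(archive, "r|gz")
    stream_failed = False
except tarfile.StreamError:
    # Same exception class as in the traceback above.
    stream_failed = True
```

Stream mode is still usable when each member is consumed in order as it is encountered; it only breaks when the code needs to go back.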

I'm not sure why Namshin missed this error message.  In general, when pygr.Data
fails to load a resource, try loading it via pygr.Data.getResource(name,
debug=True), which will raise all exceptions rather than hiding KeyError and
IOError.  Those two are normally hidden because they are signals that a given
resource database cannot provide this resource (so it just goes on to try the
next resource database).  However, tarfile.StreamError is not a subclass of
KeyError or IOError, so it should have been raised no matter what.
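
To make that behaviour concrete, here is a toy Python 3 sketch of the lookup pattern described above. It is only an illustration under assumptions: get_resource and the list of databases are hypothetical names, not pygr.Data internals.

```python
import tarfile


def get_resource(name, databases, debug=False):
    """Try each resource database in turn (hypothetical illustration)."""
    for db in databases:
        try:
            return db[name]
        except (KeyError, IOError):
            # Treated as "this database cannot provide the resource".
            if debug:
                raise  # debug=True surfaces the error instead of hiding it
            # otherwise fall through and try the next database
    raise KeyError(name)


class BrokenDB:
    """Stand-in for a database whose load fails with a tarfile error."""

    def __getitem__(self, name):
        raise tarfile.StreamError("seeking backwards is not allowed")
```

Because tarfile.StreamError derives from tarfile.TarError and not from KeyError or IOError, the except clause does not catch it, so it propagates whether or not debug is set.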

You can also test it outside of pygr.Data like this:

from pygr import downloader
import pickle
src = downloader.SourceURL('http://biodb.bioinformatics.ucla.edu/GENOMES/apiMel3/chromFa.tar.gz',
                           'apiMel3.tgz', singleFile=True)
s = pickle.dumps(src)
filepath = pickle.loads(s) # this triggers the download and uncompress
from pygr import seqdb
db = seqdb.BlastDB(filepath)
s = db['Group1']
print len(s) # 25854376
print str(s[:10]) # 'agcctaaccc'

I pushed the fix to the public git repository.

Original comment by cjlee...@gmail.com on 22 May 2008 at 3:47

Changed state: Fixed

GoogleCodeExporter commented 8 years ago

I may have missed the error message because the download status messages are 
TOO LONG. It prints one line per 0.1% of progress. Oops...

Original comment by deepr...@gmail.com on 22 May 2008 at 5:01

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 21 Feb 2009 at 1:28

Changed state: FixedNeedsReview

GoogleCodeExporter commented 8 years ago

Hi Namshin,
please verify the fix to this bug that you reported, and then change its status
to Closed.  We are now requiring that each fix be verified by someone other than
the developer who made the fix.

Thanks!

Chris

Original comment by cjlee...@gmail.com on 4 Mar 2009 at 8:49

GoogleCodeExporter commented 8 years ago

Original comment by mare...@gmail.com on 13 Mar 2009 at 12:52

Added labels: reviewby-deepreds

GoogleCodeExporter commented 8 years ago

Original comment by deepr...@gmail.com on 22 Mar 2009 at 9:32

Changed state: Closed

rpatkennyiii / pygr

downloader.py, tarball has unnecessary path characters in uncompressed files #4