lpinner / metageta

Metadata Gathering, Extraction and Transformation Application - Unmaintained

Error on compressed file #53

alexlopespereira closed 8 years ago

alexlopespereira commented 8 years ago

Hi @lpinner, I ran the following command using metageta 1.4.0 and it returned an error. With metageta 1.3.9, no such error appears when testing on the same images. I could delete the erroneous file, but I don't know which file it is, because the filename was not written to the log file. Could you verify this and fix it if necessary, or at least handle the exception so that it does not stop the crawler?

Thank you very much, Alex

[alex@srvindice ~]$ runcrawler --debug -a -r -d /media/imagens/ -x indice
Unable to import Tix, tkFileDialog and/or tkMessageBox
13:46:37 DEBUG /usr/bin/runcrawler --debug -a -r -d /media/imagens/ -x indice
13:46:37 WARNING Use formatting objects such as font directly
Warning 6: Normalized/laundered field name: 'datecreated' to 'datecreate'
Warning 6: Normalized/laundered field name: 'datemodified' to 'datemodifi'
Warning 1: Field filepath of width 255 truncated to 254.
13:46:37 INFO Searching for files...
Traceback (most recent call last):
  File "/usr/bin/runcrawler", line 9, in <module>
    load_entry_point('MetaGETA==1.4.0', 'console_scripts', 'runcrawler')()
  File "/usr/lib/python2.7/site-packages/MetaGETA-1.4.0-py2.7.egg/metageta/runcrawler.py", line 430, in main
    optvals.recurse,optvals.archive,optvals.excludes)
  File "/usr/lib/python2.7/site-packages/MetaGETA-1.4.0-py2.7.egg/metageta/runcrawler.py", line 135, in execute
    Crawler=crawler.Crawler(dir,recurse=recurse,archive=archive,excludes=excludes)
  File "/usr/lib/python2.7/site-packages/MetaGETA-1.4.0-py2.7.egg/metageta/crawler.py", line 63, in __init__
    recurse=recurse, archive=archive, excludes=excludes):
  File "/usr/lib/python2.7/site-packages/MetaGETA-1.4.0-py2.7.egg/metageta/utilities.py", line 552, in rglob
    paths = archivelist(fullname)
  File "/usr/lib/python2.7/site-packages/MetaGETA-1.4.0-py2.7.egg/metageta/utilities.py", line 85, in archivelist
    lst=[ti.name for ti in tarfile.open(f,'r').getmembers() if ti.isfile()]
  File "/usr/lib64/python2.7/tarfile.py", line 1805, in getmembers
    self._load() # all members, we first have to
  File "/usr/lib64/python2.7/tarfile.py", line 2380, in _load
    tarinfo = self.next()
  File "/usr/lib64/python2.7/tarfile.py", line 2315, in next
    self.fileobj.seek(self.offset)
  File "/usr/lib64/python2.7/gzip.py", line 434, in seek
    self.read(1024)
  File "/usr/lib64/python2.7/gzip.py", line 261, in read
    self._read(readsize)
  File "/usr/lib64/python2.7/gzip.py", line 308, in _read
    self._read_eof()
  File "/usr/lib64/python2.7/gzip.py", line 347, in _read_eof
    hex(self.crc)))
IOError: CRC check failed 0x1d823b02 != 0x1bcc2669L

lpinner commented 8 years ago

Thanks for the report. It looks like you have a corrupt archive. I'll handle and log those sorts of exceptions.
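For readers following along, handling and logging such an exception might look roughly like the sketch below. It is not the actual MetaGETA fix: `safe_archivelist`, the logger name, and the set of exceptions caught are all assumptions made for illustration, and it is written for Python 3 rather than the 2.7 shown in the traceback.

```python
import logging
import tarfile
import zlib

logger = logging.getLogger("crawler_sketch")  # hypothetical logger name

# Assumed set of errors a damaged archive may raise while being listed.
# In Python 3, IOError is an alias of OSError.
ARCHIVE_ERRORS = (tarfile.TarError, OSError, EOFError, zlib.error)


def safe_archivelist(path):
    """Return the names of regular files in a tar archive, or [] if unreadable."""
    try:
        with tarfile.open(path, "r") as tar:
            return [ti.name for ti in tar.getmembers() if ti.isfile()]
    except ARCHIVE_ERRORS as err:
        # Log the offending file name so it can be located and removed later.
        logger.warning("Skipping unreadable archive %s: %s", path, err)
        return []
```

Returning an empty list lets the caller keep crawling, and the warning records which file was skipped, which is the information missing from the original log.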

lpinner commented 8 years ago

I've committed a fix to the develop branch if you'd like to test.

I have been unable to replicate the error, because every corrupt gz file I created and tested failed the tarfile.is_tarfile() test and so never reached the archivelist() function.
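One possible explanation, not confirmed in this thread, is that tarfile.is_tarfile() only opens the archive and reads its first header, so damage further into the gzip stream can pass that check and surface later in getmembers(). The sketch below illustrates this under stated assumptions: it uses Python 3 and a truncated archive rather than the CRC mismatch reported above, and `make_truncated_targz` and `listing_fails` are hypothetical helper names.

```python
import io
import os
import tarfile
import tempfile


def make_truncated_targz(directory):
    """Write a valid .tar.gz, then a copy cut off halfway through the gzip stream."""
    good = os.path.join(directory, "good.tar.gz")
    bad = os.path.join(directory, "truncated.tar.gz")
    payload = os.urandom(1 << 20)  # 1 MiB of incompressible data
    with tarfile.open(good, "w:gz") as tar:
        info = tarfile.TarInfo("payload.bin")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    with open(good, "rb") as src, open(bad, "wb") as dst:
        raw = src.read()
        dst.write(raw[: len(raw) // 2])  # keep only the first half
    return bad


def listing_fails(path):
    """Return True if reading the full member list raises any error."""
    try:
        with tarfile.open(path, "r") as tar:
            tar.getmembers()
    except Exception:
        return True
    return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bad = make_truncated_targz(tmp)
        print("is_tarfile:", tarfile.is_tarfile(bad))
        print("listing fails:", listing_fails(bad))
```

An archive like this has an intact first header, so a pre-check with is_tarfile() alone would not be expected to keep it away from the code that lists its members.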

alexlopespereira commented 8 years ago

@lpinner I tested this new version and it works as expected. It handles the exception and continues crawling.

Thank you very much. Alex