inspirehep / plotextractor

Extract images and captions from TeX files in a tar archive.
GNU General Public License v2.0

Ignore MacOS hidden Metadata files #12

Closed kaplun closed 4 years ago

kaplun commented 8 years ago

See: ahem. @david-caro ?

david-caro commented 8 years ago

Tarballs with the record contents sometimes include hidden metadata files created by macOS. These files have the same extension as the files they store metadata for (for example, ._distance_matrix.jpg in the error below), so plotextractor tries to convert them too and fails, because they are not actually images:

Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/opt/inspire/lib/python2.7/site-packages/celery/app/trace.py", line 438, in __protected_call__
    return self.run(*args, **kwargs)
  File "/opt/inspire/lib/python2.7/site-packages/invenio_base/helpers.py", line 49, in decorated_func
    result = f(*args, **kwargs)
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/workers/worker_celery.py", line 49, in celery_run
    return run_worker(workflow_name, data, **kwargs).uuid
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/worker_engine.py", line 47, in run_worker
    run_workflow(wfe=engine, data=objects, **kwargs)
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/client.py", line 103, in run_workflow
    raise exception_triggered
WorkflowError: WorkflowError(Error: CorruptImageError('Not a JPEG file: starts with 0x00 0x05 /afs/.../workflows/storage/8d2da946-64e0-11e6-8cee-02163e010841/1608.04885.tar.gz_files/pics/._distance_matrix.jpg @ error/jpeg.c/JPEGErrorHandler/297',)
Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/engine.py", line 429, in processing_factory
    self.run_callbacks(callbacks, objects, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 422, in run_callbacks
    self.execute_callback(f, obj)
  File "/opt/inspire/lib/python2.7/site-packages/invenio_workflows/engine.py", line 512, in execute_callback
    callback(obj, self)
  File "/opt/inspire/src/inspire/inspirehep/modules/oaiharvester/tasks/arxiv.py", line 169, in arxiv_plot_extract
    plots = process_tarball(tarball)
  File "/opt/inspire/lib/python2.7/site-packages/plotextractor/api.py", line 77, in process_tarball
    converted_image_mapping = convert_images(image_list)
  File "/opt/inspire/lib/python2.7/site-packages/plotextractor/converter.py", line 160, in convert_images
    convert_image(image_file, converted_image_file, image_format)
  File "/opt/inspire/lib/python2.7/site-packages/plotextractor/converter.py", line 172, in convert_image
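
One way to avoid this failure could be to drop such files from the image list before conversion. The following is only a minimal sketch, not the fix that was eventually merged: the helper names `is_macos_metadata` and `filter_image_list` are hypothetical and not part of plotextractor, and it assumes that the metadata files can be recognised by name alone, either by the `._` prefix seen in the traceback or by living under a `__MACOSX` directory.

```python
def is_macos_metadata(path):
    """Return True if ``path`` looks like a macOS hidden metadata file.

    Assumption: such files are named ``._<original name>`` or are stored
    under a ``__MACOSX`` directory inside the archive.
    """
    # Normalise separators so the check also works on Windows-style paths.
    parts = path.replace("\\", "/").split("/")
    if "__MACOSX" in parts:
        return True
    return parts[-1].startswith("._")


def filter_image_list(image_list):
    """Return ``image_list`` without macOS metadata files, keeping order."""
    return [path for path in image_list if not is_macos_metadata(path)]
```

In this sketch, the filter would run on the list of image paths just before it is passed to `convert_images`, so that `pics/._distance_matrix.jpg` is skipped while `pics/distance_matrix.jpg` is still converted.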

jacquerie commented 7 years ago

An example from Sentry: https://sentry.cern.ch/inspire-sentry/inspire-labs/group/821464/

michamos commented 4 years ago

Fixed in #17.