Crash in TeXKeys extraction

kaplun commented 6 years ago

Given the PDF available at http://arxiv.org/pdf/1710.01077, refextract crashes inside PyPDF2 code:

Traceback (most recent call last):
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 529, in _process
    self.run_callbacks(callbacks, objects, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 465, in run_callbacks
    indent + 1)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 481, in run_callbacks
    self.execute_callback(callback_func, obj)
  File "/opt/inspire/lib/python2.7/site-packages/workflow/engine.py", line 564, in execute_callback
    callback(obj, self)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/utils.py", line 135, in _decorator
    res = func(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/actions.py", line 238, in refextract
    references = extract_references(uri, source)
  File "/opt/inspire/lib/python2.7/site-packages/timeout_decorator/timeout_decorator.py", line 81, in new_function
    return function(*args, **kwargs)
  File "/opt/inspire/src/inspire/inspirehep/modules/workflows/tasks/refextract.py", line 95, in extract_references
    reference_format=u'{title},{volume},{page}'
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/api.py", line 149, in extract_references_from_file
    texkeys = extract_texkeys_from_pdf(path)
  File "/opt/inspire/lib/python2.7/site-packages/refextract/references/pdf.py", line 54, in extract_texkeys_from_pdf
    pdf = PdfFileReader(pdf_stream, strict=False)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1084, in __init__
    self.read(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1803, in read
    idnum, generation = self.readObjectHeader(stream)
  File "/opt/inspire/lib/python2.7/site-packages/PyPDF2/pdf.py", line 1667, in readObjectHeader
    return int(idnum), int(generation)
ValueError: invalid literal for int() with base 10: 'f'

It should instead handle the exception and continue without extracting TeXKeys.

michamos commented 6 years ago

Looks like PyPDF2 is really brittle. The crash happens while PyPDF2 is parsing the PDF, so there is nothing we could easily fix on our side. Maybe we should wrap calls to PyPDF2 in a big try/except block:

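import logging

try:
    pdf = PdfFileReader(pdf_stream, strict=False)  # or any other call into PyPDF2
except Exception:
    # log the exception and carry on without TeXKeys
    logging.exception('PyPDF2 could not parse the PDF')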
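
A minimal, self-contained sketch of that idea, not refextract's actual code. The helper name `extract_texkeys_safely` and the stand-in `broken_extractor` are hypothetical, and returning an empty list on failure is an assumption about what "continue without extracting TeXKeys" should mean for the caller:

```python
import logging

LOGGER = logging.getLogger(__name__)


def extract_texkeys_safely(extractor, path):
    """Run extractor(path); return [] instead of raising if it fails.

    `extractor` is any callable that takes a path, for example a function
    wrapping the PyPDF2 calls (assumed here to return a list of TeXKeys).
    """
    try:
        return extractor(path)
    except Exception:
        # Records the full traceback, then falls back to "no TeXKeys".
        LOGGER.exception('TeXKeys extraction failed for %s', path)
        return []


def broken_extractor(path):
    # Hypothetical stand-in that fails the same way as the traceback above.
    return [int('f')]


texkeys = extract_texkeys_safely(broken_extractor, 'paper.pdf')
```

Catching `Exception` this broadly is normally discouraged, but here the failure is in a third-party parser fed arbitrary PDFs, so the set of possible errors is hard to enumerate; logging the traceback keeps the failure visible.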

kaplun commented 6 years ago

Yeah exactly.

michamos commented 6 years ago

We wouldn't lose much anyway: texkey extraction is useful only for articles that use Inspire texkeys (and maybe those of other platforms like ADS in the future). In the vast majority of cases those are produced by a standard TeX pipeline, whose output we know PyPDF2 handles well.

inspirehep / refextract

Crash in TeXKeys extraction #40