kanzure / pdfparanoia

pdf watermark removal library for academic papers
https://pypi.python.org/pypi/pdfparanoia

Investigate possible pdf truncation #22

Open kanzure opened 11 years ago

kanzure commented 11 years ago

@zooko reports pdf truncation occurring in some of the processed pdfs.

kanzure commented 11 years ago

RESF zompu:/tmp$ PYTHONPATH=~/playground/pdfparanoia/ ~/playground/pdfparanoia/bin/pdfparanoia -v -v -v ./Hemingway-2001-The_Ketogenic_Diet__A_3-_To_6-year_Follow-up_Of_150_Children_Enrolled_Prospectively.pdf   > foo
Traceback (most recent call last):
  File "/home/zooko/playground/pdfparanoia/bin/pdfparanoia", line 33, in <module>
    output = pdfparanoia.scrub(StringIO(content), verbose=verbose)
  File "/home/zooko/playground/pdfparanoia/pdfparanoia/core.py", line 53, in scrub
    content = plugin.scrub(content, verbose=verbose)
  File "/home/zooko/playground/pdfparanoia/pdfparanoia/plugins/ieee.py", line 39, in scrub
    data = copy(obj.get_data())
  File "/usr/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 256, in get_data
    self.decode()
  File "/usr/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 239, in decode
    raise PDFNotImplementedError('Unsupported predictor: %r' % pred)
pdfminer.pdftypes.PDFNotImplementedError: Unsupported predictor: 2

Exceptions raised by a plugin should be caught where the plugins are called, so that one failing plugin doesn't stop pdfparanoia from writing out the original pdf.
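
A minimal sketch of that idea, not the actual `core.py` code: the only thing taken from the traceback is the `plugin.scrub(content, verbose=verbose)` call. The function name `scrub_with_fallback` and the `plugins` argument are hypothetical, and the real plugin list is assumed to be something iterable.

```python
import sys
import traceback


def scrub_with_fallback(content, plugins, verbose=False):
    """Run each plugin in turn, skipping any plugin that raises.

    `plugins` is a hypothetical iterable of objects exposing
    scrub(content, verbose=...), as seen in the traceback above.
    """
    for plugin in plugins:
        try:
            content = plugin.scrub(content, verbose=verbose)
        except Exception:
            # The assignment above never happened, so `content` still holds
            # whatever the previous plugins produced (or the original pdf).
            if verbose:
                name = getattr(plugin, "__name__", repr(plugin))
                sys.stderr.write("plugin %s failed, skipping:\n" % name)
                traceback.print_exc(file=sys.stderr)
    return content
```

With this shape, an error such as `PDFNotImplementedError` in one plugin would only skip that plugin, and the caller would still receive content to write out. Catching `Exception` this broadly is a deliberate trade-off here: it can hide real bugs, which is why the sketch prints the traceback when `verbose` is set.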

kanzure commented 11 years ago

Experiencing this error again...

Traceback (most recent call last):
  File "./bin/pdfparanoia", line 34, in <module>
    outputcontent = pdfparanoia.scrub(StringIO(Args.in_pdf.read()), verbose=verbose)
  File "/home/kanzure/code/pdfparanoia/pdfparanoia/core.py", line 53, in scrub
    content = plugin.scrub(content, verbose=verbose)
  File "/home/kanzure/code/pdfparanoia/pdfparanoia/plugins/ieee.py", line 39, in scrub
    data = copy(obj.get_data())
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 256, in get_data
    self.decode()
  File "/usr/local/lib/python2.7/dist-packages/pdfminer/pdftypes.py", line 239, in decode
    raise PDFNotImplementedError('Unsupported predictor: %r' % pred)
pdfminer.pdftypes.PDFNotImplementedError: Unsupported predictor: 2

Command line:

kanzure@raichu:~/code/pdfparanoia$ ./bin/pdfparanoia tests/samples/sciencemag/copy-007.pdf -o copy-007-edited.pdf