unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Password-protected PDFs #283

Closed divergentdave closed 7 years ago

divergentdave commented 8 years ago

While going through some logs, I spotted a PDF file with an "owner password" set. The file is at https://www.si.edu/Content/OIG/SAR/Semiannual_Report_033116.pdf. Files like this can be opened by most readers, though the file metadata marks various features as disabled (e.g. printing, copying, extracting pages, even a11y/screen readers). The Poppler devs play along with Adobe's rules here, which means we get the error messages below. Evince has a bad day with this file too.

To fix this, we'll need to do something like this before extracting text and metadata.


```
[semiannual_report][2016-03-31][Semiannual_Report_033116]
  report: smithsonian/2016/Semiannual_Report_033116/report.pdf
Syntax Error: Invalid encryption key length
Command Line Error: Incorrect password
Error extracting metadata for smithsonian/2016/Semiannual_Report_033116/report.pdf:

Traceback (most recent call last):
  File "inspectors/utils/utils.py", line 488, in metadata_from_pdf
    output = subprocess.check_output(["pdfinfo", real_pdf_path], shell=False)
  File "/usr/lib/python3.4/subprocess.py", line 620, in check_output
    raise CalledProcessError(retcode, process.args, output=output)
subprocess.CalledProcessError: Command '['pdfinfo', '/home/david/inspectors-general/data/smithsonian/2016/Semiannual_Report_033116/report.pdf']' returned non-zero exit status 1

Syntax Error: Invalid encryption key length
Command Line Error: Incorrect password
Error extracting text to /home/david/inspectors-general/data/smithsonian/2016/Semiannual_Report_033116/report.txt:

Traceback (most recent call last):
  File "inspectors/utils/utils.py", line 391, in text_from_pdf
    real_text_path], shell=False)
  File "/usr/lib/python3.4/subprocess.py", line 561, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['pdftotext', '-layout', '-nopgbrk', '/home/david/inspectors-general/data/smithsonian/2016/Semiannual_Report_033116/report.pdf', '/home/david/inspectors-general/data/smithsonian/2016/Semiannual_Report_033116/report.txt']' returned non-zero exit status 1
```
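
For illustration, here is a rough sketch of what a decryption step before `pdfinfo`/`pdftotext` might look like. It assumes the `qpdf` command-line tool is installed and uses its `--decrypt` option; the choice of qpdf, the helper names (`qpdf_decrypt_command`, `looks_encrypted`, `decrypt_if_needed`), and the `.decrypted` filename suffix are all assumptions made for this example, not something taken from the project's code. The `/Encrypt` byte scan is only a cheap heuristic for deciding whether to bother.

```python
import os
import shutil
import subprocess


def qpdf_decrypt_command(src_path, dest_path):
    """Build the qpdf command that writes an unencrypted copy of a PDF."""
    # `qpdf --decrypt in.pdf out.pdf` strips encryption when the file opens
    # with an empty user password (the owner-password-only case).
    return ["qpdf", "--decrypt", src_path, dest_path]


def looks_encrypted(pdf_path):
    """Heuristic only: encrypted PDFs reference an /Encrypt dictionary."""
    with open(pdf_path, "rb") as f:
        return b"/Encrypt" in f.read()


def decrypt_if_needed(pdf_path):
    """Return a path to pass to pdfinfo/pdftotext (hypothetical helper).

    Falls back to the original path if the file does not look encrypted,
    qpdf is not installed, or qpdf reports an error.
    """
    if not looks_encrypted(pdf_path) or shutil.which("qpdf") is None:
        return pdf_path

    root, ext = os.path.splitext(pdf_path)
    decrypted_path = root + ".decrypted" + ext  # assumed naming scheme

    # qpdf exits with 0 on success and 3 when it succeeded with warnings.
    status = subprocess.call(
        qpdf_decrypt_command(pdf_path, decrypted_path), shell=False
    )
    if status in (0, 3) and os.path.exists(decrypted_path):
        return decrypted_path
    return pdf_path
```

With something like this in place, the extraction functions could call `decrypt_if_needed(real_pdf_path)` first and run Poppler's tools on the returned path, leaving the downloaded file untouched.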