While going through some logs, I spotted a PDF file with an "owner password" set. The file is at https://www.si.edu/Content/OIG/SAR/Semiannual_Report_033116.pdf. Files like this can be opened by most readers, though various features (e.g. printing, copying, extracting pages, even a11y/screen readers) are disabled according to the file's metadata. The Poppler devs play along with Adobe's rules here, which means we get the error messages below. Evince has a bad day with this file too.
To fix this, we'll need to do something like removing the owner-password restrictions from the file before extracting text and metadata.
```
[semiannual_report][2016-03-31][Semiannual_Report_033116]
 report: smithsonian/2016/Semiannual_Report_033116/report.pdf
Syntax Error: Invalid encryption key length
Command Line Error: Incorrect password
Error extracting metadata for smithsonian/2016/Semiannual_Report_033116/report.pdf:
Traceback (most recent call last):
  File "inspectors/utils/utils.py", line 488, in metadata_from_pdf
    output = subprocess.check_output(["pdfinfo", real_pdf_path], shell=False)
  File "/usr/lib/python3.4/subprocess.py", line 620, in check_output
    raise CalledProcessError(retcode, process.args, output=output)
subprocess.CalledProcessError: Command '['pdfinfo', '/home/david/inspectors-general/data/smithsonian/2016/Semiannual_Report_033116/report.pdf']' returned non-zero exit status 1
Syntax Error: Invalid encryption key length
Command Line Error: Incorrect password
Error extracting text to /home/david/inspectors-general/data/smithsonian/2016/Semiannual_Report_033116/report.txt:
Traceback (most recent call last):
  File "inspectors/utils/utils.py", line 391, in text_from_pdf
    real_text_path], shell=False)
  File "/usr/lib/python3.4/subprocess.py", line 561, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['pdftotext', '-layout', '-nopgbrk', '/home/david/inspectors-general/data/smithsonian/2016/Semiannual_Report_033116/report.pdf', '/home/david/inspectors-general/data/smithsonian/2016/Semiannual_Report_033116/report.txt']' returned non-zero exit status 1
```
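
For discussion, here is a rough sketch of what a pre-processing step might look like. It assumes `qpdf` is installed and uses its `--decrypt` option to write an unencrypted copy, then retries `pdfinfo` on that copy. The function names (`qpdf_decrypt_command`, `decrypted_copy`, `pdfinfo_with_decrypt_fallback`) are made up for illustration and are not part of `inspectors/utils/utils.py`. I have not checked whether qpdf copes with this particular file, so treat it as a starting point only.

```python
import os
import subprocess
import tempfile


def qpdf_decrypt_command(src_path, dst_path):
    """Build the argument list for writing an unencrypted copy with qpdf."""
    # Assumption: the file has only an owner password (empty user password),
    # so no password needs to be passed on the command line.
    return ["qpdf", "--decrypt", src_path, dst_path]


def decrypted_copy(pdf_path):
    """Write a decrypted temporary copy of pdf_path and return its path.

    The caller is responsible for deleting the returned file.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    os.close(fd)
    try:
        subprocess.check_call(qpdf_decrypt_command(pdf_path, tmp_path),
                              shell=False)
    except (OSError, subprocess.CalledProcessError):
        # Clean up the empty temp file if qpdf is missing or fails.
        os.remove(tmp_path)
        raise
    return tmp_path


def pdfinfo_with_decrypt_fallback(pdf_path):
    """Run pdfinfo; if it fails, retry once on a decrypted copy."""
    try:
        return subprocess.check_output(["pdfinfo", pdf_path], shell=False)
    except subprocess.CalledProcessError:
        tmp_path = decrypted_copy(pdf_path)
        try:
            return subprocess.check_output(["pdfinfo", tmp_path], shell=False)
        finally:
            os.remove(tmp_path)
```

The same fallback could wrap the `pdftotext` call. Retrying only after a failure keeps the common case (unrestricted PDFs) at a single subprocess call, at the cost of logging Poppler's error once for restricted files.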