unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Take two, shell=False #111

Closed divergentdave closed 10 years ago

divergentdave commented 10 years ago

Interpolate env variables in Python, and use shell=False. See #84.
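
A minimal sketch of what interpolating environment variables in Python can look like, so that the path no longer depends on a shell to expand it. The helper name `expand_path` and the variable `IG_DATA_DIR` are hypothetical and are not taken from this PR; the actual change may differ.

```python
import os


def expand_path(path):
    """Expand $VARS and a leading ~ in Python, so no shell is needed."""
    return os.path.expanduser(os.path.expandvars(path))


# Hypothetical variable, set here only so the example is self-contained.
os.environ["IG_DATA_DIR"] = "/tmp/ig-data"

real_pdf_path = expand_path("$IG_DATA_DIR/usps/2014/report.pdf")
print(real_pdf_path)
```

Once the path is fully expanded in Python, the subprocess call no longer needs `shell=True` for variable expansion.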

konklone commented 10 years ago

Hmm. Maybe it wasn't the path expansion that was needed? I'm still getting this error when running the usps scraper on this PR branch:

GET - https://uspsoig.gov/sites/default/files/document-library-files/2014/rarc-ib-14-003-dr.pdf
"GET /sites/default/files/document-library-files/2014/rarc-ib-14-003-dr.pdf HTTP/1.1" 200 1041223
    report: usps/2014/rarc-ib-14-003-dr/report.pdf
Traceback (most recent call last):

  File "/home/eric/unitedstates/inspectors-general/inspectors/utils/utils.py", line 24, in run
    run_method(cli_options)

  File "./inspectors/usps.py", line 65, in run
    inspector.save_report(report)

  File "/home/eric/unitedstates/inspectors-general/inspectors/utils/inspector.py", line 44, in save_report
    metadata = extract_metadata(report)

  File "/home/eric/unitedstates/inspectors-general/inspectors/utils/inspector.py", line 160, in extract_metadata
    metadata = utils.metadata_from_pdf(report_path)

  File "/home/eric/unitedstates/inspectors-general/inspectors/utils/utils.py", line 188, in metadata_from_pdf
    output = subprocess.check_output("pdfinfo \"%s\"" % (real_pdf_path), shell=False)

  File "/home/eric/.pyenv/versions/3.4.0/lib/python3.4/subprocess.py", line 605, in check_output
    with Popen(*popenargs, stdout=PIPE, **kwargs) as process:

  File "/home/eric/.pyenv/versions/3.4.0/lib/python3.4/subprocess.py", line 848, in __init__
    restore_signals, start_new_session)

  File "/home/eric/.pyenv/versions/3.4.0/lib/python3.4/subprocess.py", line 1441, in _execute_child
    raise child_exception_type(errno_num, err_msg)

FileNotFoundError: [Errno 2] No such file or directory: 'pdfinfo "/home/eric/unitedstates/inspectors-general/data/usps/2014/rarc-ib-14-003-dr/report.pdf"'

But the PDF is there, and when I run the command myself, in my terminal, it works fine:

$ pdfinfo "/home/eric/unitedstates/inspectors-general/data/usps/2014/rarc-ib-14-003-dr/report.pdf"

Title:          RARC-IB-14-003-DR Government as a Postal Customer and Partner: International Round Table Recap
Subject:        RARC-IB-14-003-DR Government as a Postal Customer and Partner: International Round Table Recap
Keywords:       USPS OIG; innovation; issue brief; round table; government services; business; postal service; platform; new services; business plan; UPU; Universal Postal Union; Swiss Post; Poste Italiane; New Zealand Post; Postal Union of the Americas; PIP; EPFL; American University; NIC Technologies; IBM; Elmar Toime; USPS
Author:         United States Postal Service Office of the Inspector General
Creator:        Adobe InDesign CS6 (Macintosh)
Producer:       Adobe PDF Library 10.0.1
CreationDate:   Fri Aug  8 14:56:13 2014
ModDate:        Fri Aug  8 14:59:59 2014
Tagged:         no
Form:           AcroForm
Pages:          16
Encrypted:      yes (print:yes copy:yes change:no addNotes:no algorithm:AES)
Page size:      1024 x 768 pts
Page rot:       0
File size:      1041223 bytes
Optimized:      yes
PDF version:    1.7

This may be tough for you to debug from a Windows system; I'm assuming it works there. If you can't figure it out, let me know and I'll work on this over the weekend.

divergentdave commented 10 years ago

My next guess is that the command needs to be split into a list of strings, rather than passed as a single space-separated string. Could you give that a try?
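
A short sketch of the list-of-strings suggestion. On POSIX systems, when `shell=False` and the command is a single string, `subprocess` treats the whole string as the program name, which matches the `FileNotFoundError` in the traceback above. The example uses the running Python interpreter as a stand-in for `pdfinfo` so that it runs anywhere; the helper name `run_tool` and the sample path are hypothetical.

```python
import subprocess
import sys


def run_tool(args):
    """Run a command given as a list of strings, without a shell."""
    return subprocess.check_output(args, shell=False).decode("utf-8")


# Hypothetical path; the space shows that no manual quoting is needed.
pdf_path = "/tmp/some report.pdf"

# Stand-in for ["pdfinfo", pdf_path]: a child process that echoes its argument.
command = [sys.executable, "-c", "import sys; print(sys.argv[1])", pdf_path]
output = run_tool(command)
print(output.strip())


def single_string_fails():
    """Return True if the single-string form raises FileNotFoundError."""
    try:
        # The whole string, quotes included, is looked up as the executable.
        subprocess.check_output('pdfinfo "%s"' % pdf_path, shell=False)
    except FileNotFoundError:
        return True
    return False
```

With the list form, each element reaches the child process as one argument, so the embedded quotes around the path are unnecessary and should be dropped.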

konklone commented 10 years ago

Bingo! Fixed and merged. Thanks, @divergentdave.