Closed spulec closed 10 years ago
This is so....relaxing. Thanks, @spulec!
I added a tiny commit chopping the index.cfm off the main inspector_url, as it will properly redirect. Seems slightly more future-proof, though it forces redirects on people.
I got a bunch of 404s when downloading the full Interior archive. It's Interior's fault: these are links that appear on their site:
Error downloading http://www.doi.gov/oig/reports/upload/Report-of-Investigation---Pensus-Public.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/Report-of-Investigation---Pensus-Public.pdf
Error downloading http://www.doi.gov/oig/reports/upload/USBR-Exclusive-Use---Public.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/USBR-Exclusive-Use---Public.pdf
Error downloading http://www.doi.gov/oig/reports/upload/WR-VS-GSV-0008-2010-dtd-9-22-10-Verf-of-recs-1,-2,-&-3-from-ER-EV-GSV-0002-2009.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/WR-VS-GSV-0008-2010-dtd-9-22-10-Verf-of-recs-1,-2,-&-3-from-ER-EV-GSV-0002-2009.pdf
Error downloading http://www.doi.gov/oig/reports/upload/Lake-Jackson---Helicopter_508.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/Lake-Jackson---Helicopter_508.pdf
Error downloading http://www.doi.gov/oig/reports/upload/WR-VS-BOR-0010-2010-dtd-9.3.10-Verf-Rev-of-2-recs-from-99-I-133.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/WR-VS-BOR-0010-2010-dtd-9.3.10-Verf-Rev-of-2-recs-from-99-I-133.pdf
Error downloading http://www.doi.gov/oig/reports/upload/WR-VS-MOA-0009-2010-dtd-8.23.10-Verf-Rev-of-6-Recs-from-Y-EV-MOA-0001-2008.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/WR-VS-MOA-0009-2010-dtd-8.23.10-Verf-Rev-of-6-Recs-from-Y-EV-MOA-0001-2008.pdf
Error downloading http://www.doi.gov/oig/reports/upload/ROO-ROA-MOA-1018-2010.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/ROO-ROA-MOA-1018-2010.pdf
Error downloading http://www.doi.gov/oig/reports/upload/FY-2009-Fisma-Report---Revised.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/FY-2009-Fisma-Report---Revised.pdf
Error downloading http://www.doi.gov/oig/reports/upload/Semi-Fin-11.2.09.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/Semi-Fin-11.2.09.pdf
Error downloading http://www.doi.gov/oig/reports/upload/2008-CD&L-Investigative-Report-REDACTED-with-transmittal.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/2008-CD&L-Investigative-Report-REDACTED-with-transmittal.pdf
Error downloading http://www.doi.gov/oig/reports/upload/ManagementAdvisory(post-CDL)edited07-02-08_cd2.pdf:
Traceback (most recent call last):
File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
response = scraper.urlopen(url)
File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving http://www.doi.gov/oig/reports/upload/ManagementAdvisory(post-CDL)edited07-02-08_cd2.pdf
I guess I need to contact the IG about it. Annoying.
Good news! Although the links are wrong, they are wrong in a consistent way. It appears to come from how their slugification handles spaces around hyphens.
Listed url: http://www.doi.gov/oig/reports/upload/Report-of-Investigation---Pensus-Public.pdf
Correct url: http://www.doi.gov/oig/reports/upload/Report-of-Investigation-Pensus-Public.pdf
I added a simple string replace that I'm hoping won't negatively impact other reports.
See https://github.com/unitedstates/inspectors-general/commit/d008e2fb14234bc72588f16c0f93362575f8f005
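For anyone skimming, the idea behind that kind of replace can be sketched roughly as below. This is a minimal illustration, not the code from the commit: the function name `fix_report_url` is hypothetical, and it assumes the only change needed is collapsing a triple dash into a single one.

```python
def fix_report_url(url):
    """Collapse '---' into '-' in a listed report URL.

    Hypothetical helper for illustration only; the actual commit may
    differ. Assumes the listed link is wrong only because ' - ' in the
    title was slugified to '---' instead of '-'.
    """
    return url.replace("---", "-")


listed = "http://www.doi.gov/oig/reports/upload/Report-of-Investigation---Pensus-Public.pdf"
print(fix_report_url(listed))
```

URLs without a triple dash pass through unchanged, so reports whose links were already correct should not be affected by a replace like this.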
Nice catch! But it doesn't catch all of them. Here's the condensed list of 10 404s from above -- not all use triple dashes:
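As a quick way to see which failing links a triple-dash replace would even touch, here is a rough sketch that splits the file names from the tracebacks above by whether they contain `---`. The variable names are made up for illustration, and it only checks for the pattern; it does not verify that the rewritten URLs actually resolve.

```python
BASE = "http://www.doi.gov/oig/reports/upload/"

# File names copied from the "Error downloading" lines in the log above.
failing_files = [
    "Report-of-Investigation---Pensus-Public.pdf",
    "USBR-Exclusive-Use---Public.pdf",
    "WR-VS-GSV-0008-2010-dtd-9-22-10-Verf-of-recs-1,-2,-&-3-from-ER-EV-GSV-0002-2009.pdf",
    "Lake-Jackson---Helicopter_508.pdf",
    "WR-VS-BOR-0010-2010-dtd-9.3.10-Verf-Rev-of-2-recs-from-99-I-133.pdf",
    "WR-VS-MOA-0009-2010-dtd-8.23.10-Verf-Rev-of-6-Recs-from-Y-EV-MOA-0001-2008.pdf",
    "ROO-ROA-MOA-1018-2010.pdf",
    "FY-2009-Fisma-Report---Revised.pdf",
    "Semi-Fin-11.2.09.pdf",
    "2008-CD&L-Investigative-Report-REDACTED-with-transmittal.pdf",
    "ManagementAdvisory(post-CDL)edited07-02-08_cd2.pdf",
]

# Links a '---' -> '-' replace would rewrite, and links it would leave alone.
with_triple_dash = [name for name in failing_files if "---" in name]
without_triple_dash = [name for name in failing_files if "---" not in name]

print("Touched by the replace:")
for name in with_triple_dash:
    print("  " + BASE + name)

print("Not touched by the replace:")
for name in without_triple_dash:
    print("  " + BASE + name)
```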
I just sent a note to the IG about these, linking to this thread. Hopefully they can fix them on their end.
Ah, that's frustrating. Let me know if you get a response so we can revert my most recent commit.
This was pretty straightforward (I wish they were all this easy).