unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Handle 404s gracefully #67

Closed by konklone 10 years ago

konklone commented 10 years ago

Apparently they cause the scraper to just stop?

Error downloading https://oig.hhs.gov/files/OIG-Strategic-Plan-2014-2018.pdf:

Traceback (most recent call last):
  File "/home/unitedstates/inspectors-general/inspectors/utils/utils.py", line 83, in download
    response = scraper.urlopen(url)
  File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 393, in urlopen
    raise HTTPError(resp)
scrapelib.HTTPError: 404 while retrieving https://oig.hhs.gov/files/OIG-Strategic-Plan-2014-2018.pdf

A note to re-run the HHS scraper for its archive after this is fixed, too. (And to contact HHS OIG about the 404.)

LindsayYoung commented 10 years ago

Do you mind if I take the hhs scraper off the safe list in the meantime?

LindsayYoung commented 10 years ago

Actually, it's fine. I just commented out the scraper in safe.yml on the server.

Thanks again to @spulec for his great work!

konklone commented 10 years ago

@LindsayYoung I was wrong about them causing the scraper to stop -- I thought this because the output was at the bottom of my scraper logs, but that was because the 404 messages are written with print() and not logging. I switched a bunch of other print() calls to logging in https://github.com/unitedstates/inspectors-general/commit/782a799867c23158f7343f064ab263da3114cb26 but left the 404 one, because it actually is convenient to see all the 404s at the bottom of the logs.

But these do not cause the scraper to hang, or to email the admin, so I think they are still safe for safe.yml.
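
For illustration, here is a minimal sketch of that behavior, not the project's actual code: a download helper catches the HTTP error for a single report, records it, and returns None so the run keeps going. `HTTPError` below is a stand-in class (the traceback above shows the real one is `scrapelib.HTTPError`), and `download`'s `fetch` argument and the `failed_urls` list are hypothetical names introduced only for this example.

```python
import logging

logger = logging.getLogger("inspectors")


class HTTPError(Exception):
    """Stand-in for scrapelib.HTTPError (assumption: status and URL are enough here)."""

    def __init__(self, status, url):
        super().__init__("%d while retrieving %s" % (status, url))
        self.status = status
        self.url = url


# Hypothetical: collected so failed downloads can be summarized after a run.
failed_urls = []


def download(url, fetch):
    """Return the body for url, or None if the fetch raised HTTPError.

    `fetch` is any callable that takes a URL and returns bytes. It is
    injected so this sketch runs without network access.
    """
    try:
        return fetch(url)
    except HTTPError as exc:
        # Record the failure and carry on instead of aborting the scraper.
        logger.warning("Error downloading %s: %s", url, exc)
        failed_urls.append(url)
        return None
```

A caller would treat a None result as "skip this report", which is roughly why a missing PDF does not hang the scraper or email the admin.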

LindsayYoung commented 10 years ago

This morning, I ran the new scrapers locally and hhs.py errored out for me. I can give it another look.

Thanks again!

konklone commented 10 years ago

Ah, hhs is crashing because of a 404 during the scraping process (one of the landing pages, not downloading a report), so that's a context-specific error that it should choke on. Worth commenting it out of safe.yml, but not an issue with handling 404s gracefully. (So, closing this specific issue.)
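
A rough sketch of the distinction, under stated assumptions: `run_scraper`, `fetch`, and the explicit `report_urls` argument are hypothetical (a real scraper would parse the report links out of the landing page), and `HTTPError` stands in for `scrapelib.HTTPError`. The landing-page request is deliberately left unguarded, while each report download is allowed to fail on its own.

```python
class HTTPError(Exception):
    """Stand-in for scrapelib.HTTPError."""


def run_scraper(landing_url, report_urls, fetch):
    """Hypothetical run: fail hard on the landing page, tolerate missing reports."""
    # Not wrapped in try/except: without the landing page there is nothing
    # to scrape, so an HTTPError here should propagate and stop this scraper.
    fetch(landing_url)

    downloaded, missing = [], []
    for url in report_urls:
        try:
            fetch(url)
            downloaded.append(url)
        except HTTPError:
            # A single missing report is noted and skipped.
            missing.append(url)
    return downloaded, missing
```

With this shape, a 404 on one report ends up in `missing`, while a 404 on the landing page surfaces as an exception, which is the case that justifies pulling a scraper out of safe.yml.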