unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

[sba] Handle landing page errors more gracefully, properly respect year range #174

Closed konklone closed 9 years ago

konklone commented 9 years ago

The sba scraper wasn't obeying the year range if the published_on timestamp wasn't found early on. The scraper has a way to hardcode publication dates or find them through other means, but by the time it got there, it no longer bothered respecting the year_range. I've fixed that, which will make the scraper more efficient for regular runs.
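
For context, a minimal sketch of the ordering issue, using hypothetical helper names (`date_from_listing`, `date_from_fallback`) rather than the scraper's actual functions:

```python
import logging

# Hypothetical helpers standing in for the scraper's real date-extraction paths;
# each returns a datetime.date or None.
def date_from_listing(result):
    """Try to read published_on from the search-result listing; may fail."""
    return result.get("published_on")

def date_from_fallback(result):
    """Fall back to a hardcoded date or one scraped from the landing page."""
    return result.get("fallback_date")

def report_from(result, year_range):
    published_on = date_from_listing(result) or date_from_fallback(result)

    # The fix: apply the year_range check *after* the fallback lookup, so reports
    # whose dates only turn up later in the function are still filtered by year.
    if published_on.year not in year_range:
        logging.debug("Skipping report, not in the requested year range.")
        return None

    return {"published_on": published_on.strftime("%Y-%m-%d")}
```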

I'm also getting an error when fetching a particular landing page -

```
Traceback (most recent call last):
  File "inspectors/utils/utils.py", line 27, in run
    return run_method(cli_options)
  File "inspectors/sba.py", line 68, in run
    report = report_from(result, year_range)
  File "inspectors/sba.py", line 117, in report_from
    landing_page = BeautifulSoup(landing_body)
  File "/home/unitedstates/.virtualenvs/inspectors/lib/python3.4/site-packages/bs4/__init__.py", line 162, in __init__
    elif len(markup) <= 256:
TypeError: object of type 'NoneType' has no len()
```

That's from fetching this page, which gets linked at the date-less entry in this screenshot. There are no permalinks -- I found this by searching for the keyword "originating" and looking at the bottom of page 4 of results.

(screenshot: okay-bad-sba)

The new behavior throws a proper exception, but I'm not sure how to handle this. The SBA site is throwing a 500, so I'll report it to the IG. But I don't want the scraper to just skip it. I'll punt on that for a bit, after notifying SBA.
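
Roughly, the guard I mean looks like this (a sketch only, not the exact diff; the exception class name is made up for illustration):

```python
from bs4 import BeautifulSoup

class MissingLandingPageError(Exception):
    """Illustrative exception for a landing page that couldn't be downloaded."""

def parse_landing_page(landing_body, landing_url):
    # If the download failed (e.g. the SBA server answered with a 500), the body
    # comes back as None; fail loudly instead of letting BeautifulSoup choke on it.
    if landing_body is None:
        raise MissingLandingPageError("Couldn't fetch landing page: %s" % landing_url)
    return BeautifulSoup(landing_body)
```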

divergentdave commented 9 years ago

LGTM :+1: Regarding the broken link, it looks like the single quote marks in the URL are causing problems. If I take them out, I get a regular 404. Google cache has a copy, so this is a recent problem. My guess is they just added a web application firewall. :smirk:
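
If we wanted a workaround on our end in the meantime, percent-encoding the quotes before fetching would be one option (sketch with a made-up URL; no guarantee the SBA server accepts the encoded form):

```python
from urllib.parse import quote

# Made-up example URL containing the kind of single quote that trips the server.
url = "https://www.sba.gov/oig/audit-report-on-the-agency's-loan-program"

# Percent-encode anything unsafe (the apostrophe becomes %27) while leaving the
# scheme separator and path slashes alone.
safe_url = quote(url, safe=":/")
print(safe_url)
```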