unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

Having trouble with labor, exim and dod scrapers #99

Closed: LindsayYoung closed this issue 10 years ago

LindsayYoung commented 10 years ago

DOD is just timing out, and I can't tell if it's the scraper or something on their end, but I think labor and exim are having problems.

labor:

(inspectors)lindsay:inspectors-general lindsayyoung$ ./inspectors/labor.py 
Traceback (most recent call last):
  File "/Users/lindsayyoung/Dropbox/Projects/inspectors-general/inspectors/utils/utils.py", line 24, in run
    run_method(cli_options)
  File "./inspectors/labor.py", line 40, in run
    doc = beautifulsoup_from_url(year_url)
  File "./inspectors/labor.py", line 146, in beautifulsoup_from_url
    body = utils.download(url)
  File "/Users/lindsayyoung/Dropbox/Projects/inspectors-general/inspectors/utils/utils.py", line 84, in download
    response = scraper.urlopen(url)
  File "/Users/lindsayyoung/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 390, in urlopen
    resp = self.request(method, url, data=body, retry_on_404=retry_on_404, **kwargs)
  File "/Users/lindsayyoung/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 369, in request
    headers=headers, **kwargs)
  File "/Users/lindsayyoung/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 173, in request
    user_agent, url), url, user_agent)
scrapelib.RobotExclusionError: User-Agent 'unitedstates/inspectors-general (https://github.com/unitedstates/inspectors-general)' not allowed at 'http://www.oig.dol.gov/cgi-bin/oa_rpts.cgi?s=&y=fy92014&next_i=0&a=all'

exim:

Traceback (most recent call last):
  File "/Users/lindsayyoung/Dropbox/Projects/inspectors-general/inspectors/utils/utils.py", line 24, in run
    run_method(cli_options)
  File "./inspectors/exim.py", line 15, in run
    body = utils.download(page_url)
  File "/Users/lindsayyoung/Dropbox/Projects/inspectors-general/inspectors/utils/utils.py", line 84, in download
    response = scraper.urlopen(url)
  File "/Users/lindsayyoung/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 390, in urlopen
    resp = self.request(method, url, data=body, retry_on_404=retry_on_404, **kwargs)
  File "/Users/lindsayyoung/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 369, in request
    headers=headers, **kwargs)
  File "/Users/lindsayyoung/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 173, in request
    user_agent, url), url, user_agent)
scrapelib.RobotExclusionError: User-Agent 'unitedstates/inspectors-general (https://github.com/unitedstates/inspectors-general)' not allowed at 'http://www.exim.gov/oig/index.cfm'

LindsayYoung commented 10 years ago

I needed to re-install requirements.
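
(For anyone who lands here: that presumably means something like pip install -r requirements.txt inside the project's virtualenv, so the newer scrapelib described below gets picked up.)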

konklone commented 10 years ago

This is the same as #92 -- you need to update scrapelib.
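
(With pip that would be something like pip install --upgrade scrapelib, or re-installing from requirements.txt as above.)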

Background: scrapelib made a backwards-incompatible change in 0.10.0 that required me to remove follow_robots=False from the scrapelib initialization, in https://github.com/unitedstates/inspectors-general/commit/31ca91df8b59721b8fea6235f03386b72a93158a. In 0.10.0, robots.txt is never followed; that feature was removed entirely. In 0.9.0, however, following robots.txt is the default, so running the current repository code with scrapelib 0.9.0 installed produces this broken behavior.
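
To make the version skew concrete, here is a sketch of the before/after initialization (the retry_attempts value is illustrative, not the repo's exact configuration; only the follow_robots kwarg is the point):

import scrapelib

# scrapelib 0.9.x honored robots.txt by default, so the repo opted out
# explicitly -- this is the kwarg the commit above removed:
#   scraper = scrapelib.Scraper(retry_attempts=3, follow_robots=False)

# scrapelib 0.10.0+ removed robots.txt support entirely, so the kwarg
# had to go; the current initialization is just:
scraper = scrapelib.Scraper(retry_attempts=3)

With 0.9.0 still installed, the repository code (which no longer passes follow_robots=False) silently regains robots.txt enforcement, which is exactly the RobotExclusionError above.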

This exact issue also bit a couple of other unitedstates repos: https://github.com/unitedstates/congress/pull/140 and https://github.com/unitedstates/congress-legislators/pull/199.