unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

DOD IG seems to be trying to scrape robots.txt #92

Closed by LindsayYoung 10 years ago

LindsayYoung commented 10 years ago

I am getting an error from the DOD scraper:

```
Traceback (most recent call last):
  File "/projects/congress-api/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 152, in _robot_allowed
    parser = self._robot_parsers[robots_url]
KeyError: 'http://www.dodig.mil/robots.txt'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "inspectors/utils/utils.py", line 24, in run
    run_method(cli_options)
  File "inspectors/dod.py", line 113, in run
    body = utils.download(url)
  File "inspectors/utils/utils.py", line 84, in download
    response = scraper.urlopen(url)
  File "/projects/congress-api/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 390, in urlopen
    resp = self.request(method, url, data=body, retry_on_404=retry_on_404, **kwargs)
  File "/projects/congress-api/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 369, in request
    headers=headers, **kwargs)
  File "/projects/congress-api/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 170, in request
    not self._robot_allowed(user_agent, parsed_url)):
  File "/projects/congress-api/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 158, in _robot_allowed
    parser.read()
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/urllib/robotparser.py", line 56, in read
    f = urllib.request.urlopen(self.url)
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/urllib/request.py", line 153, in urlopen
    return opener.open(url, data, timeout)
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/urllib/request.py", line 455, in open
    response = self._open(req, data)
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/urllib/request.py", line 473, in _open
    '_open', req)
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/urllib/request.py", line 433, in _call_chain
    result = func(*args)
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/urllib/request.py", line 1261, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/urllib/request.py", line 1240, in do_open
    r = h.getresponse()
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/http/client.py", line 1148, in getresponse
    response.begin()
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/http/client.py", line 352, in begin
    version, status, reason = self._read_status()
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/http/client.py", line 314, in _read_status
    line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
  File "/projects/congress-api/.pyenv/versions/3.4.0/lib/python3.4/socket.py", line 371, in readinto
    return self._sock.recv_into(b)
TimeoutError: [Errno 110] Connection timed out
```
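The failing step here is scrapelib's pre-request robots.txt check, which is built on the standard library's `urllib.robotparser`: before the first request to a host, it fetches that host's robots.txt and asks the parser whether the URL is allowed. The `TimeoutError` is raised while fetching `http://www.dodig.mil/robots.txt` itself, before any report page is requested. The check can be sketched offline with a hypothetical robots.txt body (the real file never arrived, so its contents are unknown):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only -- the real file
# at http://www.dodig.mil/robots.txt timed out and was never retrieved.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) is the same question scrapelib asks
ua = "unitedstates/inspectors-general"
print(parser.can_fetch(ua, "http://www.dodig.mil/pubs/report.pdf"))   # True
print(parser.can_fetch(ua, "http://www.dodig.mil/private/x.html"))    # False
```

Note that `RobotFileParser.read()` (the call in the traceback) fetches the file over the network with no explicit timeout, which is why an unresponsive server stalls the whole scrape.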
LindsayYoung commented 10 years ago

I am getting another robots-related error, this time from the exim scraper:

```
Traceback (most recent call last):
  File "inspectors/utils/utils.py", line 24, in run
    run_method(cli_options)
  File "inspectors/exim.py", line 15, in run
    body = utils.download(page_url)
  File "inspectors/utils/utils.py", line 84, in download
    response = scraper.urlopen(url)
  File "/projects/congress-api/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 390, in urlopen
    resp = self.request(method, url, data=body, retry_on_404=retry_on_404, **kwargs)
  File "/projects/congress-api/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 369, in request
    headers=headers, **kwargs)
  File "/projects/congress-api/.virtualenvs/inspectors/lib/python3.4/site-packages/scrapelib/__init__.py", line 173, in request
    user_agent, url), url, user_agent)
scrapelib.RobotExclusionError: User-Agent 'unitedstates/inspectors-general (https://github.com/unitedstates/inspectors-general)' not allowed at 'http://www.exim.gov/oig/index.cfm'
```
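A `RobotExclusionError` means the robots.txt was fetched successfully this time, but its rules matched the scraper's User-Agent and disallowed the path. `urllib.robotparser` matches the token before the first `/` of the User-Agent string against `User-agent:` lines, so `unitedstates/inspectors-general (...)` is matched as `unitedstates`. A sketch with a hypothetical robots.txt (the real rules exim.gov served at the time are not shown in the thread) reproduces the behavior:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that singles out this scraper's User-Agent
# token and blocks it everywhere, while allowing all other agents.
robots_txt = """\
User-agent: unitedstates
Disallow: /

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# The token before the first "/" ("unitedstates") matches the first entry.
scraper_ua = "unitedstates/inspectors-general (https://github.com/unitedstates/inspectors-general)"
print(parser.can_fetch(scraper_ua, "http://www.exim.gov/oig/index.cfm"))    # False
print(parser.can_fetch("Mozilla/5.0", "http://www.exim.gov/oig/index.cfm"))  # True
```

When `can_fetch` returns False, scrapelib raises the `RobotExclusionError` seen above instead of issuing the request.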
konklone commented 10 years ago

Run `pip install -r requirements.txt`. I updated the version of scrapelib in https://github.com/unitedstates/inspectors-general/commit/31ca91df8b59721b8fea6235f03386b72a93158a#commitcomment-7143210, but it's actually a backwards-incompatible upgrade. If that doesn't fix it, please re-open and we'll figure it out.

konklone commented 10 years ago

And to be clear, it should upgrade scrapelib from 0.9.x to 0.10.x.