unitedstates / inspectors-general

Collecting reports from Inspectors General across the US federal government.
https://sunlightfoundation.com/blog/2014/11/07/opengov-voices-opening-up-government-reports-through-teamwork-and-open-data/
Creative Commons Zero v1.0 Universal

[doj] Fix URLs for new subdomain, misc. cleanup #229

Closed divergentdave closed 9 years ago

divergentdave commented 9 years ago

The OIG's website has moved from http://www.justice.gov/oig/ to https://oig.justice.gov/

konklone commented 9 years ago

It looks good, and thanks for doing a once-over on fixes here. I got this error, though:

$ ./inspectors/doj.py --since=2014 --debug
https://oig.justice.gov/reports/components.htm
## Downloading: https://oig.justice.gov/reports/components.htm
GET - https://oig.justice.gov/reports/components.htm
Starting new HTTPS connection (1): oig.justice.gov
Error downloading https://oig.justice.gov/reports/components.htm:

Traceback (most recent call last):

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
    body=body, headers=headers)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 341, in _make_request
    self._validate_conn(conn)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 761, in _validate_conn
    conn.connect()

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 238, in connect
    ssl_version=resolved_ssl_version)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/util/ssl_.py", line 279, in ssl_wrap_socket
    return context.wrap_socket(sock, server_hostname=server_hostname)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/ssl.py", line 364, in wrap_socket
    _context=self)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/ssl.py", line 578, in __init__
    self.do_handshake()

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/ssl.py", line 805, in do_handshake
    self._sslobj.do_handshake()

ssl.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:600)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/adapters.py", line 370, in send
    timeout=timeout

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 574, in urlopen
    raise SSLError(e)

requests.packages.urllib3.exceptions.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:600)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):

  File "/home/eric/unitedstates/inspectors-general/inspectors/utils/utils.py", line 154, in download
    response = scraper.get(url)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/sessions.py", line 477, in get
    return self.request('GET', url, **kwargs)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/scrapelib/__init__.py", line 270, in request
    **kwargs)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/scrapelib/cache.py", line 51, in request
    resp = super(CachingSession, self).request(method, url, **kwargs)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/scrapelib/__init__.py", line 92, in request
    return super(ThrottledSession, self).request(method, url, **kwargs)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/scrapelib/__init__.py", line 157, in request
    resp = super(RetrySession, self).request(method, url, **kwargs)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/adapters.py", line 431, in send
    raise SSLError(e, request=request)

requests.exceptions.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:600)

Traceback (most recent call last):

  File "/home/eric/unitedstates/inspectors-general/inspectors/utils/utils.py", line 62, in run
    return run_method(cli_options)

  File "./inspectors/doj.py", line 512, in run
    content = get_content(starting_point)

  File "./inspectors/doj.py", line 492, in get_content
    page = BeautifulSoup(page)

  File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/bs4/__init__.py", line 162, in __init__
    elif len(markup) <= 256:

TypeError: object of type 'NoneType' has no len()

Is it possible I've miscompiled Python without an openssl lib (it's a new computer), or is there some fanciness going on now? I think you're much farther ahead than I on the state of SSLv3 and Python.
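A quick local check can help separate the two possibilities raised here. The sketch below uses only the standard library `ssl` module. The helper name `ciphers_matching` is hypothetical, and `SSLContext.get_ciphers()` needs Python 3.6 or later, so it would not run on the 3.4.2 interpreter shown in the traceback. It also inspects the interpreter's own defaults, not the narrower cipher list that urllib3 may apply on top.

```python
import ssl


def ciphers_matching(substring):
    """Return names of cipher suites enabled in a default client context
    whose name contains `substring` (hypothetical helper for diagnosis)."""
    ctx = ssl.create_default_context()
    return [c["name"] for c in ctx.get_ciphers() if substring in c["name"]]


# If `import ssl` succeeded, Python was built against an SSL library;
# this string reports which one and its version.
print("Linked SSL library:", ssl.OPENSSL_VERSION)

# An empty list means this build enables no RC4 suites by default, so a
# server that accepts only RC4 would have no cipher in common with it.
rc4 = ciphers_matching("RC4")
print("RC4 suites enabled:", rc4)
```

If the version string prints and the RC4 list is empty, the build is probably fine and the failure is more likely a cipher mismatch than a missing OpenSSL library.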

divergentdave commented 9 years ago

Ah yes, thanks for reminding me. That would be because of the ciphersuite issue in #228, which is what originally sent me down the urllib3 rabbit hole. A proper fix will have to wait for shazow/urllib3#507 to land and get picked up by requests, then I can add a TransportAdapter like so.

import requests

class TlsRc4HttpAdapter(requests.adapters.HTTPAdapter):
  """Transport adapter that re-enables use of RC4 ciphersuites. The Department
  of Justice server only supports TLS_RSA_WITH_RC4_128_SHA. Since v1.10.2,
  urllib3 does not support RC4 ciphersuites by default, because RC4 has been
  deprecated. This adapter restores the default cipher suite from the Python
  standard library."""

  def init_poolmanager(self, connections, maxsize, block=False):
    from ssl import _DEFAULT_CIPHERS
    from requests.packages.urllib3.util.ssl_ import create_urllib3_context
    # Build a context that uses the standard library's default cipher list.
    ctx = create_urllib3_context(ciphers=_DEFAULT_CIPHERS)
    super(TlsRc4HttpAdapter, self).init_poolmanager(connections, maxsize, block,
                                                    ssl_context=ctx)

scraper.mount("https://oig.justice.gov", TlsRc4HttpAdapter())
scraper.mount("http://www.justice.gov", TlsRc4HttpAdapter())

In the meantime, I could monkey-patch requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS back to Python's defaults. That will get things working, but it isn't as surgical.
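A minimal sketch of what that interim monkey-patch might look like follows. The helper name `restore_stdlib_ciphers` is hypothetical, and a `SimpleNamespace` stands in for `requests.packages.urllib3.util.ssl_` so the example runs without requests installed. The starting cipher string is made up for illustration.

```python
import ssl
import types


def restore_stdlib_ciphers(urllib3_ssl_module):
    """Overwrite the module-level DEFAULT_CIPHERS with Python's own default
    cipher string and return the value that was replaced."""
    # ssl._DEFAULT_CIPHERS is a private name; fall back to OpenSSL's
    # "DEFAULT" keyword if a given Python build does not expose it.
    stdlib_ciphers = getattr(ssl, "_DEFAULT_CIPHERS", "DEFAULT")
    previous = getattr(urllib3_ssl_module, "DEFAULT_CIPHERS", None)
    urllib3_ssl_module.DEFAULT_CIPHERS = stdlib_ciphers
    return previous


# Stand-in for requests.packages.urllib3.util.ssl_ (assumption: the real
# module exposes DEFAULT_CIPHERS the same way, as the comment describes).
fake_urllib3_ssl = types.SimpleNamespace(DEFAULT_CIPHERS="HIGH:!RC4")

previous = restore_stdlib_ciphers(fake_urllib3_ssl)
print("was:", previous)
print("now:", fake_urllib3_ssl.DEFAULT_CIPHERS)
```

In the scraper, the argument would be the real module rather than the stand-in. Because this changes a module-level global, it applies to every HTTPS connection urllib3 makes in the process, which is why it is less surgical than a per-host adapter.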

divergentdave commented 9 years ago

ping How's this?

konklone commented 9 years ago

Works like a charm (though I'm sad about RC4). Thanks, @divergentdave!