divergentdave closed 9 years ago
It looks good, and thanks for doing a once-over on fixes here. I got this error, though:
$ ./inspectors/doj.py --since=2014 --debug
https://oig.justice.gov/reports/components.htm
## Downloading: https://oig.justice.gov/reports/components.htm
GET - https://oig.justice.gov/reports/components.htm
Starting new HTTPS connection (1): oig.justice.gov
Error downloading https://oig.justice.gov/reports/components.htm:
Traceback (most recent call last):
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 544, in urlopen
body=body, headers=headers)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 341, in _make_request
self._validate_conn(conn)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 761, in _validate_conn
conn.connect()
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 238, in connect
ssl_version=resolved_ssl_version)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/util/ssl_.py", line 279, in ssl_wrap_socket
return context.wrap_socket(sock, server_hostname=server_hostname)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/ssl.py", line 364, in wrap_socket
_context=self)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/ssl.py", line 578, in __init__
self.do_handshake()
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/ssl.py", line 805, in do_handshake
self._sslobj.do_handshake()
ssl.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:600)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/adapters.py", line 370, in send
timeout=timeout
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 574, in urlopen
raise SSLError(e)
requests.packages.urllib3.exceptions.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:600)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/eric/unitedstates/inspectors-general/inspectors/utils/utils.py", line 154, in download
response = scraper.get(url)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/sessions.py", line 477, in get
return self.request('GET', url, **kwargs)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/scrapelib/__init__.py", line 270, in request
**kwargs)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/scrapelib/cache.py", line 51, in request
resp = super(CachingSession, self).request(method, url, **kwargs)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/scrapelib/__init__.py", line 92, in request
return super(ThrottledSession, self).request(method, url, **kwargs)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/scrapelib/__init__.py", line 157, in request
resp = super(RetrySession, self).request(method, url, **kwargs)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/requests/adapters.py", line 431, in send
raise SSLError(e, request=request)
requests.exceptions.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:600)
Traceback (most recent call last):
File "/home/eric/unitedstates/inspectors-general/inspectors/utils/utils.py", line 62, in run
return run_method(cli_options)
File "./inspectors/doj.py", line 512, in run
content = get_content(starting_point)
File "./inspectors/doj.py", line 492, in get_content
page = BeautifulSoup(page)
File "/home/eric/.pyenv/versions/3.4.2/lib/python3.4/site-packages/bs4/__init__.py", line 162, in __init__
elif len(markup) <= 256:
TypeError: object of type 'NoneType' has no len()
Is it possible I've miscompiled Python without an openssl lib (it's a new computer), or is there some fanciness going on now? I think you're much farther ahead than I on the state of SSLv3 and Python.
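A quick way to rule out the "miscompiled Python" theory: if `import ssl` succeeds, the interpreter was built against an OpenSSL library, and the traceback above already reaches `_ssl.c`, so the module is present. The sketch below reports which OpenSSL build is linked and whether any RC4 suites are enabled by default. `describe_tls_support` is a hypothetical helper name, and `SSLContext.get_ciphers()` needs Python 3.6 or newer, so it would not run on the 3.4.2 install shown above.

```python
import ssl


def describe_tls_support():
    """Summarize the linked OpenSSL build and its default cipher list.

    Hypothetical diagnostic helper; not part of the inspectors-general code.
    """
    context = ssl.create_default_context()
    # get_ciphers() returns one dict per enabled suite (Python 3.6+).
    names = [cipher["name"] for cipher in context.get_ciphers()]
    return {
        "openssl_version": ssl.OPENSSL_VERSION,
        "cipher_count": len(names),
        "rc4_ciphers": [name for name in names if "RC4" in name],
    }


if __name__ == "__main__":
    info = describe_tls_support()
    print("Linked against:", info["openssl_version"])
    print("Enabled cipher suites:", info["cipher_count"])
    print("RC4 suites enabled by default:", info["rc4_ciphers"] or "none")
```

An empty RC4 list alongside a server that only offers an RC4 suite would be consistent with the handshake failure in the traceback.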
Ah yes, thanks for reminding me. That would be because of the ciphersuite issue in #228, which is what originally sent me down the urllib3 rabbit hole. A proper fix will have to wait for shazow/urllib3#507 to land and get picked up by requests; then I can add a TransportAdapter like so:
import requests

class TlsRc4HttpAdapter(requests.adapters.HTTPAdapter):
    """Transport adapter that re-enables use of RC4 ciphersuites. The Department
    of Justice server only supports TLS_RSA_WITH_RC4_128_SHA. Since v.1.10.2,
    urllib3 does not support RC4 ciphersuites by default, because RC4 has been
    deprecated. This adapter restores the default cipher suite from the Python
    standard library."""

    def init_poolmanager(self, connections, maxsize, block=False):
        from ssl import _DEFAULT_CIPHERS
        from requests.packages.urllib3.util.ssl_ import create_urllib3_context
        # Build a context that uses Python's own default cipher list.
        ctx = create_urllib3_context(ciphers=_DEFAULT_CIPHERS)
        super(TlsRc4HttpAdapter, self).init_poolmanager(connections, maxsize, block,
                                                        ssl_context=ctx)

# Only the DOJ hosts get the relaxed cipher list.
scraper.mount("https://oig.justice.gov", TlsRc4HttpAdapter())
scraper.mount("http://www.justice.gov", TlsRc4HttpAdapter())
In the meantime, I could monkey-patch requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS back to Python's defaults. That will get things working, but it isn't as surgical.
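One way to make the interim monkey-patch a little less global is to scope it with a context manager. This is only a sketch under stated assumptions: `patched_default_ciphers` is a hypothetical helper, not part of requests or urllib3, and it presumes the target module reads its `DEFAULT_CIPHERS` attribute at connection time. The demo uses a stand-in namespace with placeholder cipher strings instead of the real `requests.packages.urllib3.util.ssl_` module, so it runs without third-party packages.

```python
import contextlib
import types


@contextlib.contextmanager
def patched_default_ciphers(ssl_module, ciphers):
    """Temporarily set ssl_module.DEFAULT_CIPHERS to ciphers, then restore it.

    Hypothetical helper. ssl_module is expected to be something like
    requests.packages.urllib3.util.ssl_ (an assumption, not verified here).
    """
    original = ssl_module.DEFAULT_CIPHERS
    ssl_module.DEFAULT_CIPHERS = ciphers
    try:
        yield ssl_module
    finally:
        # Always restore the original value, even if the request raised.
        ssl_module.DEFAULT_CIPHERS = original


# Stand-in for the real module; both cipher strings are placeholders.
fake_ssl_module = types.SimpleNamespace(DEFAULT_CIPHERS="HIGH:!RC4")

with patched_default_ciphers(fake_ssl_module, "DEFAULT"):
    inside = fake_ssl_module.DEFAULT_CIPHERS
after = fake_ssl_module.DEFAULT_CIPHERS
print(inside, after)
```

With the real libraries, the `with` block would wrap the scraper call and receive the urllib3 `ssl_` module plus Python's default cipher string. It is still blunter than a per-host adapter, because every connection opened inside the block gets the relaxed cipher list.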
ping How's this?
Works like a charm (though I'm sad about RC4). Thanks, @divergentdave!
The OIG's website has moved from http://www.justice.gov/oig/ to https://oig.justice.gov/