openaustralia / morph

Take the hassle out of web scraping
https://morph.io
GNU Affero General Public License v3.0

Python scraper with no changes started failing with ssl error 'certificate verify failed' on 1 Oct 2021 #1276

Open cofiem opened 2 years ago

cofiem commented 2 years ago

A scraper I built started failing with an SSL error on 1 Oct 2021.

I'm not sure how to fix this. The full build and run log is below.

Injecting configuration and compiling...
       -----> Python app detected
-----> Installing python-3.6.2
-----> Installing pip
-----> Installing requirements with pip
       Collecting lxml==4.6.3
       Downloading lxml-4.6.3-cp36-cp36m-manylinux2014_x86_64.whl (6.3 MB)
       Collecting requests==2.26.0
       Downloading requests-2.26.0-py2.py3-none-any.whl (62 kB)
       Collecting certifi>=2017.4.17
       Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
       Collecting charset-normalizer~=2.0.0
       Downloading charset_normalizer-2.0.7-py3-none-any.whl (38 kB)
       Collecting idna<4,>=2.5
       Downloading idna-3.3-py3-none-any.whl (61 kB)
       Collecting urllib3<1.27,>=1.21.1
       Downloading urllib3-1.26.7-py2.py3-none-any.whl (138 kB)
       Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests, lxml
       Successfully installed certifi-2021.10.8 charset-normalizer-2.0.7 idna-3.3 lxml-4.6.3 requests-2.26.0 urllib3-1.26.7
       
       -----> Discovering process types
       Procfile declares types -> scraper
Injecting scraper and running...
Reading petition list
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 706, in urlopen
    chunked=chunked,
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 382, in _make_request
    self._validate_conn(conn)
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 1010, in _validate_conn
    conn.connect()
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connection.py", line 426, in connect
    tls_in_tls=tls_in_tls,
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/util/ssl_.py", line 450, in ssl_wrap_socket
    sock, context, tls_in_tls, server_hostname=server_hostname
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/util/ssl_.py", line 493, in _ssl_wrap_socket_impl
    return ssl_context.wrap_socket(sock, server_hostname=server_hostname)
  File "/app/.heroku/python/lib/python3.6/ssl.py", line 401, in wrap_socket
    _context=self, _session=session)
  File "/app/.heroku/python/lib/python3.6/ssl.py", line 808, in __init__
    self.do_handshake()
  File "/app/.heroku/python/lib/python3.6/ssl.py", line 1061, in do_handshake
    self._sslobj.do_handshake()
  File "/app/.heroku/python/lib/python3.6/ssl.py", line 683, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/adapters.py", line 449, in send
    timeout=timeout
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/connectionpool.py", line 756, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/app/.heroku/python/lib/python3.6/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='epetitions.brisbane.qld.gov.au', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)'),))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scraper.py", line 254, in <module>
    petitions.run()
  File "scraper.py", line 42, in run
    petition_list_page = self.download_html(self.petition_list)
  File "scraper.py", line 208, in download_html
    page = requests.get(url)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/api.py", line 75, in get
    return request('get', url, params=params, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/app/.heroku/python/lib/python3.6/site-packages/requests/adapters.py", line 514, in send
    raise SSLError(e, request=request)
requests.exceptions.SSLError: HTTPSConnectionPool(host='epetitions.brisbane.qld.gov.au', port=443): Max retries exceeded with url: / (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:748)'),))

cofiem commented 2 years ago

Ah, the problem might be the expiry of the DST Root CA X3 root certificate, which Let's Encrypt certificate chains relied on, on 30 Sept 2021.

https://letsencrypt.org/docs/dst-root-ca-x3-expiration-september-2021/
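
To check that theory from inside the scraper environment, something like the sketch below might help. It uses only the standard-library `ssl` module. The `notAfter` timestamp for DST Root CA X3 is the commonly reported value and should be treated as an assumption, and the helper names `is_expired` and `expired_trusted_roots` are made up for this example. It does not contact the failing site; it only reports the OpenSSL build in use and any expired CA certificates in the default trust store.

```python
import ssl
from datetime import datetime, timezone

# Assumption: commonly reported notAfter value for DST Root CA X3.
DST_ROOT_CA_X3_NOT_AFTER = "Sep 30 14:01:15 2021 GMT"


def is_expired(not_after, now=None):
    """Return True if a certificate 'notAfter' string is in the past."""
    expiry = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return now >= expiry


def expired_trusted_roots(context=None):
    """List the subjects of expired CA certificates loaded in an SSL context."""
    context = context or ssl.create_default_context()
    return [
        cert.get("subject")
        for cert in context.get_ca_certs()
        if is_expired(cert["notAfter"])
    ]


if __name__ == "__main__":
    # Older OpenSSL builds are the ones usually reported as affected.
    print("OpenSSL:", ssl.OPENSSL_VERSION)
    print("DST Root CA X3 expired:", is_expired(DST_ROOT_CA_X3_NOT_AFTER))
    for subject in expired_trusted_roots():
        print("expired root in trust store:", subject)
```

If the OpenSSL version printed is an old one, the expired root in the server's chain is a plausible cause even though `certifi` itself is up to date in the log above. This is a diagnostic aid only, not a confirmed fix.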