python / cpython

The Python programming language
https://www.python.org
Other
63.18k stars 30.25k forks source link

Urllib default SSL context erratic in Windows #101738

Open bitlogik opened 1 year ago

bitlogik commented 1 year ago

On Windows 10. I mostly expect that Windows 8.1 and 11 are performing the same.

Python 3.9.11 and 3.10.9 are affected for sure. I think all recent Python versions are affected.

How to Reproduce ?

Delete all the following certificates from the Computer certificates manager (and the user certificate manager)

I think we can also start from a brand new Windows machine with the bare minimal dozen roots.

At the end (starting point to reproduce) is not an exotic Windows configuration, but a pretty standard one. We initially investigate on this matter as a customer faces this issue on one of our commercial Python bundled application.

Open in Edge : https://valid-isrgrootx1.letsencrypt.org/

Now runs this snippet

from urllib import request

domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
resp = request.urlopen(req)

Sometimes its OK Sometimes it throws urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>

That's the issue. The behavior is not even the same between Python sessions.

Investigation

The "divergence" seems to be during loading. Like the certificates store is loaded once when Python starts, probably during import urllib. I saw that because on an affected app which communicates with various domains, it is "all or nothing". Sometimes the app starts and all HTTPS calls are OK. And some other start, all the calls to HTTPS to some web domains are failing (cert expired).

To see more information I did :

from urllib import request
import ssl

domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
ctx = ssl.SSLContext()
ctx.load_default_certs()
print(ctx.cert_store_stats())
resp = request.urlopen(req, context=ctx)

But it doesn't bring the issue. That means this is a different behaviour from when the context is loaded "automatically by default" (no context provided). At least that provide a workaround of this issue. But many developer just use plain urlopen. I mean it complexifies all the queries as one needs to import ssl and load a default context.

So the "automatic load by default" (context is None) code path in Windows is kind of faulty, as it seems it can't see the same certificates at launch time.

My investigations suggests it happens when a website TLS proxy handle the new ISRG Root X1 and also the old DST Root CA X3. The problem is that there is a lot of such web sites or APIs.

Interesting point with "valid-isrgrootx1.letsencrypt.org" is that the expired chain it refers to, is a chain for a different domains. That means it check for certificate validity before checking for domain validity (if it does so).

Additionally if one deletes the DST Root CA X3 certificates, the error is similar it just tell about "certificate verify failed: unable to get issuer certificate" which is a variant of this issue. During some start, it can't see the valid full cert path with the new ISRG X1.

What is the more mysterious in this issue is that during some Python program start, it behaves well. And during some others session, it doesn't (like ignoring the IRSG1 new certificate). It feels like Python openssl doesn't always load the certificate store the same way.

Is there a way to reach the context of the request? Something like resp._sock._ctx ?

This issue is very close to https://github.com/python/cpython/issues/89535 but different: the certificate store is not changing between run sessions, it is only how Python openssl loads the certificates. And the behavior is erratic : sometimes the store loading is OK, sometimes partial and provides a bad default ssl context. One expects the default SSLcontext calls load_default_certs() and behaves same, but this is not the case. It is different, and from one start to an other, it doesn't behaves the same.

I feel the issue lies in openssl. As Python chose to use and integrates openssl, this is an isue in Python. Also this affects developers (Python users), as one default loading is not working properly.

We'll continue to investigate, focusing only on openssl.

bitlogik commented 1 year ago

Some variations samples.

from urllib import request
domain = "https://valid-isrgrootx1.letsencrypt.org"
resp = request.urlopen(domain)

🐛 Affected

from urllib import request
domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
resp = request.urlopen(req)

🐛 Affected

from urllib import request
import ssl
domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
ctx = ssl.create_default_context()
resp = request.urlopen(req, context=ctx)

🐛 Affected

from urllib import request
import ssl
domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
ctx = ssl.SSLContext()
ctx.load_default_certs()
resp = request.urlopen(req, context=ctx)

✅ Work (always OK)

The seems to be in create_default_context(), but the bug triggers during the initial load (import?) of the Python program. Then the queries behavior are all the same during the program execution (every time OK, or every time error when LetsEncrypt).

Here I use valid-isrgrootx1.letsencrypt.org but that is the exact same with any similar TLS certs path. In a way, all LetsEncrypt web endpoints brings this issue, and that means a lot.

bitlogik commented 1 year ago

And exact same issue with

context = ssl.create_default_context()
sock = socket.create_connection((domain, port))
self.conn = context.wrap_socket(sock, server_hostname=domain)
bitlogik commented 1 year ago

Going deeper

domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_default_certs()
resp = request.urlopen(req, context=ctx)

🐛 Affected

domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
ctx = ssl.SSLContextssl.PROTOCOL_TLS)
ctx.load_default_certs()
resp = request.urlopen(req, context=ctx)

✅ Work (always OK)

Notice the protocol is TLS in the second case, adn TLS_CLIENT in the broken case.

I guess the TLS is only providing basic checks, hence it works always right. Still the mystery is why using TLS_CLIENT, it works some time, and others run it doesn't ?

bitlogik commented 1 year ago

Here'a a basic example illustrating the "erratic" part.

import ssl
from urllib import request

domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_default_certs()
try:
    resp = request.urlopen(req, context=ctx)
    print("OK")
except:
    print("Oh no")

Run every 1 or 2 seconds

image

The exception being : urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>

And the difference is not triggered by load_default_certs(), nor when connecting at the server. But triggered at the beginning of the Python application, when it loads.

bitlogik commented 1 year ago

Very weird behavior

import ssl

import logging
logging.basicConfig(level=0)

from urllib import request

print(ssl.OPENSSL_VERSION)

domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_default_certs()
print(ctx.cert_store_stats())
try:
    resp = request.urlopen(req, context=ctx)
    print("OK")
except:
    print("Oh no")
    raise

Results

>python39 min_rep.py
OpenSSL 1.1.1n  15 Mar 2022
{'x509': 126, 'crl': 0, 'x509_ca': 124}
OK

>python39 min_rep.py
OpenSSL 1.1.1n  15 Mar 2022
{'x509': 126, 'crl': 0, 'x509_ca': 124}
Oh no
Traceback (most recent call last):
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1346, in do_open
    h.request(req.get_method(), req.selector, req.data, headers,
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1040, in _send_output
    self.send(msg)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 980, in send
    self.connect()
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\http\client.py", line 1454, in connect
    self.sock = self._context.wrap_socket(self.sock,
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\ssl.py", line 500, in wrap_socket
    return self.sslsocket_class._create(
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\ssl.py", line 1040, in _create
    self.do_handshake()
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\ssl.py", line 1309, in do_handshake
    self._sslobj.do_handshake()
ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "min_rep.py", line 32, in <module>
    resp = request.urlopen(req, context=ctx)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 214, in urlopen
    return opener.open(url, data, timeout)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 517, in open
    response = self._open(req, data)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 534, in _open
    result = self._call_chain(self.handle_open, protocol, protocol +
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 494, in _call_chain
    result = func(*args)
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1389, in https_open
    return self.do_open(http.client.HTTPSConnection, req,
  File "C:\Users\USER\AppData\Local\Programs\Python\Python39\lib\urllib\request.py", line 1349, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1129)>

I see the Windows certificate store reading is not part of openssl, but in cpython. There's an issue where the store is loaded, or when the information are transmitted to openssl. Sometimes it works, sometimes not.

This can be also the proper checks are not performed when it works. So in this case, the issue is opposite as we can think.

bitlogik commented 1 year ago

I can't dig and help further, as this becomes deep in cpython. For example, _enumcertificates is in C, in a very large C file dedicated to SSL. https://github.com/python/cpython/blob/3.9/Modules/_ssl.c#L5724

bitlogik commented 1 year ago

Testing to see if one can see any difference when it's OK or not.

import ssl
import logging

logging.basicConfig(level=0)

from urllib import request

print(ssl.OPENSSL_VERSION)

domain = "https://valid-isrgrootx1.letsencrypt.org"
req = request.Request(domain)
ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_default_certs()
print(ctx.cert_store_stats())
print(ctx.options)
print(ctx.verify_flags)
print(ctx.verify_mode)
print(ssl.HAS_ALPN)
print(ssl.HAS_NEVER_CHECK_COMMON_NAME)
print(ssl.HAS_ALPN)
print(ssl.HAS_ECDH)
print(ssl.HAS_SNI)
print(ssl.HAS_NPN)
print(ssl.HAS_SSLv2)
print(ssl.HAS_SSLv3)
print(ssl.HAS_TLSv1)
print(ssl.HAS_TLSv1_1)
print(ssl.HAS_TLSv1_2)
print(ssl.HAS_TLSv1_3)
print(ssl.CHANNEL_BINDING_TYPES)
try:
    resp = request.urlopen(req, context=ctx)
    print("OK")
except:
    print("Oh no")
>python39 min_rep.py
OpenSSL 1.1.1n  15 Mar 2022
{'x509': 126, 'crl': 0, 'x509_ca': 124}
Options.OP_NO_SSLv3|OP_CIPHER_SERVER_PREFERENCE|OP_ENABLE_MIDDLEBOX_COMPAT|OP_NO_COMPRESSION|OP_ALL
VerifyFlags.VERIFY_X509_TRUSTED_FIRST
VerifyMode.CERT_REQUIRED
True
True
True
True
True
False
False
False
True
True
True
True
['tls-unique']
OK

 Retry 2 second later.
>python39 min_rep.py
OpenSSL 1.1.1n  15 Mar 2022
{'x509': 126, 'crl': 0, 'x509_ca': 124}
Options.OP_NO_SSLv3|OP_CIPHER_SERVER_PREFERENCE|OP_ENABLE_MIDDLEBOX_COMPAT|OP_NO_COMPRESSION|OP_ALL
VerifyFlags.VERIFY_X509_TRUSTED_FIRST
VerifyMode.CERT_REQUIRED
True
True
True
True
True
False
False
False
True
True
True
True
['tls-unique']
Oh no

"There are the exact same picture"

I start to think about some possible causes :

bitlogik commented 1 year ago

On one software, which only queries one external API domain, we decided yesterday to include the ISRG X1 root in the bundle. And that works perfectly. Openssl is not confused when there is only one root certificate.

bitlogik commented 1 year ago

Concerning the Python versions

`> python36 min_rep.py

OpenSSL 1.0.2q  20 Nov 2018
{'x509': 98, 'crl': 0, 'x509_ca': 96}
Options.OP_NO_SSLv3|OP_NO_SSLv2|OP_CIPHER_SERVER_PREFERENCE|OP_SINGLE_DH_USE|OP_SINGLE_ECDH_USE|OP_NO_COMPRESSION|OP_ALL
VerifyFlags.VERIFY_X509_TRUSTED_FIRST
VerifyMode.CERT_REQUIRED
has NPN
['tls-unique']

Always OK

`> python39 min_rep.py

OpenSSL 1.1.1n  15 Mar 2022
{'x509': 128, 'crl': 0, 'x509_ca': 126}
Options.OP_NO_SSLv3|OP_CIPHER_SERVER_PREFERENCE|OP_ENABLE_MIDDLEBOX_COMPAT|OP_NO_COMPRESSION|OP_ALL
VerifyFlags.VERIFY_X509_TRUSTED_FIRST
VerifyMode.CERT_REQUIRED
No NPN
['tls-unique']

Affected

`> python310 min_rep.py

OpenSSL 1.1.1s  1 Nov 2022
{'x509': 128, 'crl': 0, 'x509_ca': 126}
Options.OP_NO_SSLv3|OP_CIPHER_SERVER_PREFERENCE|OP_ENABLE_MIDDLEBOX_COMPAT|OP_NO_COMPRESSION|OP_ALL
VerifyFlags.VERIFY_X509_TRUSTED_FIRST
VerifyMode.CERT_REQUIRED
No NPN
['tls-unique']

Affected

Python36 always connects right. It uses a lower amount of certificates. Probably because OpenSSL 1.0.2 can manage/load less. Python3.9 and 3.10 behave the same. Some queries are rejected.

bitlogik commented 1 year ago

After cleaning the Current User Intermediates Certificates from LetEncrypt ISRG and DST, it works right. Not so easy to trigger this issue. It is more robust than actually sought. Anyway, that could be good to investigate further, and eventually make the certificate loading deterministic : sorting the loaded certificates if needed, so the TLS behaviour is the same across all Python runs.