python / cpython

The Python programming language
https://www.python.org

OpenSSL 3.0 performance issue: SSLContext.set_default_verify_paths / load_verify_locations about 5x slower #95031

Open fcfangcc opened 2 years ago

fcfangcc commented 2 years ago

Bug report: the example code below runs much faster on Ubuntu 20.04 (OpenSSL 1.1) than on Ubuntu 22.04 (OpenSSL 3.x). It is not just speed: CPU usage on Ubuntu 22.04 (OpenSSL 3.x) is many times that of Ubuntu 20.04 (OpenSSL 1.1). I'm not sure whether this is an OpenSSL problem or a problem with Python's adaptation to it.

import socket
import ssl
import time

import certifi
hostname = 'www.python.org'  # any HTTPS-capable hostname will do
times = 100
pem_where = certifi.where()
context = ssl.create_default_context()
verify_total_time = 0

for i in range(times):
    with socket.create_connection((hostname, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as ssock:
            verify_start_time = time.time()
            context.load_verify_locations(pem_where)
            verify_total_time += time.time() - verify_start_time
            ssock.version()

print(f"total {verify_total_time:.4f}, avg {verify_total_time/times:.4f}")

In my environment (running in Docker):

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 3321 root      20   0  304140  81148  12792 S  42.0   0.8   0:29.69 ipython    (ubuntu22.04)
 3850 root      20   0  203348  52632  11576 S  16.7   0.5   0:06.34 ipython    (ubuntu20.04)
total 5.8634, avg 0.0586  (ubuntu22.04)
total 0.6753, avg 0.0068  (ubuntu20.04)


tiran commented 2 years ago

It is a problem in OpenSSL 3.0. Python upstream does not support OpenSSL 3.0 for good reasons: it has performance and backwards compatibility issues. On my system, load_verify_locations is about 5 times slower when using system certificates.

3.12.0a0 (heads/main-dirty:88e4eeba25d, Jul 20 2022, 08:22:18) [GCC 12.1.1 20220507 (Red Hat 12.1.1-1)]
OpenSSL 3.0.5 5 Jul 2022
100 loops of 'load_verify_locations' in 3.965sec
3.12.0a0 (heads/main-dirty:88e4eeba25d, Jul 20 2022, 08:22:18) [GCC 12.1.1 20220507 (Red Hat 12.1.1-1)]
OpenSSL 1.1.1n  15 Mar 2022
100 loops of 'load_verify_locations' in 0.871sec

By the way, you should not combine ssl.create_default_context() with certifi. A default context already loads the system cert store.
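
A minimal sketch of what that looks like (the hostname is just a placeholder; any reachable HTTPS host works):

import socket
import ssl

# Sketch: create_default_context() already loads the system trust store via
# load_default_certs(), so no certifi bundle and no extra
# load_verify_locations() call are needed for an ordinary HTTPS client.
hostname = 'www.python.org'  # placeholder
context = ssl.create_default_context()

with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        print(ssock.version())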

tiran commented 2 years ago
import ssl
import sys
import time

LOOPS = 100

print(sys.version)
print(ssl.OPENSSL_VERSION)

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)

start = time.monotonic()
for i in range(LOOPS):
    ctx.load_verify_locations('/etc/pki/tls/cert.pem')
dur = time.monotonic() - start
print(f"{LOOPS} loops of 'load_verify_locations' in {dur:0.3f}sec")
fcfangcc commented 2 years ago

Thanks. The example is separate from requests, httpx, etc. It seems that using Ubuntu 22.04 with Python is not a good idea right now, since its default OpenSSL is 3.x.

tiran commented 2 years ago

I recommend that you raise a bug with OpenSSL. Their SSL_CTX_set_default_verify_paths and SSL_CTX_load_verify_locations functions are much slower in 3.0 than in 1.1.1.

tiran commented 2 years ago

According to "perf', OpenSSL 3.0 is spending a lot of time in pthread_rwlock lock/unlock followed by sa_doall, getrn, and several libcrypto string functions (ossl_lh_strcasehash, ossl_tolower, OPENSSL_strcasecmp, OPENSSL_sk_value).

arhadthedev commented 2 years ago

a lot of time in pthread_rwlock lock/unlock

Fortunately, the issue is known (so there is no need to report it once more): openssl/openssl#16791 is the initial report, and openssl/openssl#18814 points to the root issue.

iritkatriel commented 2 years ago

Should we close this as a third party issue?

risicle commented 1 year ago

As an outsider, it appears to me that there's always going to be a risk of this becoming a severe performance bottleneck as long as Python's ssl API doesn't expose the ability to reuse an X509_STORE, forcing the system's CA bundle to be re-parsed for every new SSLContext.

ThomasChr commented 1 year ago

This one bit me today. And it did bite quite hard!

Beginning with CPython 3.11.5 (on Windows) we're shipping OpenSSL 3 instead of 1.1. So when I installed 3.12.0 on a customer system, it took a few minutes until I got a call: the system is blocked, one Python process takes all of the CPU, and no one can work anymore. It turned out that my Python process, which uses 32 threads and does nothing more than sending some simple numbers over the Internet, took all of the CPU instead of 20% as it did with Python 3.11. That's a very bad performance regression.

One of my other processes has some more accurate numbers - before:

15 seconds total, 1 second on CPU

and after:

30 seconds total, 15 seconds on CPU

Pretty sure you could have cooked coffee on the CPU after that. (With cProfile you could see most of the CPU time in {method 'load_verify_locations' of '_ssl._SSLContext' objects}.)
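
A minimal sketch of such a cProfile run (assuming requests makes the HTTPS calls; this is not the original script):

import cProfile
import pstats

import requests

# Each bare requests.get() builds a fresh SSL context, so with the regression
# present the load_verify_locations entry dominates the cumulative time.
with cProfile.Profile() as prof:
    for _ in range(20):
        requests.get('https://www.python.org')

pstats.Stats(prof).sort_stats('cumulative').print_stats(10)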

On the upper level I'm using the requests library in my code, and there were two solutions to the problem:

  1. Add verify=False to all of the requests
  2. Install Python 3.11.4

I used Method 2, but now I need to stay on Python 3.11.4 for all eternity. Hopefully this will be fixed some day soon.

Hope this post helps other people with the same problem. I will see if I'll add some info to the linked OpenSSL Issues also.

ThomasChr commented 1 year ago

As a quick solution: it seems that the time is spent when OpenSSL verifies a certificate - can't we try to cache this verification? In my code I'm making thousands of requests to the exact same host - I don't need to verify it every time. It would be a workaround for the user to add verify=False from the second request to the same host onward - but that means he needs to find out the root cause first, which is not an easy task.

I'm a little bit sad that people will say that Python is dog-slow when really it isn't our fault. But saying "not our fault" won't help much here. Having a plan B would be great. (I'm not entirely convinced that OpenSSL will fix this problem soon...)

risicle commented 1 year ago

(Another workaround: if you're only connecting to one or a few hosts, you may find that you're able to put together a custom, extremely minimal CA bundle with only the root cert(s) you need, as in the sketch below. This will be much faster to parse, though it will have the same caveats as pinning certificates.)
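
A minimal sketch of that workaround (the bundle file name is hypothetical; you would export the relevant root cert(s) yourself):

import socket
import ssl

# Load only a hand-picked bundle with the root certificate(s) the target host
# actually chains to, instead of the full system or certifi bundle.
hostname = 'www.python.org'  # placeholder
context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
context.load_verify_locations(cafile='minimal_ca_bundle.pem')  # hypothetical file

with socket.create_connection((hostname, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=hostname) as ssock:
        print(ssock.version())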

ThomasChr commented 1 year ago

Also, one could use verify=True only when sending important data like passwords or user data; otherwise verify=False will do. This is not perfect, but really verifying the certificate on every connection won't be needed most of the time.

tiran commented 1 year ago

Python has a cache for certificate verification: the SSLContext. The simplest solution to your performance problem is a single SSLContext object for all your TLS client connections. Most client applications only need a single SSLContext object during their lifetime. You configure the SSLContext according to your security profile, load the trust anchors, and then pass the object to your connection function. That way you pay the price for CA loading just once. SSLContext is thread- and async-safe.

If you are using requests or httpx, then you want to make use of requests.Session or httpx.Client, too. They enable HTTP connection pooling, which speeds up multiple requests to the same host a lot.
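
A minimal sketch of that pattern with just the standard library (the hostname is a placeholder):

import socket
import ssl

# Build one SSLContext and pay the CA-loading cost a single time; every
# connection afterwards reuses the same object.
CTX = ssl.create_default_context()

def tls_version(hostname):
    # Reuses the already-configured module-level context.
    with socket.create_connection((hostname, 443)) as sock:
        with CTX.wrap_socket(sock, server_hostname=hostname) as ssock:
            return ssock.version()

for _ in range(10):
    tls_version('www.python.org')  # no repeated load_verify_locations() cost

With httpx, as far as I know, the verify argument of httpx.Client accepts such a context object directly.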

ThomasChr commented 1 year ago

@tiran Using a requests Session was great advice and it speeds things up considerably. But we're still cooking the CPU quite a bit.

This is Python 3.11 without a Requests Session:

Sent Article 1/1 105038 (took: 17.86s real time / 1.47s cpu time)

And Python 3.11 with a Requests Session:

Sent Article 1/1 105038 (took: 15.56s real time / 0.28s cpu time)

This is Python 3.12 without a Requests Session:

Sent Article 1/1 105038 (took: 36.96s real time / 29.86s cpu time)

And Python 3.12 with a Requests Session:

Sent Article 1/1 105038 (took: 17.84s real time / 7.64s cpu time)

I could live with that, so Python 3.12 is not a no-go anymore - thanks a lot for your advice!

mm-matthias commented 7 months ago

@tiran You've said here that

SSLContext is thread and async-safe.

Is this official? If yes, can it be added to the documentation?

I am asking because the slowness of load_verify_locations pops up in downstream projects such as requests and botocore. Using a session to perform requests only helps so much; some issues persist in all of these libraries, and the only way to solve them seems to be to load the SSLContext just once and then share it across threads, urllib3's HTTPConnectionPools, and other places. So it is important to know whether the SSLContext can be freely shared between threads, e.g. see this question in a requests PR to tackle the problem.
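
A minimal sketch of that sharing pattern (it relies on the thread-safety claim quoted above, which is not yet documented):

import http.client
import ssl
from concurrent.futures import ThreadPoolExecutor

# The CA bundle is parsed once; every worker thread reuses the same context.
SHARED_CTX = ssl.create_default_context()

def head_status(host):
    conn = http.client.HTTPSConnection(host, context=SHARED_CTX)
    try:
        conn.request('HEAD', '/')
        return conn.getresponse().status
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=8) as pool:
    print(list(pool.map(head_status, ['www.python.org'] * 16)))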

fireattack commented 6 months ago
import ssl
import sys
import time

LOOPS = 100

print(sys.version)
print(ssl.OPENSSL_VERSION)

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)

start = time.monotonic()
for i in range(LOOPS):
    ctx.load_verify_locations('/etc/pki/tls/cert.pem')
dur = time.monotonic() - start
print(f"{LOOPS} loops of 'load_verify_locations' in {dur:0.3f}sec")

Others have mentioned how bad it is on Windows, but here is a clearer demonstration.

Using slightly modified code based on this, loading site-packages/certifi/cacert.pem (284 KB), which is what urllib3 loads every time it starts a new SSL context, is about 44x slower on my Windows 10 computer.
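
A rough reconstruction of that modified script (the exact code was not posted; it assumes the bundle comes from certifi.where()):

import ssl
import sys
import time

import certifi

LOOPS = 100
CAFILE = certifi.where()  # .../site-packages/certifi/cacert.pem

print(f'Python version:   {sys.version}')
print(f'OpenSSL version:  {ssl.OPENSSL_VERSION}')
print(f'Run "load_verify_locations()" on "{CAFILE}" {LOOPS} times...')

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
start = time.monotonic()
for i in range(LOOPS):
    ctx.load_verify_locations(CAFILE)
dur = time.monotonic() - start
print(f"{LOOPS} loops of 'load_verify_locations' in {dur:0.3f}sec")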

C:\sync\code\python\_gists>py -3.10 perf_ssl.py ssl
Python version:   3.10.5 (tags/v3.10.5:f377153, Jun  6 2022, 16:14:13) [MSC v.1929 64 bit (AMD64)]
OpenSSL version:  OpenSSL 1.1.1n  15 Mar 2022
Run "load_verify_locations()" on "site-packages/certifi/cacert.pem" 100 times...
100 loops of 'load_verify_locations' in 1.047sec

C:\sync\code\python\_gists>py -3.12 perf_ssl.py ssl
Python version:   3.12.1 (tags/v3.12.1:2305ca5, Dec  7 2023, 22:03:25) [MSC v.1937 64 bit (AMD64)]
OpenSSL version:  OpenSSL 3.0.11 19 Sep 2023
Run "load_verify_locations()" on "site-packages/certifi/cacert.pem" 100 times...
100 loops of 'load_verify_locations' in 46.735sec