python / cpython

The Python programming language
https://www.python.org
Other
62.85k stars 30.1k forks source link

SIGSEGV on macOS when calling ssl.SSLContext.load_verify_location during a shutdown #114653

Open orf opened 8 months ago

orf commented 8 months ago

Crash report

What happened?

On MacOS, calling ssl.SSLContext.load_verify_locations from multiple threads causes a SIGSEGV during process termination when an unhandled exception is raised.

I was able to create a repoduction case and I've attached the five MacOS crash reports here.

Reproduction

from multiprocessing.pool import ThreadPool
import ssl
import certifi

ctx = ssl.create_default_context()
location = certifi.where()

def test_func(_):
    ctx.load_verify_locations(location)

with ThreadPool(processes=100) as pool:
    for idx, item in enumerate(pool.imap_unordered(test_func, range(1000), chunksize=1)):
        if idx == 100:
            raise RuntimeError()
Traceback (most recent call last):
  File "/Users/tomforbes/tmp/scanner/repo.py", line 16, in <module>
    raise RuntimeError()
RuntimeError
fish: Job 1, 'python repo.py' terminated by signal SIGSEGV (Address boundary error)

MacOS crash reports:

five.txt four.txt one.txt three.txt two.txt

CPython versions tested on:

3.11

Operating systems tested on:

macOS

Output from running 'python -VV' on the command line:

Python 3.11.7 (v3.11.7:fa7a6f2303, Dec 4 2023, 15:22:56) [Clang 13.0.0 (clang-1300.0.29.30)]

orf commented 8 months ago

I got it to reliably crash again, and it seems that it's related to an exception being thrown in one thread whilst many other threads are inside _ssl.

Upping the number of open files prevents the initial exception from happening, but whenever an exception is thrown python will quit with a SIGSEGV.

Here is a crash due to a ValueError being thrown in the main thread, triggered when trying to serialize an invalid object using json.dumps:

six.txt

serhiy-storchaka commented 8 months ago

Is it possible to reproduce without using third-party libraries?

orf commented 8 months ago

I can try but I’m not sure if I will be able to and I don’t have a gigantic amount of time I can spend digging into this.

Looking at the thread traces, there don’t appear to be any stack traces that are in 3rd party native modules - it’s all in CPython or ssl internals. But the issue may be triggered by how boto3 implements connection pooling, which is something that I’m not familiar with. I’ll give it a go.

If I’m able to reproduce this again, is there any more information (outside of a reproducible test case) that would help with this?

I tried enabling faulthandler, but it didn’t output anything which makes me think it’s something that is erroring after Python has shut down.

serhiy-storchaka commented 8 months ago

Can this be reproduced without tqdm? Can this be reproduced with smaller number of threads or tasks? How many tasks were finished at the moment of the crash, how many tasks were started, and how much data the unfinished tasks read?

orf commented 8 months ago

I've managed to create a reproduction case:

from multiprocessing.pool import ThreadPool
import ssl
import certifi

ctx = ssl.create_default_context()
location = certifi.where()

def test_func(_):
    ctx.load_verify_locations(location)

with ThreadPool(processes=100) as pool:
    for idx, item in enumerate(pool.imap_unordered(test_func, range(1000), chunksize=1)):
        if idx == 100:
            raise RuntimeError()

On MacOS, this reliably causes it to trigger:

Traceback (most recent call last):
  File "/Users/tomforbes/tmp/scanner/repo.py", line 16, in <module>
    raise RuntimeError()
RuntimeError
fish: Job 1, 'python repo.py' terminated by signal SIGSEGV (Address boundary error)

if you up the threads to something silly like 10k, it errors immediately:

Traceback (most recent call last):
  File "/Users/tomforbes/tmp/scanner/repo.py", line 14, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 873, in next
  File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
  File "/Users/tomforbes/tmp/scanner/repo.py", line 10, in test_func
OSError: [Errno 24] Too many open files
fish: Job 1, 'python repo.py' terminated by signal SIGSEGV (Address boundary error)
orf commented 8 months ago

Using exit() causes the error, but os._exit(0) doesn't. Judging by this issue, it could be because OpenSSL is being shutdown/cleaned up whilst threads are waiting on a lock? Or something?

ronaldoussoren commented 8 months ago

I get the following stack trace when testing with a recentish checkout of main and openssl 3.0.12. OpenSSL was build without debug information, I'll have to rebuild it with debug information to get more information.

* thread #15, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x000000018a2268a0 libsystem_pthread.dylib`pthread_rwlock_unlock
    frame #1: 0x0000000101ae605c libcrypto.3.dylib`CRYPTO_THREAD_unlock + 12
    frame #2: 0x0000000101a96058 libcrypto.3.dylib`ENGINE_pkey_asn1_find_str + 224
    frame #3: 0x00000001019d0170 libcrypto.3.dylib`EVP_PKEY_asn1_find_str + 116
    frame #4: 0x0000000101ac4648 libcrypto.3.dylib`find_ameth + 64
    frame #5: 0x0000000101ad5c28 libcrypto.3.dylib`ossl_namemap_doall_names + 156
    frame #6: 0x0000000101ac1904 libcrypto.3.dylib`EVP_PKEY_set_type_by_keymgmt + 68
    frame #7: 0x0000000101abb830 libcrypto.3.dylib`evp_keymgmt_util_make_pkey + 60
    frame #8: 0x0000000101a8f078 libcrypto.3.dylib`decoder_construct_pkey + 440
    frame #9: 0x0000000101a8c878 libcrypto.3.dylib`decoder_process + 192
  * frame #10: 0x0000000101b81c0c libcrypto.3.dylib`der2key_decode + 672
ronaldoussoren commented 8 months ago

Stack trace of the crashing thread with openssl debug info, with open 3.0.13:

* thread #28, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x000000018a2268a0 libsystem_pthread.dylib`pthread_rwlock_unlock
    frame #1: 0x0000000101fb20a8 _ssl.cpython-313d-darwin.so`CRYPTO_THREAD_unlock(lock=<unavailable>) at threads_pthread.c:125:9 [opt]
    frame #2: 0x0000000101f64ff4 _ssl.cpython-313d-darwin.so`ENGINE_pkey_asn1_find_str(pe=0x000000028b07dd38, str="2.5.8.1.1", len=9) at tb_asnmth.c:212:5 [opt]
    frame #3: 0x0000000101ec6670 _ssl.cpython-313d-darwin.so`EVP_PKEY_asn1_find_str(pe=0x000000028b07dd98, str="2.5.8.1.1", len=9) at ameth_lib.c:117:17 [opt]
    frame #4: 0x0000000101f90de0 _ssl.cpython-313d-darwin.so`find_ameth [inlined] pkey_set_type(pkey=0x0000000000000000, e=0x0000000000000000, type=0, str="2.5.8.1.1", len=<unavailable>, keymgmt=0x0000000000000000) at p_lib.c:1539:17 [opt]
    frame #5: 0x0000000101f90dd4 _ssl.cpython-313d-darwin.so`find_ameth(name="2.5.8.1.1", data=0x000000028b07de28) at p_lib.c:1618:9 [opt]
    frame #6: 0x0000000101fa1ae8 _ssl.cpython-313d-darwin.so`ossl_namemap_doall_names(namemap=<unavailable>, number=42697824, fn=(_ssl.cpython-313d-darwin.so`find_ameth at p_lib.c:1609), data=0x000000028b07de28) at core_namemap.c:171:9 [opt]
    frame #7: 0x0000000101f80e78 _ssl.cpython-313d-darwin.so`evp_names_do_all(prov=<unavailable>, number=<unavailable>, fn=<unavailable>, data=<unavailable>) at evp_fetch.c:702:12 [opt] [artificial]
    frame #8: 0x0000000101f893b4 _ssl.cpython-313d-darwin.so`EVP_KEYMGMT_names_do_all(keymgmt=<unavailable>, fn=<unavailable>, data=<unavailable>) at keymgmt_meth.c:302:16 [opt] [artificial]
    frame #9: 0x0000000101f8e09c _ssl.cpython-313d-darwin.so`EVP_PKEY_set_type_by_keymgmt(pkey=0x000060000607eee0, keymgmt=0x00006000028b8460) at p_lib.c:1642:10 [opt]
    frame #10: 0x0000000101f8817c _ssl.cpython-313d-darwin.so`evp_keymgmt_util_make_pkey [inlined] evp_keymgmt_util_assign_pkey(pkey=0x000060000607eee0, keymgmt=0x00006000028b8460, keydata=0x0000600006885960) at keymgmt_lib.c:64:13 [opt]
    frame #11: 0x0000000101f88170 _ssl.cpython-313d-darwin.so`evp_keymgmt_util_make_pkey(keymgmt=0x00006000028b8460, keydata=0x0000600006885960) at keymgmt_lib.c:80:13 [opt]
    frame #12: 0x0000000101f5e050 _ssl.cpython-313d-darwin.so`decoder_construct_pkey(decoder_inst=<unavailable>, params=<unavailable>, construct_data=0x0000600001ccc120) at decoder_pkey.c:172:24 [opt]
    frame #13: 0x0000000101f5b850 _ssl.cpython-313d-darwin.so`decoder_process(params=0x000000028b07e018, arg=0x000000028b07e118) at decoder_lib.c:738:18 [opt]
    frame #14: 0x0000000102040f6c _ssl.cpython-313d-darwin.so`der2key_decode(vctx=0x00006000050d52a0, cin=<unavailable>, selection=<unavailable>, data_cb=(_ssl.cpython-313d-darwin.so`decoder_process at decoder_lib.c:676), data_cbarg=0x000000028b07e118, pw_cb=<unavailable>, pw_cbarg=<unavailable>) at decode_der2key.c:299:14 [opt]
    frame #15: 0x0000000101f5baa4 _ssl.cpython-313d-darwin.so`decoder_process(params=<unavailable>, arg=0x000000028b07e358) at decoder_lib.c:963:14 [opt]
    frame #16: 0x0000000102043334 _ssl.cpython-313d-darwin.so`spki2typespki_decode(vctx=<unavailable>, cin=<unavailable>, selection=<unavailable>, data_cb=(_ssl.cpython-313d-darwin.so`decoder_process at decoder_lib.c:676), data_cbarg=0x000000028b07e358, pw_cb=<unavailable>, pw_cbarg=<unavailable>) at decode_spki2typespki.c:111:10 [opt]
    frame #17: 0x0000000101f5baa4 _ssl.cpython-313d-darwin.so`decoder_process(params=<unavailable>, arg=0x000000028b07e418) at decoder_lib.c:963:14 [opt]
    frame #18: 0x0000000101f5b680 _ssl.cpython-313d-darwin.so`OSSL_DECODER_from_bio(ctx=0x0000600007a3b5d0, in=0x00006000022a9200) at decoder_lib.c:81:10 [opt]
    frame #19: 0x0000000101f5bc68 _ssl.cpython-313d-darwin.so`OSSL_DECODER_from_data(ctx=0x0000600007a3b5d0, pdata=0x000000028b07e4d8, pdata_len=0x000000028b07e4d0) at decoder_lib.c:156:9 [opt]
    frame #20: 0x000000010202cd78 _ssl.cpython-313d-darwin.so`x509_pubkey_ex_d2i_ex(pval=<unavailable>, in=<unavailable>, len=362, it=<unavailable>, tag=-1, aclass=<unavailable>, opt='\0', ctx=<unavailable>, libctx=0x0000000000000000, propq=0x0000000000000000) at x_pubkey.c:217:17 [opt]
    frame #21: 0x0000000101ece74c _ssl.cpython-313d-darwin.so`asn1_template_noexp_d2i(val=0x000000011c6f3020, in=0x000000028b07e680, len=<unavailable>, tt=0x0000000102115de8, opt='\0', ctx=0x000000028b07e880, depth=2, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:682:15 [opt]
    frame #22: 0x0000000101ecdaf4 _ssl.cpython-313d-darwin.so`asn1_template_ex_d2i(val=<unavailable>, in=<unavailable>, inlen=<unavailable>, tt=<unavailable>, opt=<unavailable>, ctx=<unavailable>, depth=<unavailable>, libctx=<unavailable>, propq=<no value available>) at tasn_dec.c:558:16 [opt] [artificial]
    frame #23: 0x0000000101ecd5bc _ssl.cpython-313d-darwin.so`asn1_item_embed_d2i(pval=0x000000028b07e730, in=0x000000028b07e728, len=<unavailable>, it=0x0000000102115cc0, tag=<unavailable>, aclass=<unavailable>, opt=<unavailable>, ctx=0x000000028b07e880, depth=2, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:422:19 [opt]
    frame #24: 0x0000000101ece74c _ssl.cpython-313d-darwin.so`asn1_template_noexp_d2i(val=0x000000028b07e730, in=0x000000028b07e7f0, len=<unavailable>, tt=0x0000000102115ec0, opt='\0', ctx=0x000000028b07e880, depth=1, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:682:15 [opt]
    frame #25: 0x0000000101ecdaf4 _ssl.cpython-313d-darwin.so`asn1_template_ex_d2i(val=<unavailable>, in=<unavailable>, inlen=<unavailable>, tt=<unavailable>, opt=<unavailable>, ctx=<unavailable>, depth=<unavailable>, libctx=<unavailable>, propq=<no value available>) at tasn_dec.c:558:16 [opt] [artificial]
    frame #26: 0x0000000101ecd5bc _ssl.cpython-313d-darwin.so`asn1_item_embed_d2i(pval=0x0000600000baa780, in=0x000000028b07e8f0, len=<unavailable>, it=0x0000000102115e88, tag=<unavailable>, aclass=<unavailable>, opt=<unavailable>, ctx=0x000000028b07e880, depth=1, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:422:19 [opt]
    frame #27: 0x0000000101eccf6c _ssl.cpython-313d-darwin.so`ASN1_item_d2i [inlined] asn1_item_ex_d2i_intern(pval=0x0000600000baa780, in=<unavailable>, len=<unavailable>, it=0x0000000102115e88, tag=-1, aclass=0, opt='\0', ctx=0x000000028b07e880, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:118:10 [opt]
    frame #28: 0x0000000101eccf44 _ssl.cpython-313d-darwin.so`ASN1_item_d2i [inlined] ASN1_item_d2i_ex(pval=0x0000600000baa780, in=<unavailable>, len=<unavailable>, it=0x0000000102115e88, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:144:9 [opt]
    frame #29: 0x0000000101eccf30 _ssl.cpython-313d-darwin.so`ASN1_item_d2i(pval=0x0000600000baa780, in=<unavailable>, len=<unavailable>, it=0x0000000102115e88) at tasn_dec.c:154:12 [opt]
    frame #30: 0x000000010202d1bc _ssl.cpython-313d-darwin.so`d2i_X509(a=<unavailable>, in=<unavailable>, len=<unavailable>) at x_x509.c:138:1 [opt] [artificial]
    frame #31: 0x0000000101fc3888 _ssl.cpython-313d-darwin.so`PEM_X509_INFO_read_bio_ex(bp=0x00006000022b2eb0, sk=0x0000000000000000, cb=<unavailable>, u=<unavailable>, libctx=0x0000000000000000, propq=<unavailable>) at pem_info.c:168:28 [opt]
    frame #32: 0x000000010200237c _ssl.cpython-313d-darwin.so`X509_load_cert_crl_file_ex(ctx=0x00006000011a6da0, file=<unavailable>, type=1, libctx=0x0000000000000000, propq=0x0000000000000000) at by_file.c:231:11 [opt]
    frame #33: 0x00000001020025b0 _ssl.cpython-313d-darwin.so`by_file_ctrl_ex(ctx=0x00006000011a6da0, cmd=<unavailable>, argp=<unavailable>, argl=<unavailable>, ret=<unavailable>, libctx=0x0000000000000000, propq=0x0000000000000000) at by_file.c:72:23 [opt]
    frame #34: 0x000000010201d088 _ssl.cpython-313d-darwin.so`X509_STORE_load_file_ex(ctx=0x00006000020bc1e0, file="/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/certifi/cacert.pem", libctx=0x0000000000000000, propq=0x0000000000000000) at x509_d2.c:52:12 [opt]
    frame #35: 0x0000000101e81b74 _ssl.cpython-313d-darwin.so`SSL_CTX_load_verify_locations [inlined] SSL_CTX_load_verify_file(ctx=0x000000011d0a9600, CAfile=<unavailable>) at ssl_lib.c:4444:12 [opt]
    frame #36: 0x0000000101e81b64 _ssl.cpython-313d-darwin.so`SSL_CTX_load_verify_locations(ctx=0x000000011d0a9600, CAfile=<unavailable>, CApath=0x0000000000000000) at ssl_lib.c:4464:28 [opt]
    frame #37: 0x0000000101e652bc _ssl.cpython-313d-darwin.so`_ssl__SSLContext_load_verify_locations at _ssl.c:4173:13 [opt]
    frame #38: 0x0000000101e6511c _ssl.cpython-313d-darwin.so`_ssl__SSLContext_load_verify_locations(self=0x00000001018e5030, args=<unavailable>, nargs=<unavailable>, kwnames=<unavailable>) at _ssl.c.h:679:20 [opt]
    frame #39: 0x000000010009f478 python.exe`method_vectorcall_FASTCALL_KEYWORDS(func=0x00000001018f0110, args=<unavailable>, nargsf=<unavailable>, kwnames=0x0000000000000000) at descrobject.c:420:24 [opt]
    frame #40: 0x0000000100091ca0 python.exe`_PyObject_VectorcallTstate(tstate=0x000000010c611b00, callable=0x00000001018f0110, args=0x000000010193c2f0, nargsf=9223372036854775810, kwnames=0x0000000000000000) at pycore_call.h:168:11 [opt]
    frame #41: 0x0000000100092a58 python.exe`PyObject_Vectorcall(callable=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=<unavailable>) at call.c:327:12 [opt] [artificial]
    frame #42: 0x00000001001b88fc python.exe`_PyEval_EvalFrameDefault(tstate=<unavailable>, frame=<unavailable>, throwflag=0) at generated_cases.c.h:815:23 [opt]
    frame #43: 0x0000000100095fc4 python.exe`_PyObject_VectorcallTstate(tstate=0x000000010c611b00, callable=0x0000000100af0110, args=0x000000028b07eee8, nargsf=1, kwnames=0x0000000000000000) at pycore_call.h:168:11 [opt]
    frame #44: 0x0000000100095164 python.exe`method_vectorcall(method=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=0x0000000000000000) at classobject.c:70:20 [opt]
    frame #45: 0x00000001002cc8dc python.exe`thread_run(boot_raw=0x00006000004a8790) at _threadmodule.c:1207:21 [opt]
    frame #46: 0x0000000100261324 python.exe`pythread_wrapper(arg=<unavailable>) at thread_pthread.h:236:5 [opt]
    frame #47: 0x000000018a22a034 libsystem_pthread.dylib`_pthread_start + 136

This tries to unlock a NULL lock, and according to the source for line two this is the global engine lock:

    CRYPTO_THREAD_unlock(global_engine_lock);

From quickly scanning the openssl code base this lock is cleared from OPENSSL_cleanup which itself is called from an atexit(3) handler.

This seems to be caused by a running thread that is still active when the process is already shutting down, which happens here because the thread pool from https://github.com/python/cpython/issues/114653#issuecomment-1913646676 uses daemon threads and AFAIK Py_Finalize() won't wait for daemon threads to exit. And indeed, if I replace the 'raise RuntimeError' statement by 'pass' the script works without problems.

I'm not sure if we can fix this if my analysis is correct, there is a race condition between the main thread invoking the atexit(3) handlers and secondary threads still running openssl functions. From what I've found POSIX doesn't specify any ordering here.

UPATE: For completeness sake here's the stacktrace for the main thread which is running atexit handlers in particular openssl's cleanup:

(lldb) thread backtrace all
  thread #1, queue = 'com.apple.main-thread'
    frame #0: 0x000000018a1eec94 libsystem_kernel.dylib`__psynch_rw_wrlock + 8
    frame #1: 0x000000018a228610 libsystem_pthread.dylib`_pthread_rwlock_lock_wait + 84
    frame #2: 0x000000018a22673c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720
    frame #3: 0x000000010164e08c _ssl.cpython-313d-darwin.so`CRYPTO_THREAD_write_lock(lock=<unavailable>) at threads_pthread.c:110:9 [opt]
    frame #4: 0x000000010163ffbc _ssl.cpython-313d-darwin.so`CRYPTO_free_ex_data [inlined] get_and_lock(global=0x00000001017db598, class_index=16) at ex_data.c:47:10 [opt]
    frame #5: 0x000000010163ffb0 _ssl.cpython-313d-darwin.so`CRYPTO_free_ex_data(class_index=16, obj=0x0000000000000000, ad=0x00000001017db588) at ex_data.c:378:10 [opt]
    frame #6: 0x000000010163c698 _ssl.cpython-313d-darwin.so`context_deinit(ctx=0x00000001017db580) at context.c:134:5 [opt]
    frame #7: 0x000000010163c624 _ssl.cpython-313d-darwin.so`ossl_lib_ctx_default_deinit at context.c:160:5 [opt]
    frame #8: 0x00000001016405b4 _ssl.cpython-313d-darwin.so`OPENSSL_cleanup at init.c:427:5 [opt]
    frame #9: 0x000000018a0e5fa8 libsystem_c.dylib`__cxa_finalize_ranges + 492
    frame #10: 0x000000018a0e5d20 libsystem_c.dylib`exit + 44
    frame #11: 0x000000018a242b80 libdyld.dylib`dyld4::LibSystemHelpers::exit(int) const + 20
    frame #12: 0x0000000189ea912c dyld`start + 2436
ronaldoussoren commented 8 months ago

Apparently OPENSSL_init_crypto has a flag OPENSSL_INIT_NO_ATEXIT to disable the atexit behaviour. AFAIK we don't call this function though.

A recent commit to openssl added a no-atexit configure option (https://github.com/openssl/openssl/commit/99fb31c167e322186c6f576cfaa8f433f4fed117), we could turn that option on in our macOS installer when we switch to a release that includes this option (that's not ideal though, this is not a macOS specific issue and also won't affect other installation (e.g. homebrew, anaconda).