Open orf opened 8 months ago
I got it to reliably crash again, and it seems that it's related to an exception being thrown in one thread whilst many other threads are inside _ssl
.
Upping the number of open files prevents the initial exception from happening, but whenever an exception is thrown python will quit with a SIGSEGV
.
Here is a crash due to a ValueError
being thrown in the main thread, triggered when trying to serialize an invalid object using json.dumps
:
Is it possible to reproduce without using third-party libraries?
I can try but I’m not sure if I will be able to and I don’t have a gigantic amount of time I can spend digging into this.
Looking at the thread traces, there don’t appear to be any stack traces that are in 3rd party native modules - it’s all in CPython or ssl internals. But the issue may be triggered by how boto3 implements connection pooling, which is something that I’m not familiar with. I’ll give it a go.
If I’m able to reproduce this again, is there any more information (outside of a reproducible test case) that would help with this?
I tried enabling faulthandler, but it didn’t output anything which makes me think it’s something that is erroring after Python has shut down.
Can this be reproduced without tqdm? Can this be reproduced with smaller number of threads or tasks? How many tasks were finished at the moment of the crash, how many tasks were started, and how much data the unfinished tasks read?
I've managed to create a reproduction case:
from multiprocessing.pool import ThreadPool
import ssl
import certifi
ctx = ssl.create_default_context()
location = certifi.where()
def test_func(_):
ctx.load_verify_locations(location)
with ThreadPool(processes=100) as pool:
for idx, item in enumerate(pool.imap_unordered(test_func, range(1000), chunksize=1)):
if idx == 100:
raise RuntimeError()
On MacOS, this reliably causes it to trigger:
Traceback (most recent call last):
File "/Users/tomforbes/tmp/scanner/repo.py", line 16, in <module>
raise RuntimeError()
RuntimeError
fish: Job 1, 'python repo.py' terminated by signal SIGSEGV (Address boundary error)
if you up the threads to something silly like 10k, it errors immediately:
Traceback (most recent call last):
File "/Users/tomforbes/tmp/scanner/repo.py", line 14, in <module>
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 873, in next
File "/Library/Frameworks/Python.framework/Versions/3.11/lib/python3.11/multiprocessing/pool.py", line 125, in worker
File "/Users/tomforbes/tmp/scanner/repo.py", line 10, in test_func
OSError: [Errno 24] Too many open files
fish: Job 1, 'python repo.py' terminated by signal SIGSEGV (Address boundary error)
Using exit()
causes the error, but os._exit(0)
doesn't. Judging by this issue, it could be because OpenSSL is being shutdown/cleaned up whilst threads are waiting on a lock? Or something?
I get the following stack trace when testing with a recentish checkout of main and openssl 3.0.12. OpenSSL was build without debug information, I'll have to rebuild it with debug information to get more information.
* thread #15, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
frame #0: 0x000000018a2268a0 libsystem_pthread.dylib`pthread_rwlock_unlock
frame #1: 0x0000000101ae605c libcrypto.3.dylib`CRYPTO_THREAD_unlock + 12
frame #2: 0x0000000101a96058 libcrypto.3.dylib`ENGINE_pkey_asn1_find_str + 224
frame #3: 0x00000001019d0170 libcrypto.3.dylib`EVP_PKEY_asn1_find_str + 116
frame #4: 0x0000000101ac4648 libcrypto.3.dylib`find_ameth + 64
frame #5: 0x0000000101ad5c28 libcrypto.3.dylib`ossl_namemap_doall_names + 156
frame #6: 0x0000000101ac1904 libcrypto.3.dylib`EVP_PKEY_set_type_by_keymgmt + 68
frame #7: 0x0000000101abb830 libcrypto.3.dylib`evp_keymgmt_util_make_pkey + 60
frame #8: 0x0000000101a8f078 libcrypto.3.dylib`decoder_construct_pkey + 440
frame #9: 0x0000000101a8c878 libcrypto.3.dylib`decoder_process + 192
* frame #10: 0x0000000101b81c0c libcrypto.3.dylib`der2key_decode + 672
Stack trace of the crashing thread with openssl debug info, with open 3.0.13:
* thread #28, stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
* frame #0: 0x000000018a2268a0 libsystem_pthread.dylib`pthread_rwlock_unlock
frame #1: 0x0000000101fb20a8 _ssl.cpython-313d-darwin.so`CRYPTO_THREAD_unlock(lock=<unavailable>) at threads_pthread.c:125:9 [opt]
frame #2: 0x0000000101f64ff4 _ssl.cpython-313d-darwin.so`ENGINE_pkey_asn1_find_str(pe=0x000000028b07dd38, str="2.5.8.1.1", len=9) at tb_asnmth.c:212:5 [opt]
frame #3: 0x0000000101ec6670 _ssl.cpython-313d-darwin.so`EVP_PKEY_asn1_find_str(pe=0x000000028b07dd98, str="2.5.8.1.1", len=9) at ameth_lib.c:117:17 [opt]
frame #4: 0x0000000101f90de0 _ssl.cpython-313d-darwin.so`find_ameth [inlined] pkey_set_type(pkey=0x0000000000000000, e=0x0000000000000000, type=0, str="2.5.8.1.1", len=<unavailable>, keymgmt=0x0000000000000000) at p_lib.c:1539:17 [opt]
frame #5: 0x0000000101f90dd4 _ssl.cpython-313d-darwin.so`find_ameth(name="2.5.8.1.1", data=0x000000028b07de28) at p_lib.c:1618:9 [opt]
frame #6: 0x0000000101fa1ae8 _ssl.cpython-313d-darwin.so`ossl_namemap_doall_names(namemap=<unavailable>, number=42697824, fn=(_ssl.cpython-313d-darwin.so`find_ameth at p_lib.c:1609), data=0x000000028b07de28) at core_namemap.c:171:9 [opt]
frame #7: 0x0000000101f80e78 _ssl.cpython-313d-darwin.so`evp_names_do_all(prov=<unavailable>, number=<unavailable>, fn=<unavailable>, data=<unavailable>) at evp_fetch.c:702:12 [opt] [artificial]
frame #8: 0x0000000101f893b4 _ssl.cpython-313d-darwin.so`EVP_KEYMGMT_names_do_all(keymgmt=<unavailable>, fn=<unavailable>, data=<unavailable>) at keymgmt_meth.c:302:16 [opt] [artificial]
frame #9: 0x0000000101f8e09c _ssl.cpython-313d-darwin.so`EVP_PKEY_set_type_by_keymgmt(pkey=0x000060000607eee0, keymgmt=0x00006000028b8460) at p_lib.c:1642:10 [opt]
frame #10: 0x0000000101f8817c _ssl.cpython-313d-darwin.so`evp_keymgmt_util_make_pkey [inlined] evp_keymgmt_util_assign_pkey(pkey=0x000060000607eee0, keymgmt=0x00006000028b8460, keydata=0x0000600006885960) at keymgmt_lib.c:64:13 [opt]
frame #11: 0x0000000101f88170 _ssl.cpython-313d-darwin.so`evp_keymgmt_util_make_pkey(keymgmt=0x00006000028b8460, keydata=0x0000600006885960) at keymgmt_lib.c:80:13 [opt]
frame #12: 0x0000000101f5e050 _ssl.cpython-313d-darwin.so`decoder_construct_pkey(decoder_inst=<unavailable>, params=<unavailable>, construct_data=0x0000600001ccc120) at decoder_pkey.c:172:24 [opt]
frame #13: 0x0000000101f5b850 _ssl.cpython-313d-darwin.so`decoder_process(params=0x000000028b07e018, arg=0x000000028b07e118) at decoder_lib.c:738:18 [opt]
frame #14: 0x0000000102040f6c _ssl.cpython-313d-darwin.so`der2key_decode(vctx=0x00006000050d52a0, cin=<unavailable>, selection=<unavailable>, data_cb=(_ssl.cpython-313d-darwin.so`decoder_process at decoder_lib.c:676), data_cbarg=0x000000028b07e118, pw_cb=<unavailable>, pw_cbarg=<unavailable>) at decode_der2key.c:299:14 [opt]
frame #15: 0x0000000101f5baa4 _ssl.cpython-313d-darwin.so`decoder_process(params=<unavailable>, arg=0x000000028b07e358) at decoder_lib.c:963:14 [opt]
frame #16: 0x0000000102043334 _ssl.cpython-313d-darwin.so`spki2typespki_decode(vctx=<unavailable>, cin=<unavailable>, selection=<unavailable>, data_cb=(_ssl.cpython-313d-darwin.so`decoder_process at decoder_lib.c:676), data_cbarg=0x000000028b07e358, pw_cb=<unavailable>, pw_cbarg=<unavailable>) at decode_spki2typespki.c:111:10 [opt]
frame #17: 0x0000000101f5baa4 _ssl.cpython-313d-darwin.so`decoder_process(params=<unavailable>, arg=0x000000028b07e418) at decoder_lib.c:963:14 [opt]
frame #18: 0x0000000101f5b680 _ssl.cpython-313d-darwin.so`OSSL_DECODER_from_bio(ctx=0x0000600007a3b5d0, in=0x00006000022a9200) at decoder_lib.c:81:10 [opt]
frame #19: 0x0000000101f5bc68 _ssl.cpython-313d-darwin.so`OSSL_DECODER_from_data(ctx=0x0000600007a3b5d0, pdata=0x000000028b07e4d8, pdata_len=0x000000028b07e4d0) at decoder_lib.c:156:9 [opt]
frame #20: 0x000000010202cd78 _ssl.cpython-313d-darwin.so`x509_pubkey_ex_d2i_ex(pval=<unavailable>, in=<unavailable>, len=362, it=<unavailable>, tag=-1, aclass=<unavailable>, opt='\0', ctx=<unavailable>, libctx=0x0000000000000000, propq=0x0000000000000000) at x_pubkey.c:217:17 [opt]
frame #21: 0x0000000101ece74c _ssl.cpython-313d-darwin.so`asn1_template_noexp_d2i(val=0x000000011c6f3020, in=0x000000028b07e680, len=<unavailable>, tt=0x0000000102115de8, opt='\0', ctx=0x000000028b07e880, depth=2, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:682:15 [opt]
frame #22: 0x0000000101ecdaf4 _ssl.cpython-313d-darwin.so`asn1_template_ex_d2i(val=<unavailable>, in=<unavailable>, inlen=<unavailable>, tt=<unavailable>, opt=<unavailable>, ctx=<unavailable>, depth=<unavailable>, libctx=<unavailable>, propq=<no value available>) at tasn_dec.c:558:16 [opt] [artificial]
frame #23: 0x0000000101ecd5bc _ssl.cpython-313d-darwin.so`asn1_item_embed_d2i(pval=0x000000028b07e730, in=0x000000028b07e728, len=<unavailable>, it=0x0000000102115cc0, tag=<unavailable>, aclass=<unavailable>, opt=<unavailable>, ctx=0x000000028b07e880, depth=2, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:422:19 [opt]
frame #24: 0x0000000101ece74c _ssl.cpython-313d-darwin.so`asn1_template_noexp_d2i(val=0x000000028b07e730, in=0x000000028b07e7f0, len=<unavailable>, tt=0x0000000102115ec0, opt='\0', ctx=0x000000028b07e880, depth=1, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:682:15 [opt]
frame #25: 0x0000000101ecdaf4 _ssl.cpython-313d-darwin.so`asn1_template_ex_d2i(val=<unavailable>, in=<unavailable>, inlen=<unavailable>, tt=<unavailable>, opt=<unavailable>, ctx=<unavailable>, depth=<unavailable>, libctx=<unavailable>, propq=<no value available>) at tasn_dec.c:558:16 [opt] [artificial]
frame #26: 0x0000000101ecd5bc _ssl.cpython-313d-darwin.so`asn1_item_embed_d2i(pval=0x0000600000baa780, in=0x000000028b07e8f0, len=<unavailable>, it=0x0000000102115e88, tag=<unavailable>, aclass=<unavailable>, opt=<unavailable>, ctx=0x000000028b07e880, depth=1, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:422:19 [opt]
frame #27: 0x0000000101eccf6c _ssl.cpython-313d-darwin.so`ASN1_item_d2i [inlined] asn1_item_ex_d2i_intern(pval=0x0000600000baa780, in=<unavailable>, len=<unavailable>, it=0x0000000102115e88, tag=-1, aclass=0, opt='\0', ctx=0x000000028b07e880, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:118:10 [opt]
frame #28: 0x0000000101eccf44 _ssl.cpython-313d-darwin.so`ASN1_item_d2i [inlined] ASN1_item_d2i_ex(pval=0x0000600000baa780, in=<unavailable>, len=<unavailable>, it=0x0000000102115e88, libctx=0x0000000000000000, propq=0x0000000000000000) at tasn_dec.c:144:9 [opt]
frame #29: 0x0000000101eccf30 _ssl.cpython-313d-darwin.so`ASN1_item_d2i(pval=0x0000600000baa780, in=<unavailable>, len=<unavailable>, it=0x0000000102115e88) at tasn_dec.c:154:12 [opt]
frame #30: 0x000000010202d1bc _ssl.cpython-313d-darwin.so`d2i_X509(a=<unavailable>, in=<unavailable>, len=<unavailable>) at x_x509.c:138:1 [opt] [artificial]
frame #31: 0x0000000101fc3888 _ssl.cpython-313d-darwin.so`PEM_X509_INFO_read_bio_ex(bp=0x00006000022b2eb0, sk=0x0000000000000000, cb=<unavailable>, u=<unavailable>, libctx=0x0000000000000000, propq=<unavailable>) at pem_info.c:168:28 [opt]
frame #32: 0x000000010200237c _ssl.cpython-313d-darwin.so`X509_load_cert_crl_file_ex(ctx=0x00006000011a6da0, file=<unavailable>, type=1, libctx=0x0000000000000000, propq=0x0000000000000000) at by_file.c:231:11 [opt]
frame #33: 0x00000001020025b0 _ssl.cpython-313d-darwin.so`by_file_ctrl_ex(ctx=0x00006000011a6da0, cmd=<unavailable>, argp=<unavailable>, argl=<unavailable>, ret=<unavailable>, libctx=0x0000000000000000, propq=0x0000000000000000) at by_file.c:72:23 [opt]
frame #34: 0x000000010201d088 _ssl.cpython-313d-darwin.so`X509_STORE_load_file_ex(ctx=0x00006000020bc1e0, file="/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/certifi/cacert.pem", libctx=0x0000000000000000, propq=0x0000000000000000) at x509_d2.c:52:12 [opt]
frame #35: 0x0000000101e81b74 _ssl.cpython-313d-darwin.so`SSL_CTX_load_verify_locations [inlined] SSL_CTX_load_verify_file(ctx=0x000000011d0a9600, CAfile=<unavailable>) at ssl_lib.c:4444:12 [opt]
frame #36: 0x0000000101e81b64 _ssl.cpython-313d-darwin.so`SSL_CTX_load_verify_locations(ctx=0x000000011d0a9600, CAfile=<unavailable>, CApath=0x0000000000000000) at ssl_lib.c:4464:28 [opt]
frame #37: 0x0000000101e652bc _ssl.cpython-313d-darwin.so`_ssl__SSLContext_load_verify_locations at _ssl.c:4173:13 [opt]
frame #38: 0x0000000101e6511c _ssl.cpython-313d-darwin.so`_ssl__SSLContext_load_verify_locations(self=0x00000001018e5030, args=<unavailable>, nargs=<unavailable>, kwnames=<unavailable>) at _ssl.c.h:679:20 [opt]
frame #39: 0x000000010009f478 python.exe`method_vectorcall_FASTCALL_KEYWORDS(func=0x00000001018f0110, args=<unavailable>, nargsf=<unavailable>, kwnames=0x0000000000000000) at descrobject.c:420:24 [opt]
frame #40: 0x0000000100091ca0 python.exe`_PyObject_VectorcallTstate(tstate=0x000000010c611b00, callable=0x00000001018f0110, args=0x000000010193c2f0, nargsf=9223372036854775810, kwnames=0x0000000000000000) at pycore_call.h:168:11 [opt]
frame #41: 0x0000000100092a58 python.exe`PyObject_Vectorcall(callable=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=<unavailable>) at call.c:327:12 [opt] [artificial]
frame #42: 0x00000001001b88fc python.exe`_PyEval_EvalFrameDefault(tstate=<unavailable>, frame=<unavailable>, throwflag=0) at generated_cases.c.h:815:23 [opt]
frame #43: 0x0000000100095fc4 python.exe`_PyObject_VectorcallTstate(tstate=0x000000010c611b00, callable=0x0000000100af0110, args=0x000000028b07eee8, nargsf=1, kwnames=0x0000000000000000) at pycore_call.h:168:11 [opt]
frame #44: 0x0000000100095164 python.exe`method_vectorcall(method=<unavailable>, args=<unavailable>, nargsf=<unavailable>, kwnames=0x0000000000000000) at classobject.c:70:20 [opt]
frame #45: 0x00000001002cc8dc python.exe`thread_run(boot_raw=0x00006000004a8790) at _threadmodule.c:1207:21 [opt]
frame #46: 0x0000000100261324 python.exe`pythread_wrapper(arg=<unavailable>) at thread_pthread.h:236:5 [opt]
frame #47: 0x000000018a22a034 libsystem_pthread.dylib`_pthread_start + 136
This tries to unlock a NULL lock, and according to the source for line two this is the global engine lock:
CRYPTO_THREAD_unlock(global_engine_lock);
From quickly scanning the openssl code base this lock is cleared from OPENSSL_cleanup
which itself is called from an atexit(3) handler.
This seems to be caused by a running thread that is still active when the process is already shutting down, which happens here because the thread pool from https://github.com/python/cpython/issues/114653#issuecomment-1913646676 uses daemon threads and AFAIK Py_Finalize() won't wait for daemon threads to exit. And indeed, if I replace the 'raise RuntimeError' statement by 'pass' the script works without problems.
I'm not sure if we can fix this if my analysis is correct, there is a race condition between the main thread invoking the atexit(3) handlers and secondary threads still running openssl functions. From what I've found POSIX doesn't specify any ordering here.
UPATE: For completeness sake here's the stacktrace for the main thread which is running atexit handlers in particular openssl's cleanup:
(lldb) thread backtrace all
thread #1, queue = 'com.apple.main-thread'
frame #0: 0x000000018a1eec94 libsystem_kernel.dylib`__psynch_rw_wrlock + 8
frame #1: 0x000000018a228610 libsystem_pthread.dylib`_pthread_rwlock_lock_wait + 84
frame #2: 0x000000018a22673c libsystem_pthread.dylib`_pthread_rwlock_lock_slow + 720
frame #3: 0x000000010164e08c _ssl.cpython-313d-darwin.so`CRYPTO_THREAD_write_lock(lock=<unavailable>) at threads_pthread.c:110:9 [opt]
frame #4: 0x000000010163ffbc _ssl.cpython-313d-darwin.so`CRYPTO_free_ex_data [inlined] get_and_lock(global=0x00000001017db598, class_index=16) at ex_data.c:47:10 [opt]
frame #5: 0x000000010163ffb0 _ssl.cpython-313d-darwin.so`CRYPTO_free_ex_data(class_index=16, obj=0x0000000000000000, ad=0x00000001017db588) at ex_data.c:378:10 [opt]
frame #6: 0x000000010163c698 _ssl.cpython-313d-darwin.so`context_deinit(ctx=0x00000001017db580) at context.c:134:5 [opt]
frame #7: 0x000000010163c624 _ssl.cpython-313d-darwin.so`ossl_lib_ctx_default_deinit at context.c:160:5 [opt]
frame #8: 0x00000001016405b4 _ssl.cpython-313d-darwin.so`OPENSSL_cleanup at init.c:427:5 [opt]
frame #9: 0x000000018a0e5fa8 libsystem_c.dylib`__cxa_finalize_ranges + 492
frame #10: 0x000000018a0e5d20 libsystem_c.dylib`exit + 44
frame #11: 0x000000018a242b80 libdyld.dylib`dyld4::LibSystemHelpers::exit(int) const + 20
frame #12: 0x0000000189ea912c dyld`start + 2436
Apparently OPENSSL_init_crypto
has a flag OPENSSL_INIT_NO_ATEXIT
to disable the atexit
behaviour. AFAIK we don't call this function though.
A recent commit to openssl added a no-atexit configure option (https://github.com/openssl/openssl/commit/99fb31c167e322186c6f576cfaa8f433f4fed117), we could turn that option on in our macOS installer when we switch to a release that includes this option (that's not ideal though, this is not a macOS specific issue and also won't affect other installation (e.g. homebrew, anaconda).
Crash report
What happened?
On MacOS, calling
ssl.SSLContext.load_verify_locations
from multiple threads causes aSIGSEGV
during process termination when an unhandled exception is raised.I was able to create a repoduction case and I've attached the five MacOS crash reports here.
Reproduction
MacOS crash reports:
five.txt four.txt one.txt three.txt two.txt
CPython versions tested on:
3.11
Operating systems tested on:
macOS
Output from running 'python -VV' on the command line:
Python 3.11.7 (v3.11.7:fa7a6f2303, Dec 4 2023, 15:22:56) [Clang 13.0.0 (clang-1300.0.29.30)]