Open sebastianas opened 1 year ago
I'll try to look at it.
@sebastianas could you please run the command in question under valgrind with and without this commit?
valgrind --leak-check=full --show-leak-kinds=all $cmd
thats....interesting
In the valgrind run without the commit, theres a series of leaks, which is quite odd, given that the commit in question increases the refcount of the private key in the pkcs11 token by 1 (above what it would have been without the commit), so I would have, if anything expected to see leaks with the commit rather than without
More interesting is the valgrind run with the commit
In this case there are a number of (effectively) use after free cases in the hsm module, triggered by the C++ runtime running an atexit handler which deletes the hsm singleton instance that holds all the data access the openssl module accesses through the engine api. Whats more odd is how the commit in question would have prevented this behavior, given that atexit handlers should always run in reverse order (i.e. the loading of the HSM module should always occur after OPENSSL_init where we register our atexit handler), so I'm not sure how this doesn't happen regardless of the inclusion status of the commit in question
@nhorman why does it increase the refcount? I call the _free function in the commit in question so the refcount shouldn't change
There are 2 issues here.
Issue 1, softhsm should really , really , really, use their own libctx,a nd stop stomping on the default context that should be reserved for the main application.
Issue 2 atexit() handler, which is often called before the engine tries to close all sessions, one could argue that openssl is wrong in installing an atexit() handler w/o being directed by the application which is the only entity that has a chance of knowin in which order atexit() handlers should be installed (as it influences the order in which they are run.
But really softhsm should handle this with fixing issue 1, then it can use its own context and not get messed up when it is requested to close sessions after openssl already destroyed the default library context.
Note that softhsm is not the only pkcs11 module that has these issues, so much so I had to add this PR to the pkcs11-provider which experienced the same problems as the engine early on: https://github.com/latchset/pkcs11-provider/pull/225
@beldmit I might be mistaken, but for every keytype in that patch, you call {keytype}_get1, followed by EVP_set1. Each of those calls ups the refcnt of the keytype. You do call keytype_free, but that only releases one of the two counts.
@simo5 that all makes sense, but I would be a bit concerned about the fact that the atexit handler in question is/may be part of the c++ runtime (see the free backtrace in the valgrind log). If that's the case, the c++ runtime is freeing global objects that the main application still has references too, and will use in the openssl atexit handler. Honestly I'm not sure how to fix that without abandoning the use of atexit entirely.
@simo5:
Issue 2 atexit() handler, which is often called before the engine tries to close all sessions, one could argue that openssl is wrong in installing an atexit() handler w/o being directed by the application which is the only entity that has a chance of knowin in which order atexit() handlers should be installed (as it influences the order in which they are run.
FYI: Please see (closed) issues #17059 and #17537 for more detailed analysis of issues around atexit()
.
@nhorman If softhsm is setting up an atexit() handler, then they absolutely need to fix that too (but I do not believe this is the case), they are invoked through at least 2/3 layers of indirection, it is not possible to safely use an atexit() handler in that case, and it is not needed in the first place, pkcs11 modules have a C_Finalize() function that dictates when they must and are safe to be unloaded. Doing it earlier is definitely wrong.
@rsbeckerca yeah I am painfully aware of those discussion, I thin openssl is wrong at setting up atexit(), applications need to unload it. Of course there is the issue that openssl is also invoked via very indirect library calls (libraries calling liabraries calling libraries calling into openssl), for those cases I think it is better to leak memory/handlers that to try to atexit() anyway. After all we have things like FD_CLOEXEC to avoid resource leaks on exec() and on a real exit all memory resources and file descriptors will be released anyway when the process terminates.
@rsbeckerca yeah I am painfully aware of those discussion, I thin openssl is wrong at setting up atexit(), applications need to unload it. Of course there is the issue that openssl is also invoked via very indirect library calls (libraries calling liabraries calling libraries calling into openssl), for those cases I think it is better to leak memory/handlers that to try to atexit() anyway. After all we have things like FD_CLOEXEC to avoid resource leaks on exec() and on a real exit all memory resources and file descriptors will be released anyway when the process terminates.
I tend to agree. Fortunately for my platform, memory leaks during atexit()
processing are pretty much irrelevant as once a process is gone, the global heap cleans itself up without fear; however, I have sympathy as one of our modules is moving to Windows (shhhh - its a secret that everyone now knows), and a new threading model that will be impacted, and that is going to make my life much more difficult.
People seem to run into this atexit handler regularly. The reason we do this by default is so that something like valgrind doesn't report issues.
We might have to rethink how we should do this. For instance we might need to keep a reference counter of how many times we've been initialized. But that's probably harder than it sounds.
Much harder than it sounds which is why it's not moving forward well.
From what (little) I can tell, I don't think that atexit handler is getting setup from any other component directly. Based on the stack trace, it looks like the c++ runtime may be using atexit to clean up global objects. Not saying it's right, just saying that I can't see any other way for the delete method of an object to get called directly from an atexit handler
Much harder than it sounds which is why it's not moving forward well.
The suggestion I have for this (used in other situations) is to maintain an opaque structure that the caller has to pass around for cleanup purposes instead of using atexit()
. The caller can then invoke a new variant of OPENSSL_Cleanup(context)
during their own atexit()
processing. OpenSSL methods can add to this context for things needing to be invoked at exit. Obviously this requires an API change, so out of context until post 3.2. Just my musings, but it gets around handing multiple heaps that messes with atexit()
.
Anyway let me debug tomorrow. If my patch is wrong and can't easily be fixed, then it will be better to roll it back. Otherwise if it is correct or can easily be fixed, the other problems should be dealt with.
People seem to run into this atexit handler regularly. The reason we do this by default is so that something like valgrind doesn't report issues. We might have to rethink how we should do this. For instance we might need to keep a reference counter of how many times we've been initialized. But that's probably harder than it sounds.
I do not think you can do that for the default handler, as initialization is implicit and simply does not happen multiple times just because different libraries from within the same process call into openssl, the first one will cause initialization and all the others wont.
You could keep track of all objects and defer de-initialization until all of them have been freed. But that would be functionally equivalent to not installing an atexit() handler if applications (or libraries calling into openssl) are misbehaved and do not clean up all the contexts they have allocated at exit.
From what (little) I can tell, I don't think that atexit handler is getting setup from any other component directly. Based on the stack trace, it looks like the c++ runtime may be using atexit to clean up global objects. Not saying it's right, just saying that I can't see any other way for the delete method of an object to get called directly from an atexit handler
It isn't atexit, there is usually special code in the main
wrapper that calls the dtor list.
@paulidale @kroeckx
However, there is one thing that can be done, and that is to ensure that on atexit() (edit: and on OPENSSL_cleanup() ) the correct ordering is used to free internal openssl data.
Providers must be the first* thing ever that is released, and nothing else should be released until providers are freed in the inverse order they have been loaded (in case any provider depends on other provider to perform their shutdown).
Of course the devil is in the details as various context may hold pointers to *data hold by providers ... so those context need to be freed before the providers are ... otherwise when the providers vtables ->free calls ae called things will again end up in tears ...
This would at least cover up the main issues with softhsm imploding when the pks11 provider calls C_CLoseSession or C_Finalize methods.
softhsm fails in most of these cases because the default openssl context is already half deallocated before the providers (or engines) call the pkcs11 drivers finalizers.
So there is at least an ordering bug in the openssl cleanup logic that may not be easy to fix without a lot of introspection on what all the objects themselves reference.
@paulidale @kroeckx
However, there is one thing that can be done, and that is to ensure that on atexit() (edit: and on OPENSSL_cleanup() ) the correct ordering is used to free internal openssl data.
Providers must be the first* thing ever that is released, and nothing else should be released until providers are freed in the inverse order they have been loaded (in case any provider depends on other provider to perform their shutdown).
Of course the devil is in the details as various context may hold pointers to *data hold by providers ... so those context need to be freed before the providers are ... otherwise when the providers vtables ->free calls ae called things will again end up in tears ...
This would at least cover up the main issues with softhsm imploding when the pks11 provider calls C_CLoseSession or C_Finalize methods.
softhsm fails in most of these cases because the default openssl context is already half deallocated before the providers (or engines) call the pkcs11 drivers finalizers.
So there is at least an ordering bug in the openssl cleanup logic that may not be easy to fix without a lot of introspection on what all the objects themselves reference.
I think you will find that different implementations make atexit()
order guarantees problematic. That was part of the issue we found in prior cases. There is no portable ordering guarantee.
@paulidale that doesn't quite mesh with what the valgrind log shows:
Engine "pkcs11" set.
==162888== Invalid read of size 4
==162888== at 0x53CE278: detectFork (SoftHSM.cpp:12445)
==162888== by 0x53CE278: SoftHSM::i() (SoftHSM.cpp:349)
==162888== by 0x53A2408: C_CloseAllSessions (main.cpp:347)
==162888== by 0x53779B9: pkcs11_slot_unref (p11_slot.c:433)
==162888== by 0x5377A4F: pkcs11_release_slot (p11_slot.c:477)
==162888== by 0x5377A4F: pkcs11_release_all_slots (p11_slot.c:464)
==162888== by 0x5370F4D: ctx_finish (eng_back.c:352)
==162888== by 0x536EE37: engine_finish (eng_front.c:163)
==162888== by 0x4AE1952: engine_unlocked_finish (eng_init.c:64)
==162888== by 0x4AE3F85: int_cleanup_cb_doall (eng_table.c:183)
==162888== by 0x4AE3F85: int_cleanup_cb_doall (eng_table.c:177)
==162888== by 0x4B2C10B: doall_util_fn (lhash.c:197)
==162888== by 0x4B2C10B: OPENSSL_LH_doall (lhash.c:205)
==162888== by 0x4AE4364: lh_ENGINE_PILE_doall (eng_local.h:159)
==162888== by 0x4AE4364: engine_table_cleanup (eng_table.c:192)
==162888== by 0x4AE1BD9: engine_cleanup_cb_free (eng_lib.c:169)
==162888== by 0x4BAE16F: OPENSSL_sk_pop_free (stack.c:426)
==162888== Address 0x4fedfc8 is 136 bytes inside a block of size 144 free'd
==162888== at 0x484399B: operator delete(void*, unsigned long) (vg_replace_malloc.c:935)
==162888== by 0x4DC3954: __run_exit_handlers (exit.c:111) <===== HERE
==162888== by 0x4DC3A89: exit (exit.c:141)
==162888== by 0x14A2A5: main (in /home/bigeasy/openssl/tb/openssl/build/apps/openssl)
==162888== Block was alloc'd at
==162888== at 0x4840F2F: operator new(unsigned long) (vg_replace_malloc.c:422)
Its certainly possible that the C++ runtime is doing something special in the main wrapper, but one way or another, its using the exit handler path to make it happen.
@rsbeckerca atexit should guarantee that registered function order is exactly the reverse of the registration order (ie.. calling atexit(a), atexit(b), atexit(c) implies that when exit is called, the registered functions should be called explicitly in c, b, a order.
That said, I'm struggling to see how this can ever work. I say that because OPENSSL_init_crypto should always be called prior to any dynamic engine being loaded (at least, I think it should). Given the OPENSSL_init_crypto is where the OPENSSL_cleanup handler is atexit registered, we should be guararnted that it will run after any subsequent atexit calls are made.
Assuming that softhsm registers an atexit handler (be it via explicit call, which I can't find), or via some wrapper magic in the c++ runtime, we should be guaranteed that the softhsm atexit handler will run prior to the openssl atexit handler. The implication here being that we should be guaranteed that the softhsm singleton object gets freed prior to the pcks11 library shutting down, causing the error the use after free error we are seeing.
I'm sure I'm missing something, but right now this seems very broken.
@rsbeckerca atexit should guarantee that registered function order is exactly the reverse of the registration order (ie.. calling atexit(a), atexit(b), atexit(c) implies that when exit is called, the registered functions should be called explicitly in c, b, a order.
That said, I'm struggling to see how this can ever work. I say that because OPENSSL_init_crypto should always be called prior to any dynamic engine being loaded (at least, I think it should). Given the OPENSSL_init_crypto is where the OPENSSL_cleanup handler is atexit registered, we should be guararnted that it will run after any subsequent atexit calls are made.
Assuming that softhsm registers an atexit handler (be it via explicit call, which I can't find), or via some wrapper magic in the c++ runtime, we should be guaranteed that the softhsm atexit handler will run prior to the openssl atexit handler. The implication here being that we should be guaranteed that the softhsm singleton object gets freed prior to the pcks11 library shutting down, causing the error the use after free error we are seeing.
I'm sure I'm missing something, but right now this seems very broken.
@nhorman "Should" is the operative word. In some exotic architectures (you know of whom I speak), the memory space of the DLL and main program is different, meaning that the lists maintained by atexit()
are in separate spaces for the DLL and the main program. This results in a fundamental inability to reverse-sequence what is added vs. processed. In one case (mine), when a DLL unloads due to dlclose()
, atexit()
inside the DLL's memory space is processed in the correct order from the point of view of the DLL. This is regardless of what the main program does. When the main program exits, its own atexit()
list is processed, which is also correct, from the point of view of the main program, but that list does not know about the DLL. The order is not guaranteed across the memory boundaries of the two (DLL vs. main). If the run-time was more robust (and the POSIX spec required an unambiguous compliance, which the OS dev team proved does not), this might not be an issue. However, there is actually no guarantee the DLL in an MPP architecture, is in the same chip-set as the main program, so coordination is difficult at best, impractical at worse.
@rsbeckerca that's the weird thing. POSIX does mandate an unambiguous execution order: https://pubs.opengroup.org/onlinepubs/007904875/functions/atexit.html
But in an ironic twist, it seems to me that, if and only if that required order is honored, will this breakage occur. I.e. if on a more esoteric platform that atexit order cannot be guaranteed, the execution of an applications atexit handler occurs prior to the softhsm atexit handler, this problem would not occur, as openssl would release any reference to the softhsm prior to its implied delete
Clearly I'm missing something here, need to look at this more closely
I think you will find that different implementations make
atexit()
order guarantees problematic. That was part of the issue we found in prior cases. There is no portable ordering guarantee.
I am talking about the order in which openssl frees its internal structures, I am pretty sure it is not correct, not about atexit ordering
the C_CloseAllSessions, or C_CLoseSession, or C_Finalize are all called byt the engine shutting down, the engine is shut down as part of openssl's atexit(), the problem is that when it calls the engine's shutdown it is already too late, some of the opnessl context has already been freed, and when softhsm calls into openssl as part of closing sessions (likely to free some objects it allocated, like EVP_PKEY's) it ends up double freeing or encountering null pointers or in general bad status in openssl context, which ends in tears (and segfaults).
I don't think that's what's going on. I agree that there is an ordering problem here in terms of what is getting freed when,.but if you look at the invalid reads in the valgrind log, and the segfault backtrace, both reference data that are class private to the softhsm, and don't appear to be directly referenced by openssl. I.e. an atexit handler called the delete method form the hsm class, then later,.the openssl atexit handler attempted.to tear down the engine that used the hsm, and.wound up trying to read/write data that had already been freed in the hsm class.object instance
I'm sure I'm missing something, but right now this seems very broken.
I think what you're missing is what happens to global destructors. The atexit
processing is part of the usual exit
library call which happens either by calling exit
or by returning from main
. After this the _exit
system call is made and the kernel starts cleaning things up. As part of this process the dynamic loader runs the .fini
code for the main program and each loaded shared library (all the elf format things) which iterates through the associated .fini_array
section and calls each function in turn. This is where C++ global classes are cleaned up. So these are cleansed on library unload well after atexit
processing has finished.
@nhorman said:
given that the commit in question increases the refcount of the private key in the pkcs11 token by 1 (above what it would have been without the commit), so I would have, if anything expected to see leaks with the commit rather than without
and
@beldmit I might be mistaken, but for every keytype in that patch, you call {keytype}_get1, followed by EVP_set1. Each of those calls ups the refcnt of the keytype. You do call keytype_free, but that only releases one of the two counts.
I'm not seeing an increase in refcnt.
The pattern for each legacy key type is:
RSA *rsa = EVP_PKEY_get1_RSA(pkey);
EVP_PKEY_set1_RSA(pkey, rsa);
RSA_free(rsa);
Lets assume the RSA object starts with a ref count of 1. After the EVP_PKEY_get1_RSA
call it now has a recnt of 2. The EVP_PKEY_set1_RSA
call does multiple things:
1) Up the ref count on the incoming RSA object. So the refcnt is now 3.
2) Frees any existing RSA object inside the EVP_PKEY (decrementing its refcnt). Since the existing RSA object is the same as the incoming RSA, this decrements the refcnt back to 2.
3) Replaces the old RSA object with the incoming RSA object (they are the same, so no change)
Finally RSA_free
is called which decrements the refcnt to 1.
So we start with a refcnt of 1, and end with a refcnt of 1.
Instead of the get1/set1 dance it would be possible to just make the detect_foreign_key() a non-static internal function and call it after e->load_privkey(). Unfortunately that also indicates that the problem is much deeper and it does not lie in @beldmit 's patch. Because that is inocuous and IMO right thing to do. The problem is that the atexit handling basically does not work when combined with any non-trivial engines and @beldmit 's patch just makes the problem triggered at more places.
@sebastianas I tend to agree. The patch reveals the problem but isn't the source of it.
I have tested with this snipped of code and think that the commit in question doesn't introduce a memory leak in simple cases.
#include <openssl/evp.h>
#include <openssl/rsa.h>
int main (void) {
EVP_PKEY *pkey = NULL;
RSA *rsa = NULL, *rsa0 = NULL;
pkey = EVP_PKEY_new();
rsa = RSA_new();
EVP_PKEY_set1_RSA(pkey, rsa);
/**/
rsa0 = EVP_PKEY_get1_RSA(pkey);
EVP_PKEY_set1_RSA(pkey, rsa0);
RSA_free(rsa0);
/**/
EVP_PKEY_free(pkey);
RSA_free(rsa);
return 0;
}
Comparing the valgrind logs, I have got an impression that there were problems with memory management in the pkcs11-engine (e.g. some free functions were not called) and now they became called.
@t8m yes, calling detect_foreign_key
was my intention but I didn't want to turn it into an internal function.
@t8m yes, calling
detect_foreign_key
was my intention but I didn't want to turn it into an internal function.
Why? I do not really see any disadvantage to doing that.
@beldmit I agree. After running your extracted test case, there is pretty clearly a refcount drop somewhere in EVP_PKEY_set1_RSA that I'm just not seeing, so thats all properly balanced, apologies for the noise.
That said, thats all just a side story to the larger issue of the atexit ordering issue.
To add some detail to that, what I've found is that C++ (which is what SoftHSM is written in), makes use of a unique_ptr to store a global singleton of the SoftHSM class). Thats well and good, but because the singleton is global data, C11 indicates that the objects destructor must be called either on return from main, or when exit is called. It would seem that the C runtime (at least on linux) implements this by either calling atexit or (in the case of glibc), the internal __cxa_atexit function, both of which appear to be emitted by the compiler.
So we have a situation in which the following order of events must occur when using openssl with pkcs11 and SoftHSM2 1) Openssl library initalizes (registering the OPENSSL_cleanup atexit() handler 2) The pkcs11 library is loaded 3) The SoftHSM library is loaded, instantiating a singleton SoftHSM object, which implicitly registers a callback to its own destructor via atexit/__csx_atexit
Because posix mandates atexit handlers are called in reverse registration order, the above order of registration seems to guarantee that the SoftHSM object will be freed prior to any openssl shutdown operations, which in turn will be guaranteed to make calls through the pkcs11 engine to tear down data it thinks is still available, but has already been freed.
I need to think about it more, but It seems the only reasonable way to fix this is to ensure that openssl teardown occurs prior to exit being called, so as to release any application level references to the SoftHSM data prior to its implicit destructor call from the exit handlers
But thats sort of flabbergasting, as the implication there is that the C++ runtime, in using atexit to call destructors has effectively said that any prior atexit registrations can't rely on data being valid when they are called, effectively making atexit completely unusable for anyone outside the C++ environment. Thats....awful.
@t8m yes, calling
detect_foreign_key
was my intention but I didn't want to turn it into an internal function.Why? I do not really see any disadvantage to doing that.
I didn't want to invoke it from engine-related code. Probably it's my fault.
Btw, does this issue deserve a regression label?
one thing we could try here, just to confirm what the valgrind logs are telling us, is this patch to SoftHSM2:
diff --git a/src/lib/SoftHSM.cpp b/src/lib/SoftHSM.cpp
index 02c0f95..b26d5dc 100644
--- a/src/lib/SoftHSM.cpp
+++ b/src/lib/SoftHSM.cpp
@@ -429,6 +429,9 @@ SoftHSM::SoftHSM()
// Destructor
SoftHSM::~SoftHSM()
{
+
+ if (isInitialised == true)
+ return;
if (handleManager != NULL) delete handleManager;
handleManager = NULL;
if (sessionManager != NULL) delete sessionManager;
That would let the destructor avoid the deletion of its member elements, allowing OPENSSL_cleanup to clean its references to them properly. It will likely result in a memory leak at exit within the SoftHSM library , but for the purposes of testing, thats likely fine. @sebastianas can you please give that a try?
If that avoids the problem we likely need to have a larger discussion about either rewriting how our own exit handling is done, or publishing some requirements on engine memory lifetimes.
@beldmit I'm not sure we can strictly call it a regression, as I think the problem has always been there, but pragmatically speaking, I'm sure users would see it as a regression, yes.
@sebastianas another option to try, if changing SoftHSM isn't feasible:
diff --git a/apps/req.c b/apps/req.c
index 3ce2b38496..96bb3d23ae 100644
--- a/apps/req.c
+++ b/apps/req.c
@@ -1045,6 +1045,7 @@ int req_main(int argc, char **argv)
OPENSSL_free(passin);
if (passout != nofree_passout)
OPENSSL_free(passout);
+ OPENSSL_cleanup();
return ret;
}
That should force openssl req to cleanup prior to returning from main/exiting, allowing us to drop references to the engine prior to C++ deleting anything
On 2023-10-26 06:32:54 [-0700], Neil Horman wrote:
@sebastianas can you please give that a try? Sure. OpenSSL back to 3.0.12, softshm got that patch. Still segfaults, different backtrace:
Engine "pkcs11" set.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000091 in ?? ()
(gdb) bt
#0 0x0000000000000091 in ?? ()
#1 0x00007ffff77cacf0 in SecureDataManager::~SecureDataManager (this=0x555555687ea0, __in_chrg=<optimized out>) at ./src/lib/data_mgr/SecureDataManager.cpp:96
#2 0x00007ffff77caebd in SecureDataManager::~SecureDataManager (this=0x555555687ea0, __in_chrg=<optimized out>) at ./src/lib/data_mgr/SecureDataManager.cpp:102
#3 0x00007ffff77e80c1 in Token::~Token (this=0x555555680a90, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/Token.cpp:69
#4 Token::~Token (this=0x555555680a90, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/Token.cpp:72
#5 0x00007ffff77e7b01 in Slot::~Slot (this=0x555555680a60, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/Slot.cpp:61
#6 Slot::~Slot (this=0x555555680a60, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/Slot.cpp:62
#7 0x00007ffff77e712f in SlotManager::~SlotManager (this=0x555555680130, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/SlotManager.cpp:93
#8 0x00007ffff77e71ed in SlotManager::~SlotManager (this=0x555555680130, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/SlotManager.cpp:95
#9 0x00007ffff77816b9 in SoftHSM::C_Finalize (this=0x55555565a9b0, ***@***.***=0x0) at ./src/lib/SoftHSM.cpp:594
#10 0x00007ffff7762ac8 in C_Finalize (pReserved=0x0) at ./src/lib/main.cpp:148
#11 0x00007ffff7f07b40 in pkcs11_CTX_unload (ctx=<optimized out>) at ./src/p11_load.c:147
#12 0x00007ffff7f0b1ca in PKCS11_CTX_unload (ctx=<optimized out>) at ./src/p11_front.c:53
#13 0x00007ffff7f03f6b in ctx_finish (ctx=0x5555556470e0) at ./src/eng_back.c:358
#14 0x00007ffff7f01e38 in engine_finish (engine=<optimized out>) at ./src/eng_front.c:163
#15 0x00007ffff7bbd9af in engine_unlocked_finish () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#16 0x00007ffff7bbfe42 in int_cleanup_cb_doall () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#17 0x00007ffff7c0ecd4 in OPENSSL_LH_doall () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#18 0x00007ffff7bc0281 in engine_table_cleanup () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#19 0x00007ffff7bbdbf6 in engine_cleanup_cb_free () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#20 0x00007ffff7c94dc3 in OPENSSL_sk_pop_free () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#21 0x00007ffff7bbdfb9 in engine_cleanup_int () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#22 0x00007ffff7c1323e in OPENSSL_cleanup () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#23 0x00007ffff785c955 in __run_exit_handlers (status=0, listp=0x7ffff79f1840 <__exit_funcs>, ***@***.***=true, ***@***.***=true)
at ./stdlib/exit.c:111
#24 0x00007ffff785ca8a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:141
#25 0x00005555555962a6 in main ()
While rebuilding softshm2, I noticed that Debian has other patches on top: https://sources.debian.org/src/softhsm2/2.6.1-2.1/debian/patches/
Looks like atext is problematic.
Sebastian
Yes, atexit
is problematic :(
On 2023-10-26 06:38:50 [-0700], Neil Horman wrote:
@sebastianas another option to try, if changing SoftHSM isn't feasible:
reverted SoftHSM to the unmodified version and tried. This resulted in:
Engine "pkcs11" set.
Program received signal SIGSEGV, Segmentation fault.
___pthread_rwlock_wrlock (rwlock=0x0) at ./nptl/pthread_rwlock_wrlock.c:26
26 ./nptl/pthread_rwlock_wrlock.c: No such file or directory.
(gdb) bt
#0 ___pthread_rwlock_wrlock (rwlock=0x0) at ./nptl/pthread_rwlock_wrlock.c:26
#1 0x00007ffff7c21459 in CRYPTO_THREAD_write_lock (lock=<optimized out>) at ../crypto/threads_pthread.c:110
#2 0x00007ffff7c98498 in ossl_store_unregister_loader_int ***@***.***=0x555555606a05 "org.openssl.engine") at ../crypto/store/store_register.c:260
#3 0x00007ffff7c985c5 in OSSL_STORE_unregister_loader ***@***.***=0x555555606a05 "org.openssl.engine") at ../crypto/store/store_register.c:276
#4 0x00005555555f4b90 in destroy_engine_loader () at ../apps/lib/engine_loader.c:188
#5 0x000055555559629e in apps_shutdown () at ../apps/openssl.c:93
#6 main (argc=<optimized out>, argv=<optimized out>) at ../apps/openssl.c:309
Sebastian
It's still an atexit
problem, I think.
I think they're distinct problems, And the first is definately an atexit problem, the second, I'm a bit less certain.
regarding the use of they SoftHSM patch I suggested:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000091 in ?? ()
(gdb) bt
#0 0x0000000000000091 in ?? ()
#1 0x00007ffff77cacf0 in SecureDataManager::~SecureDataManager (this=0x555555687ea0, __in_chrg=<optimized out>) at ./src/lib/data_mgr/SecureDataManager.cpp:96
#2 0x00007ffff77caebd in SecureDataManager::~SecureDataManager (this=0x555555687ea0, __in_chrg=<optimized out>) at ./src/lib/data_mgr/SecureDataManager.cpp:102
#3 0x00007ffff77e80c1 in Token::~Token (this=0x555555680a90, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/Token.cpp:69
#4 Token::~Token (this=0x555555680a90, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/Token.cpp:72
#5 0x00007ffff77e7b01 in Slot::~Slot (this=0x555555680a60, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/Slot.cpp:61
#6 Slot::~Slot (this=0x555555680a60, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/Slot.cpp:62
#7 0x00007ffff77e712f in SlotManager::~SlotManager (this=0x555555680130, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/SlotManager.cpp:93
#8 0x00007ffff77e71ed in SlotManager::~SlotManager (this=0x555555680130, __in_chrg=<optimized out>) at ./src/lib/slot_mgr/SlotManager.cpp:95
#9 0x00007ffff77816b9 in SoftHSM::C_Finalize (this=0x55555565a9b0, ***@***.***=0x0) at ./src/lib/SoftHSM.cpp:594
#10 0x00007ffff7762ac8 in C_Finalize (pReserved=0x0) at ./src/lib/main.cpp:148
#11 0x00007ffff7f07b40 in pkcs11_CTX_unload (ctx=<optimized out>) at ./src/p11_load.c:147
#12 0x00007ffff7f0b1ca in PKCS11_CTX_unload (ctx=<optimized out>) at ./src/p11_front.c:53
#13 0x00007ffff7f03f6b in ctx_finish (ctx=0x5555556470e0) at ./src/eng_back.c:358
#14 0x00007ffff7f01e38 in engine_finish (engine=<optimized out>) at ./src/eng_front.c:163
#15 0x00007ffff7bbd9af in engine_unlocked_finish () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#16 0x00007ffff7bbfe42 in int_cleanup_cb_doall () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#17 0x00007ffff7c0ecd4 in OPENSSL_LH_doall () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#18 0x00007ffff7bc0281 in engine_table_cleanup () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#19 0x00007ffff7bbdbf6 in engine_cleanup_cb_free () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#20 0x00007ffff7c94dc3 in OPENSSL_sk_pop_free () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#21 0x00007ffff7bbdfb9 in engine_cleanup_int () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#22 0x00007ffff7c1323e in OPENSSL_cleanup () from /home/bigeasy/openssl/tb/openssl/build/libcrypto.so.3
#23 0x00007ffff785c955 in __run_exit_handlers (status=0, listp=0x7ffff79f1840 <__exit_funcs>, ***@***.***=true, ***@***.***=true)
at ./stdlib/exit.c:111
#24 0x00007ffff785ca8a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:141
#25 0x00005555555962a6 in main ()
In comparing this to the initial reported crash It looks like we got farther (i.e. we didn't crash in getSlot), because the change prevented the deletion of the SlotManager object, but it crashed later when we tried to call the singleton SoftHSM instance recycleSymmetricAlgorithm, because the SoftHSM still deleted the singleton object itself. So thats not good, but it does give us some support for the general theory that the C++ runtime is just cleaning up stuff at exit while OpenSSL still holds references to it.
regarding the second crash after the patch was removed:
Engine "pkcs11" set.
Program received signal SIGSEGV, Segmentation fault.
___pthread_rwlock_wrlock (rwlock=0x0) at ./nptl/pthread_rwlock_wrlock.c:26
26 ./nptl/pthread_rwlock_wrlock.c: No such file or directory.
(gdb) bt
#0 ___pthread_rwlock_wrlock (rwlock=0x0) at ./nptl/pthread_rwlock_wrlock.c:26
#1 0x00007ffff7c21459 in CRYPTO_THREAD_write_lock (lock=<optimized out>) at ../crypto/threads_pthread.c:110
#2 0x00007ffff7c98498 in ossl_store_unregister_loader_int ***@***.***=0x555555606a05 "org.openssl.engine") at ../crypto/store/store_register.c:260
#3 0x00007ffff7c985c5 in OSSL_STORE_unregister_loader ***@***.***=0x555555606a05 "org.openssl.engine") at ../crypto/store/store_register.c:276
#4 0x00005555555f4b90 in destroy_engine_loader () at ../apps/lib/engine_loader.c:188
#5 0x000055555559629e in apps_shutdown () at ../apps/openssl.c:93
#6 main (argc=<optimized out>, argv=<optimized out>) at ../apps/openssl.c:309
This is interesting. The stack shows we are coming from apps_shutdown, which is called from the openssl main routine, prior to any atexit handlers being called. The fact that the lock being claimed was NULL here suggests that ossl_store_destroy_loaders_int was called prior to apps_shutdown being called, which would have set that lock to NULL in the OSSL_STORE code
@sebastianas did you by any chance experiment with the second patch I suggested? I ask because, looking at it more closely, this may well have been the result of applying it. OPENSSL_cleanup winds up calling ossl_store_destroy_loaders_int, and I didn't take that into consideration (apologies).
That patch should probably look something more like this:
diff --git a/apps/openssl.c b/apps/openssl.c
index adf77096c7..8b0372673f 100644
--- a/apps/openssl.c
+++ b/apps/openssl.c
@@ -317,6 +317,7 @@ int main(int argc, char *argv[])
BIO_free_all(bio_out);
apps_shutdown();
BIO_free_all(bio_err);
+ OPENSSL_cleanup();
EXIT(ret);
}
Where we cleanup immediately before calling exit, to avoid such issues. Its not a real fix by any stretch, but it would be good to experiment with, just to see if it avoids the problem
Need to think about what a more complete solution is here: 1) Mandating that applications call OPENSSL_init / OPENSSL_cleanup would be good, but it ignores use cases in which we have significant layers between an app and openssl (i.e. app ->links to -> libA -> links to -> openssl 2) Moving to the use of c99 constructors might be intersting, but I think we have lots of platforms that don't support c99, and it may not help, as dtors will still run after atexit handlers 3) An upcall api might be helpful here, in which an engine can inform openssl that it is going away, so we know not to touch it on cleanup would be good, not sure how feasible that is, and it would require alterations in external engines 4) something else that I haven't thought of yet.
@nhorman So do you basically suggest switching command-line openssl to init with NO_ATEXIT and explicit call to OPENSSL_cleanup
?
@beldmit I think yes, thats roughly what I'm suggesting here for the purpose of diagnosing this issue further, but I recognize thats not going to be a blanket solution either.
( this is a note to myself as much as anything), we may have a way out here, based on the fact that I never see a point in which we call engine_table_unregister on shutdown, which I think is leading to this problem. Though I need to look further before I can comment on this intellegently
If we have an intention to make OPENSSL_cleanup
mandatory in 4.0 (as of https://github.com/openssl/openssl/issues/17469#issuecomment-1780915514) so probably it's worth doing this (moving the command-line openssl utility to be inited without installing the atexit hadler and explicit call of OPENSSL_cleanup) in the current master
some additional notes: so the initiially reported segfault:
Engine "pkcs11" set.
Program received signal SIGSEGV, Segmentation fault.
SlotManager::getSlot (this=0x0, slotID=slotID@entry=1769803198) at ./src/lib/slot_mgr/SlotManager.cpp:174
174 ./src/lib/slot_mgr/SlotManager.cpp: No such file or directory.
(gdb) bt
#0 SlotManager::getSlot (this=0x0, slotID=slotID@entry=1769803198) at ./src/lib/slot_mgr/SlotManager.cpp:174
#1 0x00007ffff7783fd9 in SoftHSM::C_CloseAllSessions (this=0x5555556587d0, slotID=slotID@entry=1769803198) at ./src/lib/SoftHSM.cpp:1386
#2 0x00007ffff7764414 in C_CloseAllSessions (slotID=1769803198) at ./src/lib/main.cpp:347
#3 0x00007ffff7f099ba in pkcs11_slot_unref (slot=slot@entry=0x5555556ab730) at ./src/p11_slot.c:433
#4 0x00007ffff7f09a50 in pkcs11_release_slot (slot=0x5555556698d0) at ./src/p11_slot.c:477
#5 pkcs11_release_all_slots (slots=0x5555556698d0, nslots=<optimized out>) at ./src/p11_slot.c:464
#6 0x00007ffff7f0a358 in PKCS11_release_all_slots (pctx=<optimized out>, slots=<optimized out>, nslots=<optimized out>) at ./src/p11_front.c:111
#7 0x00007ffff7f02f4e in ctx_finish (ctx=0x5555556450e0) at ./src/eng_back.c:352
#8 0x00007ffff7f00e38 in engine_finish (engine=<optimized out>) at ./src/eng_front.c:163
#9 0x00007ffff7be3953 in engine_unlocked_finish (e=0x555555657b20, unlock_for_handlers=unlock_for_handlers@entry=0) at ../crypto/engine/eng_init.c:64
#10 0x00007ffff7be5f86 in int_cleanup_cb_doall (p=0x5555556586b0) at ../crypto/engine/eng_table.c:183
#11 int_cleanup_cb_doall (p=0x5555556586b0) at ../crypto/engine/eng_table.c:177
#12 0x00007ffff7c2e10c in doall_util_fn (arg=0x0, func_arg=0x0, func=0x7ffff7be5f60 <int_cleanup_cb_doall>, use_arg=0, lh=0x555555657230) at ../crypto/lhash/lhash.c:197
#13 OPENSSL_LH_doall (lh=0x555555657230, func=func@entry=0x7ffff7be5f60 <int_cleanup_cb_doall>) at ../crypto/lhash/lhash.c:205
#14 0x00007ffff7be6365 in lh_ENGINE_PILE_doall (doall=0x7ffff7be5f60 <int_cleanup_cb_doall>, lh=<optimized out>) at ../crypto/engine/eng_local.h:159
#15 engine_table_cleanup (table=0x7ffff7e836f8 <rsa_table>) at ../crypto/engine/eng_table.c:192
#16 0x00007ffff7be3bda in engine_cleanup_cb_free (item=0x555555658a00) at ../crypto/engine/eng_lib.c:169
#17 0x00007ffff7cb0170 in OPENSSL_sk_pop_free (st=0x555555658ba0, func=0x7ffff7be3bd0 <engine_cleanup_cb_free>) at ../crypto/stack/stack.c:426
#18 0x00007ffff7be3fcd in sk_ENGINE_CLEANUP_ITEM_pop_free (freefunc=0x7ffff7be3bd0 <engine_cleanup_cb_free>, sk=<optimized out>) at ../crypto/engine/eng_local.h:48
#19 engine_cleanup_int () at ../crypto/engine/eng_lib.c:176
#20 0x00007ffff7c329ae in OPENSSL_cleanup () at ../crypto/init.c:418
#21 OPENSSL_cleanup () at ../crypto/init.c:344
#22 0x00007ffff785c955 in __run_exit_handlers (status=0, listp=0x7ffff79f1840 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true, run_dtors=run_dtors@entry=true)
at ./stdlib/exit.c:111
#23 0x00007ffff785ca8a in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:141
#24 0x00005555555962e9 in ?? ()
#25 0x00007ffff78456ca in __libc_start_call_main (main=main@entry=0x5555555961b0, argc=argc@entry=15, argv=argv@entry=0x7fffffffeb78) at ../sysdeps/nptl/libc_start_call_main.h:58
#26 0x00007ffff7845785 in __libc_start_main_impl (main=0x5555555961b0, argc=15, argv=0x7fffffffeb78, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>,
stack_end=0x7fffffffeb68) at ../csu/libc-start.c:360
#27 0x0000555555596491 in ?? ()
Is showing us that on exit, when we call OPENSSL_cleanup, we call engine_cleanup_int, which pops all the elements off the ENGINE_CLEANUP_ITEM stack, freeing them in turn. This in turn iterates over the lh_ENGINE_PILE hash table, calling int_cleanup_cb_doall for each element, which in turn calls through engine_finish->ctx_finish[pkcs11]
The problem is that, while the libp11 data that is referenced by the ENGINE_PILE is still perecty valid, the underlying data that libp11 references in the SoftHSM library is not valid (due to the fact that the C++ runtime has already deleted those objects)
In other words, openssl can never guarantee the integrity of the data held by an engine on exit. This is further complicated by the fact that we're dealing with a 2-deep stack of dsos (libp11 -> SoftHSM), meaning openssl only has some visibility into the p11 dso data, and none in the SoftHSM dso, implying we have no real way to determine whats going on in the underlying library
I think the only real way to fix this is to mandate the OPENSSL_cleanup be called by an application prior to exit to ensure proper cleanup before any other components run their atexit handler
Thats a forward looking fix however. Given that this is occuring on a released version of openssl, and that we are considering this a regression, it seems we need a less invasive fix for these versions, that don't require modifications of external components.
The only way I see to handle that at the moment, is, for released versions, to modify openssl to do one of the following: 1) Just not call engine_unlocked_finish on shutdown. This would result in leaking memory (which is arguably ok as we are shutting down), but also may cause behavioral changes in an engine that relies on a call to engine_finish to clean itself up properly. We could perhaps mitigate this if we had some way to determine what pkcs11 modules libp11 was loading that we could filter on, only preforming this operation for "known troublesome" p11 backends.
2) Catch the SIGSEGV in engine_unlocked_finish using sigsetjmp to catch the error, and skip the cleanup if we encounter a fatal signal during the cleanup process. sigsetjmp was available in POSIX 2001 and is part of the c89 standard, so that should be safe, but I'm not sure if thats truly covered by all the platforms we support.
Given the options, I dislike (2) the least, but some comments would be appreciated.
Hi,
the testsuite for libp11-0.4.12 breaks in 3.0.12, works in 3.0.11. The testsuite: https://sources.debian.org/src/libp11/0.4.12-1/debian/tests/engine/ The segfault occurs while generating a certificate request doing
The segfault:
This can bisected to commit 02b87cc189fa8cae8d6f69d68449a9aecc0e34f0
Sebastian