A user has reported that saslauthd is crashing fairly regularly, and has provided a core file.
The stack from that core shows that we're dying with a SEGV at offset 0x50 in mdb_env_share_locks:
> ::status
debugging core file of saslauthd (64-bit) from mail
file: /opt/ooce/sbin/saslauthd
initial argv: /opt/ooce/sbin/saslauthd -a sasldb -c -m /var/run/saslauthd
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=20
> $C
fffffc7feef30990 liblmdb.so`mdb_env_share_locks+0x50()
fffffc7feef30a00 liblmdb.so`mdb_env_open+0x306()
fffffc7feef30a90 do_open+0x23f()
fffffc7feef30b40 _sasldb_getdata+0x10d()
fffffc7feef31230 auth_sasldb+0xd1()
fffffc7feef314f0 do_auth+0x7d()
fffffc7feef31d80 do_request+0x2d2()
0000000000000000 libc.so.1`__door_return+0x50()
The SEGV is at address 0x20, and a low address like that usually indicates a NULL pointer dereference, where we're attempting to look at a member of a struct at that offset.
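To make the "low address means NULL plus a member offset" reasoning concrete, here is a minimal sketch in C. The struct is a hypothetical stand-in, laid out only so that one member lands at offset 0x20. It is not the real LMDB layout, and the names `shared_region` and `txnid_fault_addr` are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in struct; NOT the real LMDB layout. */
struct shared_region {
	uint64_t magic;		/* offset 0x00 */
	uint64_t format;	/* offset 0x08 */
	uint64_t lock[2];	/* offset 0x10 */
	uint64_t txnid;		/* offset 0x20 */
};

/* Check at compile time that the member really sits at 0x20. */
static_assert(offsetof(struct shared_region, txnid) == 0x20,
    "txnid is expected at offset 0x20 in this illustration");

/*
 * Address that would be touched when accessing `txnid` through a
 * struct pointer whose value is `base`. With base == 0 (NULL), the
 * result is just the member offset.
 */
static uintptr_t
txnid_fault_addr(uintptr_t base)
{
	return (base + offsetof(struct shared_region, txnid));
}
```

With a NULL base, `txnid_fault_addr(0)` evaluates to 0x20, which is the same shape as the `addr=20` reported by `::status`: a small fault address that is really a member offset added to a NULL pointer.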
Unfortunately, this binary wasn't compiled with full debugging information. It was built with clang, which is somewhat less useful from a debugging perspective than the illumos-patched gcc, but let's see what we can get.
If we look at the disassembly of mdb_env_share_locks up to the address where we crashed:
We're looking 0x20 bytes into whatever is in %rcx. Is that NULL?
> <rcx=J
0
Yep. Let's try to work out which bit of source corresponds to this. It's early in the function, which is nice:
/** Downgrade the exclusive lock on the region back to shared */
static int ESECT
mdb_env_share_locks(MDB_env *env, int *excl)
{
	int rc = 0;
MDB_meta *meta = mdb_env_pick_meta(env);
env->me_txns->mti_txnid = meta->mm_txnid;
That dereference of me_txns looks likely. Is that NULL? Even without more debugging information available, we have died early enough in the function that the first argument is still in %rdi.
And yes, me_txns is indeed NULL. The rest of the data, particularly me_path, looks OK, so we're likely looking at the right address.
What I don't know is why, or how this could be an intermittent problem. Several other places in the code do check that me_txns is not NULL before dereferencing it, but not here. That could mean that it should never be NULL at this point and there is a bug elsewhere, or it could be a missing check. Without a more complete understanding of the library, it's hard to determine which.
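Purely to illustrate what "a missing check" would look like, here is a sketch of a guarded version of the failing assignment. The types are simplified, hypothetical stand-ins that reuse only the member names from the excerpt above. The function name `share_locks_guarded` and the choice of EINVAL are assumptions for illustration, not LMDB's API or a proposed patch. Whether returning an error is even the right behaviour depends on whether me_txns is ever allowed to be NULL at this point, which is exactly the open question.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Simplified, hypothetical stand-ins; NOT the real LMDB definitions. */
typedef struct { unsigned long mti_txnid; } txninfo_t;
typedef struct { unsigned long mm_txnid; } meta_t;
typedef struct { txninfo_t *me_txns; } env_t;

/*
 * Copy the transaction id into the shared region only if the region
 * pointer is set. Returns 0 on success, or EINVAL (an arbitrary choice
 * for this sketch) if me_txns is NULL.
 */
static int
share_locks_guarded(env_t *env, const meta_t *meta)
{
	assert(env != NULL && meta != NULL);

	if (env->me_txns == NULL)
		return (EINVAL);

	env->me_txns->mti_txnid = meta->mm_txnid;
	return (0);
}
```

A check like this would turn the crash into an error return, but it would also hide the underlying problem if me_txns being NULL here is itself the bug.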
As for next steps, I would suggest checking whether any changes, fixes or bugs have been reported against lmdb in this area, and whether there is a newer version that might contain a fix. Failing that, report a bug against the lmdb project. It is possible to investigate further with dtrace and other tools to build up a picture of how things work around this, but it's time-consuming, and jumping into unfamiliar code like this can be a bit of a learning curve.
We'll get there if necessary, but let's see if there's an available fix first.