saslauthd crashing in lmdb

A user has reported that saslauthd is crashing fairly regularly, and has provided a core file.

The stack from that core shows that we're dying with a SEGV at off set 0x50 in mdb_env_share_locks:

> ::status
debugging core file of saslauthd (64-bit) from mail
file: /opt/ooce/sbin/saslauthd
initial argv: /opt/ooce/sbin/saslauthd -a sasldb -c -m /var/run/saslauthd
threading model: native threads
status: process terminated by SIGSEGV (Segmentation Fault), addr=20

> $C
fffffc7feef30990 liblmdb.so`mdb_env_share_locks+0x50()
fffffc7feef30a00 liblmdb.so`mdb_env_open+0x306()
fffffc7feef30a90 do_open+0x23f()
fffffc7feef30b40 _sasldb_getdata+0x10d()
fffffc7feef31230 auth_sasldb+0xd1()
fffffc7feef314f0 do_auth+0x7d()
fffffc7feef31d80 do_request+0x2d2()
0000000000000000 libc.so.1`__door_return+0x50()

The SEGV is at address 0x20, and a low address like that usually indicates a NULL pointer dereference, where we're attempting to look at a member of a struct at that offset.

Unfortunately this binary wasn't compiled with all debugging features - it was actually built with clang which is a bit less useful from a debugging perspective than if it was built with the illumos-patched gcc, but let's see what we can get.

If we look at the disassembly of mdb_env_share_locks up to the address where we crashed:

> liblmdb.so`mdb_env_share_locks::dis
liblmdb.so`mdb_env_share_locks: pushq  %rbp
liblmdb.so`mdb_env_share_locks+1:       movq   %rsp,%rbp
liblmdb.so`mdb_env_share_locks+4:       pushq  %r15
liblmdb.so`mdb_env_share_locks+6:       pushq  %r14
liblmdb.so`mdb_env_share_locks+8:       pushq  %r13
liblmdb.so`mdb_env_share_locks+0xa:     pushq  %r12
liblmdb.so`mdb_env_share_locks+0xc:     pushq  %rbx
liblmdb.so`mdb_env_share_locks+0xd:     subq   $0x48,%rsp
liblmdb.so`mdb_env_share_locks+0x11:    movq   %rdi,%r14
liblmdb.so`mdb_env_share_locks+0x14:    movq   +0x12e63(%rip),%rax      <0xfffffc7fee0168b0>
liblmdb.so`mdb_env_share_locks+0x1b:    movq   (%rax),%rax
liblmdb.so`mdb_env_share_locks+0x1e:    movq   %rax,-0x30(%rbp)
liblmdb.so`mdb_env_share_locks+0x22:    movq   0x48(%rdi),%rax
liblmdb.so`mdb_env_share_locks+0x26:    movq   0x80(%rax),%rax
liblmdb.so`mdb_env_share_locks+0x2d:    movq   0x50(%rdi),%rcx
liblmdb.so`mdb_env_share_locks+0x31:    xorl   %edx,%edx
liblmdb.so`mdb_env_share_locks+0x33:    cmpq   0x80(%rcx),%rax
liblmdb.so`mdb_env_share_locks+0x3a:    movq   %rsi,%rbx
liblmdb.so`mdb_env_share_locks+0x3d:    setb   %dl
liblmdb.so`mdb_env_share_locks+0x40:    movq   0x48(%rdi,%rdx,8),%rax
liblmdb.so`mdb_env_share_locks+0x45:    movq   0x80(%rax),%rax
liblmdb.so`mdb_env_share_locks+0x4c:    movq   0x40(%rdi),%rcx
liblmdb.so`mdb_env_share_locks+0x50:    movq   %rax,0x20(%rcx)

We're looking 20 (hex) bytes into whatever is in %rcx, is that NULL?

> <rcx=J
                0

Yep, let's try and work out which bit of source corresponds to this. It's early in the function, which is nice:

/** Downgrade the exclusive lock on the region back to shared */
static int ESECT
mdb_env_share_locks(MDB_env *env, int *excl)                                    {                                                                                       int rc = 0;
        MDB_meta *meta = mdb_env_pick_meta(env);

        env->me_txns->mti_txnid = meta->mm_txnid;

That dereference of me_txns looks likely. Is that NULL? Even without more debugging information available, we have died early enough in the function that the first argument is still in %rdi.

> <rdi::print MDB_env
{
    me_fd = 0x6
    me_lfd = 0xffffffff
    me_mfd = 0xffffffff
    me_flags = 0x30024000
    me_psize = 0x1000
    me_os_psize = 0x1000
    me_maxreaders = 0x7e
    me_close_readers = 0
    me_numdbs = 0x2
    me_maxdbs = 0x2
    me_pid = 0x876
    me_path = 0x11d40c0 "/var/opt/ooce/sasl2/sasldb2"
    me_map = 0xfffffc7fee159000
    me_txns = 0
    me_metas = [ 0xfffffc7fee159010, 0xfffffc7fee15a010 ]
    me_pbuf = 0
    me_txn = 0
    me_txn0 = 0
    me_mapsize = 0x100000
    me_size = 0
    me_maxpg = 0x100
    me_dbxs = 0x11f6510
    me_dbflags = 0x11d3040
    me_dbiseqs = 0x11d3060
    me_txkey = 0x3
    me_pgoldest = 0
    me_pgstate = {
        mf_pghead = 0
        mf_pglast = 0
    }
    me_dpages = 0
    me_free_pgs = 0
    me_dirty_list = 0
    me_maxfree_1pg = 0x1fd
    me_nodemax = 0x7f6
    me_live_reader = 0
    me_userctx = 0
    me_assert_func = 0
}

and yes, me_txns is indeed NULL (and the rest of the data, particularly me_path, looks ok, so we're likely looking at the right address)

What I don't know is why, and how this could be an intermittent problem. Several other places in the code do check that me_txns is not NULL before de-referencing it, but not here. That could mean that it should never be NULL at this point and there is a bug elsewhere, or it could be a missing check - without a more complete understanding of the library it's hard to determine.

As for next steps here, I would suggest looking to see if there have been changes/fixes/bugs reported against lmdb in this area, if there is a new version that might have a fix, and otherwise reporting a bug against the lmdb project. It is possible to do more investigation here with dtrace and other tools to help build up a picture of how things work around this, but it's time consuming and can be a bit of a learning curve when you jump into unfamiliar code like this.

We'll get there if necessary, but let's see if there's an available fix first.

omniosorg / omnios-extra

saslauthd crashing in lmdb #1401