jjhoughton / napi-ldap


SIGABRT after socket is disconnected/reconnected #10

Closed: cout closed this issue 1 year ago

cout commented 3 years ago

As noted in 457c44cf, a crash can result if uv_poll_stop is called more than once on the same handle. That commit added a workaround for the problem, but I do not believe it is sufficient. We tested napi-ldap 1.0.2 (which afaict should have the workaround) with Node.js 14.17.0.

Steps to reproduce:

As my colleague @pfree pointed out, setting the handle to NULL avoids the crash in node-LDAP, though we realize now that the original patch leaked the handle. The following patch (not extensively tested) avoids the crash in napi-ldap as well:

@@ -1247,7 +1247,11 @@ on_disconnect (LDAP * ld, Sockbuf * sb, struct ldap_conncb *ctx)
     return;

   if (ldap_cnx->handle)
-    uv_poll_stop (ldap_cnx->handle);
+    {
+      uv_poll_stop (ldap_cnx->handle);
+      free(ldap_cnx->handle);
+      ldap_cnx->handle = NULL;
+    }

   status = napi_open_handle_scope (env, &scope);
   assert (status == napi_ok);

I do not know why checking the result of uv_loop_alive was not enough by itself to prevent the crash; it seems like it should be. In any event, since the handle is not NULL when on_connect is called, uv_poll_init is never called to bind the handle to the new file descriptor. In the instances where the above check did not cause node to crash, I believe this explains why napi-ldap stopped handling incoming messages.
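
For reference, the connect path in question looks roughly like this (a sketch of the flow being described, with identifiers assumed rather than copied from cnx.c):

#include <stdlib.h>
#include <uv.h>

/* Sketch only: minimal stand-ins for the real definitions in cnx.c. */
struct ldap_cnx { uv_poll_t *handle; };

static void
on_io (uv_poll_t * handle, int status, int events)
{
  /* read from the LDAP socket... */
}

static void
connect_sketch (uv_loop_t * loop, struct ldap_cnx *c, int fd)
{
  if (!c->handle)
    {
      /* First connection: allocate the poll handle and bind it to fd. */
      c->handle = malloc (sizeof (uv_poll_t));
      uv_poll_init (loop, c->handle, fd);
    }
  else
    {
      /* Reconnect: uv_poll_init is skipped, so the handle stays bound
         to the old, already-closed file descriptor. */
      uv_poll_stop (c->handle);
    }

  uv_poll_start (c->handle, UV_READABLE, on_io);
}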

One potential issue with the above patch: I suspect setting the handle to NULL would interfere with the existing check for on_disconnect being called twice (I do not observe that behavior; in my test, on_disconnect is called exactly once).

Perhaps disconnect is getting called twice due to this same bug putting libuv into an inconsistent state? If so, that check may be unnecessary.

I am also questioning the call to uv_poll_stop in on_connect -- other than the first connection, is there ever a case where on_connect would be called without a call to on_disconnect?

jjhoughton commented 3 years ago

Thanks @cout for the excellent writeup; apologies for not getting back to you sooner, I must have missed the email notification. I will try your steps to reproduce. I'm a little reluctant to blindly apply your patch just yet, as I don't currently understand why this is still an issue. I'll be in touch soon. As to your question:

> I am also questioning the call to uv_poll_stop in on_connect -- other than the first connection, is there ever a case where on_connect would be called without a call to on_disconnect?

This is only there because it was present in Jeremy Childs' original node-LDAP code. I don't recall any instance of on_connect being called twice without on_disconnect being called in between, so you're probably correct that the else statement could simply be removed. I suspect at the time I was reluctant to change something I assumed worked.

cout commented 3 years ago

We have been running with the above patch in our test environment for a few days now, and we are still seeing crashes shortly after a disconnect/reconnect. We have seen one crash in the garbage collector and one in uv_async_init, which hints at possible memory corruption -- perhaps a use-after-free of the freed handle. So this patch still needs some work before it can be applied.

cout commented 3 years ago

The following script reproduces the problem with only one socket:

#!/usr/bin/env node

const child_process = require('child_process');
const fs = require('fs');
const LDAP = require('napi-ldap');  // or require('./main')

const port = 12345;

const server_cfg = { port: port, userLoginAttribute: 'cn', searchBase: 'dc=test' };
const users = [ { 'dn': 'cn=user,dc=test', 'cn': 'user-login' } ];

function start_ldap_server() {
  return new Promise((resolve, reject) => {
    fs.writeFileSync('conf.json', JSON.stringify(server_cfg));
    fs.writeFileSync('users.json', JSON.stringify(users));

    const server_args = [ 'node_modules/ldap-server-mock/server.js', '--conf=conf.json', '--database=users.json', ];
    const server = child_process.spawn('node', server_args);

    var resolved = false;
    server.stdout.on('data', (data) => {
      console.log(`ldap server stdout: ${data}`);
      resolved = true;  // mark success so a later exit does not reject
      resolve(server);
    });
    server.stderr.on('data', (data) => { console.log(`ldap server stderr: ${data}`); });
    server.on('exit', function(code) {
      if (!resolved) { reject(new Error(`ldap server exited with code ${code}`)); }
    });
  });
}

function stop_ldap_server(server, client) {
  return new Promise((resolve, reject) => {
    client.options.disconnect = function() { resolve(); };  // napi-ldap calls this when the connection drops
    server.kill();
  });
}

function bind(ldap, options) {
  return new Promise((resolve, reject) => {
    ldap.bind(options, (err) => {
      if (err) { reject(new Error(err)); } else { resolve(); }
    });
  });
}

async function main() {
  let server = await start_ldap_server();

  const uri = `ldap://127.0.0.1:${port}`;
  const ldap = new LDAP({ uri: uri, base: 'dc=test' });

  await bind(ldap, { binddn: 'cn=user,dc=test', password: 't3st' });
  await stop_ldap_server(server, ldap);
  server = await start_ldap_server();
  await bind(ldap, { binddn: 'cn=user,dc=test', password: 't3st' });
  await stop_ldap_server(server, ldap);
}

main().catch((err) => {
  console.error(err);
});
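
(Note: the script assumes ldap-server-mock is installed, e.g. npm install ldap-server-mock, since the spawn path above points into node_modules.)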

Using node v12.9.1:

$ ./test.js
ldap server stdout: LDAP server listening on port 12345

ldap server stdout: LDAP server listening on port 12345

Error: LDAPError: Timeout
    at /home/cout/git/napi-ldap/test.js:39:25
    at LDAP.searchTimeout (/home/cout/git/napi-ldap/main.js:342:7)
    at listOnTimeout (internal/timers.js:531:17)
    at processTimers (internal/timers.js:475:7)
node: ../deps/uv/src/unix/core.c:929: uv__io_stop: Assertion `loop->watchers[w->fd] == w' failed.

Using node v14.17.0 I see a timeout, but no assertion:

ldap server stdout: LDAP server listening on port 12345

ldap server stdout: LDAP server listening on port 12345

Error: LDAPError: Timeout
    at /home/cout/git/napi-ldap/test.js:37:25
    at LDAP.searchTimeout (/home/cout/git/napi-ldap/main.js:342:7)
    at listOnTimeout (internal/timers.js:555:17)
    at processTimers (internal/timers.js:498:7)

Adding some logging to cnx.c, I see that I am indeed getting a different file descriptor on the second connection attempt:

ldap server stdout: LDAP server listening on port 12345

on_connect; fd=22
on_connect calling uv_poll_init
on_connect calling uv_poll_start
on_disconnect
on_disconnect calling uv_poll_stop
ldap server stdout: LDAP server listening on port 12345

on_connect; fd=19
on_connect calling uv_poll_stop
on_connect calling uv_poll_start
Error: LDAPError: Timeout
    at /home/cout/git/napi-ldap/test.js:37:25
    at LDAP.searchTimeout (/home/cout/git/napi-ldap/main.js:342:7)
    at listOnTimeout (internal/timers.js:555:17)
    at processTimers (internal/timers.js:498:7)

cout commented 3 years ago

Reading https://stackoverflow.com/a/47270400/244083, it sounds like the correct thing to do is to call free from the callback passed to uv_close, since uv_close is asynchronous and libuv may still touch the handle until that callback runs.
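
As a minimal standalone illustration of that pattern (not napi-ldap code, just the generic libuv close-then-free idiom):

#include <stdlib.h>
#include <uv.h>

static void
on_close (uv_handle_t * handle)
{
  free (handle);  /* safe: libuv is finished with the handle here */
}

int
main (void)
{
  uv_loop_t *loop = uv_default_loop ();
  uv_timer_t *timer = malloc (sizeof (uv_timer_t));

  uv_timer_init (loop, timer);
  /* ... use the handle ... */

  uv_close ((uv_handle_t *) timer, on_close);  /* not free (timer) here */

  return uv_run (loop, UV_RUN_DEFAULT);  /* runs until the close completes */
}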

I applied this patch:

diff --git a/cnx.c b/cnx.c
index e513487..60da1c8 100644
--- a/cnx.c
+++ b/cnx.c
@@ -1231,6 +1231,11 @@ on_connect (LDAP * ld, Sockbuf * sb,
 }

 static void
+on_close (uv_handle_t * handle) {
+  free(handle);
+}
+
+static void
 on_disconnect (LDAP * ld, Sockbuf * sb, struct ldap_conncb *ctx)
 {
   struct ldap_cnx *ldap_cnx = (struct ldap_cnx *) ctx->lc_arg;
@@ -1247,7 +1252,11 @@ on_disconnect (LDAP * ld, Sockbuf * sb, struct ldap_conncb *ctx)
     return;

   if (ldap_cnx->handle)
-    uv_poll_stop (ldap_cnx->handle);
+    {
+      uv_poll_stop (ldap_cnx->handle);
+      uv_close ((uv_handle_t *) ldap_cnx->handle, on_close);
+      ldap_cnx->handle = NULL;
+    }

   status = napi_open_handle_scope (env, &scope);
   assert (status == napi_ok);

and the above test exits with no assertion and no timeout (tested with both node v12.9.1 and v14.17.0); with the handle NULLed out, on_connect presumably takes the uv_poll_init path again and binds a fresh handle to the new file descriptor:

ldap server stdout: LDAP server listening on port 12345

ldap server stdout: LDAP server listening on port 12345

wpatrik14 commented 2 years ago

Hi, we face similar issues after disconnecting from the server. Do you have any update (or ETA) on the fix, or any workaround we could apply in the meantime?