josephg / node-foundationdb

Modern Node.js FoundationDB bindings
Other
115 stars 17 forks source link

Hanging NodeJS process #41

Closed aikoven closed 4 years ago

aikoven commented 4 years ago

We're facing an issue that our services are constantly hanging, by which I mean that any JS code stops executing. This doesn't happen in our tests, nor at our staging environment, only on production. So I guess it's some random race condition that happens more often with increased load.

These services are only doing reads from FDB. This includes single key reads, range reads, and watches.

I tried to wrap all calls to the library with logs, but couldn't figure out a single point where it stops.

We're at NodeJS v12.16.1, FDB client 6.2.15. Running console.log(require("foundationdb").modType) returns napi.

I'm not very good at debugging native code, but here's a stack trace from a hanging process:

Thread 12 (LWP 14382):
#0  0x00007f5fd2d4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x000000000131caa9 in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>) at ../deps/uv/src/unix/thread.c:781
No locals.
#2  0x00000000009c58cd in napi_call_threadsafe_function ()
No symbol table info available.
#3  0x00007f5fbbdfc315 in ?? () from target:/opt/app/node_modules/foundationdb/prebuilds/linux-x64/node.napi.node
No symbol table info available.
#4  0x00007f5fbb3de1e7 in ?? () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#5  0x00007f5fbb414359 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#6  0x00007f5fbb427826 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#7  0x00007f5fbb5a3c68 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#8  0x00007f5fbb611181 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#9  0x00007f5fbb6115e3 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#10 0x00007f5fbb5a3c68 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#11 0x00007f5fbb619d8b in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#12 0x00007f5fbb61d8c1 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#13 0x00007f5fbb61ce07 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#14 0x00007f5fbb61d6ad in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#15 0x00007f5fbb61d77d in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#16 0x00007f5fbb5a3c68 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#17 0x00007f5fbb589e3f in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#18 0x00007f5fbb58ac8d in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#19 0x00007f5fbb5b4dd8 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#20 0x00007f5fbb577ba4 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#21 0x00007f5fbb577cdf in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#22 0x00007f5fbb5520e0 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#23 0x00007f5fbb551cf8 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#24 0x00007f5fbb551e2c in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#25 0x00007f5fbb58b098 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#26 0x00007f5fbb482688 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#27 0x00007f5fbb4b3b9b in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#28 0x00007f5fbb69f0ae in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#29 0x00007f5fbb69f1b5 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#30 0x00007f5fbb7aae08 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#31 0x00007f5fbb52347a in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#32 0x00007f5fbb41961c in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#33 0x00007f5fbb3f54bc in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#34 0x00007f5fbb3dc659 in fdb_run_network () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#35 0x00007f5fbbdf8c3e in ?? () from target:/opt/app/node_modules/foundationdb/prebuilds/linux-x64/node.napi.node
No symbol table info available.
#36 0x00007f5fd2d476ba in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#37 0x00007f5fd2a7d41d in clone () from target:/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

Please tell me if there's any additional info I can provide. Thanks in advance!

josephg commented 4 years ago

Hm thats no good.

aikoven commented 4 years ago

I fear that it could be hard to reproduce as is since we didn't experience anything like that on our staging (which is a copy of our production environment but with little load), nor during load testing.

Maybe there's something I could do on my side to aid debugging? Like running a debug build of the library.

josephg commented 4 years ago

Thanks for that. I'm not sure what else a debugging build would tell me in this case. The stack trace you provided above is pretty good - it looks like whats happening is:

Anyway, I've been able to reproduce it already (!!) by making the call_threadsafe_function queue 1 element long, then running this little stress tester. It stutters, and after awhile just hangs.

const fdb = require('.')
fdb.setAPIVersion(600);

const db = fdb.openSync();

let time = 0
setInterval(() => { console.log('still alive', time++) }, 2000)

;(async () => {
  await db.set('x', 'hi')

  const thread = async (id) => {
    console.log('starting thread', id)
    for (let i = 0; i < 100000; i++) {
      await db.get('x')
      console.log(i, id)
    }
    console.log('thread done', id)
  }

  for (let i = 0; i < 50; i++) {
    thread(i)
  }
})()
josephg commented 4 years ago

Ah. Hahahaha I think I see the problem 😏

The issue is that sometimes foundationdb's future objects get resolved immediately, in the current thread. In particular, this happens with calls that are idempotent. For example, committing a read-only transaction. The order of operations which causes the issue is this:

... Nodejs's main thread is now waiting on the queue to have room. But the queue will only have room when it finishes processing more results from fdb... which it can't do while its blocked waiting on the queue to have room. Cycle (and resulting deadlock) is complete and the process hangs.

josephg commented 4 years ago

📦 foundationdb@0.10.7. I'm 90% sure this will fix your issue. Give it a try and let me know!

aikoven commented 4 years ago

This is great, thank you! I will check it right now.

aikoven commented 4 years ago

The issue seems to be gone now.

I appreciate your help very much!

josephg commented 4 years ago

Awesome - glad to hear it! :)