Hanging NodeJS process - Githubissues

aikoven commented 4 years ago

We're facing an issue that our services are constantly hanging, by which I mean that any JS code stops executing. This doesn't happen in our tests, nor at our staging environment, only on production. So I guess it's some random race condition that happens more often with increased load.

These services are only doing reads from FDB. This includes single key reads, range reads, and watches.

I tried to wrap all calls to the library with logs, but couldn't figure out a single point where it stops.

We're at NodeJS v12.16.1, FDB client 6.2.15. Running console.log(require("foundationdb").modType) returns napi.

I'm not very good at debugging native code, but here's a stack trace from a hanging process:

Thread 12 (LWP 14382):
#0  0x00007f5fd2d4d360 in pthread_cond_wait@@GLIBC_2.3.2 () from target:/lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#1  0x000000000131caa9 in uv_cond_wait (cond=<optimized out>, mutex=<optimized out>) at ../deps/uv/src/unix/thread.c:781
No locals.
#2  0x00000000009c58cd in napi_call_threadsafe_function ()
No symbol table info available.
#3  0x00007f5fbbdfc315 in ?? () from target:/opt/app/node_modules/foundationdb/prebuilds/linux-x64/node.napi.node
No symbol table info available.
#4  0x00007f5fbb3de1e7 in ?? () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#5  0x00007f5fbb414359 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#6  0x00007f5fbb427826 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#7  0x00007f5fbb5a3c68 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#8  0x00007f5fbb611181 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#9  0x00007f5fbb6115e3 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#10 0x00007f5fbb5a3c68 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#11 0x00007f5fbb619d8b in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#12 0x00007f5fbb61d8c1 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#13 0x00007f5fbb61ce07 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#14 0x00007f5fbb61d6ad in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#15 0x00007f5fbb61d77d in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#16 0x00007f5fbb5a3c68 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#17 0x00007f5fbb589e3f in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#18 0x00007f5fbb58ac8d in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#19 0x00007f5fbb5b4dd8 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#20 0x00007f5fbb577ba4 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#21 0x00007f5fbb577cdf in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#22 0x00007f5fbb5520e0 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#23 0x00007f5fbb551cf8 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#24 0x00007f5fbb551e2c in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#25 0x00007f5fbb58b098 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#26 0x00007f5fbb482688 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#27 0x00007f5fbb4b3b9b in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#28 0x00007f5fbb69f0ae in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#29 0x00007f5fbb69f1b5 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#30 0x00007f5fbb7aae08 in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#31 0x00007f5fbb52347a in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#32 0x00007f5fbb41961c in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#33 0x00007f5fbb3f54bc in fdb_transaction_set_option () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#34 0x00007f5fbb3dc659 in fdb_run_network () from target:/usr/lib/libfdb_c.so
No symbol table info available.
#35 0x00007f5fbbdf8c3e in ?? () from target:/opt/app/node_modules/foundationdb/prebuilds/linux-x64/node.napi.node
No symbol table info available.
#36 0x00007f5fd2d476ba in start_thread () from target:/lib/x86_64-linux-gnu/libpthread.so.0
No symbol table info available.
#37 0x00007f5fd2a7d41d in clone () from target:/lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.

Please tell me if there's any additional info I can provide. Thanks in advance!

josephg commented 4 years ago

Hm thats no good.

How often is it hanging? Like, after 20 minutes or after a few hours? Is it random or does it take about the same amount of time to happen each time?
Is the code only doing "normal" reads or range reads / watches / etc? (I'm trying to imagine what the code does so I can try and repro it)
When it hangs, is there any abnormal memory footprint or anything like that?

aikoven commented 4 years ago

It takes between 1 to 20 minutes to hang. It's happening more often for services that do more reads.
There are basically two kinds of transactions: the first one is scanning event stream, which is:
- get to get the versionstamp of the last event in the stream
- getRangeRaw with StreamingMode.Iterator to read a batch of events up to that versionstamp
- watch if we reached the last event
This is wrapped in while (true) (outside of transaction) to read the stream. The second kind of transaction just loads an array of events by an array of their versionstamps by issuing concurrent gets.
Nope, no excess memory nor CPU usage.

I fear that it could be hard to reproduce as is since we didn't experience anything like that on our staging (which is a copy of our production environment but with little load), nor during load testing.

Maybe there's something I could do on my side to aid debugging? Like running a debug build of the library.

josephg commented 4 years ago

Thanks for that. I'm not sure what else a debugging build would tell me in this case. The stack trace you provided above is pretty good - it looks like whats happening is:

We call napi_call_threadsafe_function from the fdb network thread to send a signal back to the javascript code
That adds information about the query to a queue - which I made 16 elements in size. That calls this function in nodejs
But the queue is full, waiting on elements to get popped off the queue as they're processed in javascript. When the queue is full, the call to call_threadsafe_function blocks. At some point we expect the javascript code to finish processing the elements in question, but that isn't happening for some reason.

Anyway, I've been able to reproduce it already (!!) by making the call_threadsafe_function queue 1 element long, then running this little stress tester. It stutters, and after awhile just hangs.

const fdb = require('.')
fdb.setAPIVersion(600);

const db = fdb.openSync();

let time = 0
setInterval(() => { console.log('still alive', time++) }, 2000)

;(async () => {
  await db.set('x', 'hi')

  const thread = async (id) => {
    console.log('starting thread', id)
    for (let i = 0; i < 100000; i++) {
      await db.get('x')
      console.log(i, id)
    }
    console.log('thread done', id)
  }

  for (let i = 0; i < 50; i++) {
    thread(i)
  }
})()

josephg commented 4 years ago

Ah. Hahahaha I think I see the problem 😏

The issue is that sometimes foundationdb's future objects get resolved immediately, in the current thread. In particular, this happens with calls that are idempotent. For example, committing a read-only transaction. The order of operations which causes the issue is this:

Nodejs calls a series of methods in fdb. Foundationdb answers the queries, and fills up the queue with results for nodejs
The javascript code to process one of those queries makes an idempotent call to fdb (like txn.commit()). This is resolved immediately by foundationdb on the main thread. The work to pass the result back to javascript is added to the end of the queue. But the queue is full, so the call blocks...

... Nodejs's main thread is now waiting on the queue to have room. But the queue will only have room when it finishes processing more results from fdb... which it can't do while its blocked waiting on the queue to have room. Cycle (and resulting deadlock) is complete and the process hangs.

josephg commented 4 years ago

📦 foundationdb@0.10.7. I'm 90% sure this will fix your issue. Give it a try and let me know!

aikoven commented 4 years ago

This is great, thank you! I will check it right now.

aikoven commented 4 years ago

The issue seems to be gone now.

I appreciate your help very much!

josephg commented 4 years ago

Awesome - glad to hear it! :)

josephg / node-foundationdb

Hanging NodeJS process #41