Closed aikoven closed 4 years ago
Hm, that's no good.
There are basically two kinds of transactions. The first one scans the event stream, which is:
- get, to get the versionstamp of the last event in the stream
- getRangeRaw with StreamingMode.Iterator, to read a batch of events up to that versionstamp
- watch, if we reached the last event

This is wrapped in while (true) (outside of the transaction) to read the stream.
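As a rough sketch of the loop described above: the names runTransaction, lastVersionstamp, readBatch and watchLast are hypothetical stand-ins for the transaction wrapper and for the get, getRangeRaw and watch calls; they are not the library's API. Versionstamps are modelled as plain numbers to keep the example small.

```javascript
// Sketch only. `runTransaction(fn)` is a hypothetical helper that runs `fn`
// with a transaction-like object `tn` exposing three hypothetical methods:
//   tn.lastVersionstamp()     stands in for `get` on the "last event" key
//   tn.readBatch(after, upTo) stands in for `getRangeRaw` with StreamingMode.Iterator
//   tn.watchLast()            stands in for `watch`, returns a promise
async function* scanEvents(runTransaction, cursor = 0) {
  // The loop lives outside the transaction; each pass is one transaction.
  while (true) {
    const { events, watch } = await runTransaction(async (tn) => {
      const last = await tn.lastVersionstamp();
      const events = await tn.readBatch(cursor, last);
      const newCursor = events.length > 0
        ? events[events.length - 1].versionstamp
        : cursor;
      // Only set up a watch once we have caught up with the last event.
      const reachedEnd = newCursor >= last;
      return { events, watch: reachedEnd ? tn.watchLast() : null };
    });

    for (const event of events) {
      cursor = event.versionstamp;
      yield event;
    }

    // Sleep until something new is appended, then scan again.
    if (watch) await watch;
  }
}

// In-memory fake used to exercise the sketch without a database.
const fakeLog = [{ versionstamp: 1 }, { versionstamp: 2 }, { versionstamp: 3 }];
const fakeTn = {
  lastVersionstamp: async () => fakeLog[fakeLog.length - 1].versionstamp,
  readBatch: async (after, upTo) =>
    fakeLog
      .filter((e) => e.versionstamp > after && e.versionstamp <= upTo)
      .slice(0, 2), // at most two events per batch
  watchLast: () => new Promise(() => {}), // never fires in this fake
};
const fakeRunTransaction = (fn) => fn(fakeTn);
```

A consumer would iterate with for await (const event of scanEvents(fakeRunTransaction)) and break when it has seen enough.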
The second kind of transaction just loads an array of events by an array of their versionstamps, by issuing concurrent gets.
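A minimal sketch of that second kind of transaction, assuming only that the transaction object has a get(key) method returning a promise. The loadEvents helper and the Map-backed fake are made up for illustration.

```javascript
// Sketch only: `loadEvents` is a hypothetical helper, not part of the library.
// It starts every read before awaiting any of them, so the gets run concurrently.
function loadEvents(tn, versionstamps) {
  return Promise.all(versionstamps.map((vs) => tn.get(vs)));
}

// Map-backed fake transaction so the sketch runs without a database.
const fakeStore = new Map([
  ['vs1', 'event one'],
  ['vs2', 'event two'],
]);
const fakeReadTn = {
  get: async (key) => fakeStore.get(key),
};

loadEvents(fakeReadTn, ['vs2', 'vs1']).then((events) => {
  // Results come back in the same order as the requested versionstamps.
  console.log(events);
});
```

With the real library, loadEvents would presumably be called from inside a transaction callback, with the real transaction passed in place of fakeReadTn.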
I fear that it could be hard to reproduce as is since we didn't experience anything like that on our staging (which is a copy of our production environment but with little load), nor during load testing.
Maybe there's something I could do on my side to aid debugging? Like running a debug build of the library.
Thanks for that. I'm not sure what else a debugging build would tell me in this case. The stack trace you provided above is pretty good - it looks like what's happening is:
- napi_call_threadsafe_function is called from the fdb network thread to send a signal back to the JavaScript code

Anyway, I've been able to reproduce it already (!!) by making the call_threadsafe_function queue 1 element long, then running this little stress tester. It stutters, and after a while just hangs.
const fdb = require('.')
fdb.setAPIVersion(600);
const db = fdb.openSync();
let time = 0
setInterval(() => { console.log('still alive', time++) }, 2000)
;(async () => {
  await db.set('x', 'hi')
  const thread = async (id) => {
    console.log('starting thread', id)
    for (let i = 0; i < 100000; i++) {
      await db.get('x')
      console.log(i, id)
    }
    console.log('thread done', id)
  }
  for (let i = 0; i < 50; i++) {
    thread(i)
  }
})()
Ah. Hahahaha I think I see the problem 😏
The issue is that sometimes foundationdb's future objects get resolved immediately, in the current thread. In particular, this happens with calls that are idempotent. For example, committing a read-only transaction. The order of operations which causes the issue is this:
- JavaScript makes a call such as txn.commit(). This is resolved immediately by foundationdb on the main thread. The work to pass the result back to JavaScript is added to the end of the queue. But the queue is full, so the call blocks.
- Node.js's main thread is now waiting on the queue to have room. But the queue will only have room when it finishes processing more results from fdb, which it can't do while it's blocked waiting on the queue to have room. The cycle (and resulting deadlock) is complete and the process hangs.
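The cycle can be shown with a toy model. This is not the library's native code: BoundedQueue, enqueueFromMainThread and deliverFromMainThread are invented names, and the "bypass" at the end is only one conceivable way out, not necessarily what the actual fix does.

```javascript
// Toy model: a bounded queue whose only consumer is the main thread.
class BoundedQueue {
  constructor(capacity) {
    this.capacity = capacity;
    this.items = [];
  }
  isFull() {
    return this.items.length >= this.capacity;
  }
  push(item) {
    this.items.push(item);
  }
}

// Models a blocking enqueue performed BY the main thread.
// If the queue is full, the caller would wait for room, but the caller is
// also the only one who can make room, so nothing can ever progress.
function enqueueFromMainThread(queue, result) {
  if (queue.isFull()) {
    return 'deadlock';
  }
  queue.push(result);
  return 'queued';
}

// Assumed alternative: we are already on the main thread, so hand the
// result to the callback directly instead of going through the queue.
function deliverFromMainThread(result, callback) {
  callback(result);
  return 'delivered';
}

const queue = new BoundedQueue(1);
queue.push('result from the network thread'); // queue is now full

// A future resolved immediately on the main thread tries to enqueue.
const outcome = enqueueFromMainThread(queue, 'commit result');

const delivered = [];
const bypass = deliverFromMainThread('commit result', (r) => delivered.push(r));

console.log(outcome, bypass);
```

The capacity of 1 mirrors the stress-test setup mentioned earlier in the thread; with a larger queue the same cycle needs more load before it appears.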
📦 foundationdb@0.10.7. I'm 90% sure this will fix your issue. Give it a try and let me know!
This is great, thank you! I will check it right now.
The issue seems to be gone now.
I appreciate your help very much!
Awesome - glad to hear it! :)
We're facing an issue where our services constantly hang, by which I mean that all JS code stops executing. This doesn't happen in our tests, nor in our staging environment, only in production. So I guess it's some random race condition that happens more often with increased load.
These services are only doing reads from FDB. This includes single key reads, range reads, and watches.
I tried to wrap all calls to the library with logs, but couldn't figure out a single point where it stops.
We're at NodeJS v12.16.1, FDB client 6.2.15. Running console.log(require("foundationdb").modType) returns napi.
I'm not very good at debugging native code, but here's a stack trace from a hanging process:
Please tell me if there's any additional info I can provide. Thanks in advance!