The relevant pg code is here:

con.once('end', () => {
  const error = this._ending ? new Error('Connection terminated') : new Error('Connection terminated unexpectedly')
  clearTimeout(this.connectionTimeoutHandle)
  this._errorAllQueries(error)
  if (!this._ending) {
    // if the connection is ended without us calling .end()
    // on this client then we have an unexpected disconnection
    // treat this as an error unless we've already emitted an error
    // during connection.
    if (this._connecting && !this._connectionError) {
      if (this._connectionCallback) {
        this._connectionCallback(error)
      } else {
        this._handleErrorEvent(error)
      }
    } else if (!this._connectionError) {
      this._handleErrorEvent(error)
    }
  }
  process.nextTick(() => {
    this.emit('end')
  })
})
I can also ask them directly, but I was mainly curious whether you have seen this before or if anything jumps out at you that I could try to narrow in on. Thanks again!
Lastly, tomorrow I'll get a super minimal example app set up for reproduction.
Here is a minimal reproducible example (including logs) using the latest pg-boss and Node 19: https://github.com/shayneczyzewski/pgboss-fly
In Wasp, we use Node 18 and pg-boss version 8.0.0 with the same results. Thanks for taking a quick look to see if it's something pg-boss can handle, or if I should head over to pg or Fly.io to ask!
Thanks for the sample project. I'll try to repro.
This looks related: https://github.com/brianc/node-postgres/issues/2439
I've been able to repro node crashing just by stopping postgres locally. However, I can see pg-boss logging errors for a while before the crash since the workers haven't been told to stop trying. Eventually, it does crash, but from an unhandled error that's internal to pg.
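To make the failure mode concrete: the crash is Node's standard behavior when an 'error' event is emitted on an EventEmitter that has no 'error' listener. A minimal illustration of just that behavior (not pg's actual code path):

const { EventEmitter } = require('events')

// Emitting 'error' with no listener registered throws and brings the process down.
// This only illustrates Node's EventEmitter behavior, not pg's internals.
const emitter = new EventEmitter()
emitter.emit('error', new Error('Connection terminated unexpectedly'))
// => Uncaught Error: Connection terminated unexpectedly (process exits)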
For example, if you change your example to the following, node doesn't crash because all workers have been stopped.
boss.on('error', error => {
  console.error(error)
  boss.stop({ graceful: false }).catch(err => console.error(err))
})
However, I wouldn't recommend using an error handler this severe, since not all error events will have the same severity. If you add a condition that looks for these fatal errors from pg, you could conditionally exit, which may resolve this issue on Fly.
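For example, a rough sketch of what such a conditional handler could look like (the set of error codes treated as fatal here is an assumption; adjust it to what actually shows up in your logs):

// Sketch only: treat connection-level failures as fatal, log everything else.
const FATAL_PG_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', 'ENOTFOUND'])

boss.on('error', error => {
  console.error(error)

  if (FATAL_PG_CODES.has(error.code)) {
    // Stop pg-boss, then exit so the platform (e.g. Fly.io) restarts the process.
    boss.stop({ graceful: false })
      .catch(err => console.error(err))
      .finally(() => process.exit(1))
  }
})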
Regarding https://github.com/timgit/pg-boss/issues/365#issuecomment-1414692855, I added the following but node still crashed, so that's not the magic solution. :(
pool.on('connect', client => client.on('error', error => this.emit('error', error)))
Hey @timgit, thanks so much for jumping in on this. I will review the linked pg issue (it does feel related) and try to get it to repro locally (I guess I didn't stop postgres abruptly enough :D). Once I can repro locally, I will try some things out too to see if we can catch that error. Appreciate the attention to this, and have a great weekend!
I was able to repro locally and confirm what you noted: that solution does not work. With that code in place, I tried the following.
All the fetch calls seem to be producing outputs like:
{
  errno: -61,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '::1',
  port: 5432,
  message: 'connect ECONNREFUSED ::1:5432 (Queue: __pgboss__maintenance, Worker: 8cf3d2e9-d85d-40ef-b156-e5b94c506458)',
  stack: 'Error: connect ECONNREFUSED ::1:5432\n' +
    '    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16)',
  queue: '__pgboss__maintenance',
  worker: '8cf3d2e9-d85d-40ef-b156-e5b94c506458'
}
It seems to blow up for me when Timekeeper.getCronTime is called. I wonder why that call to this.db.executeSql isn't getting trapped like the one for the workers. Anyways, will keep poking around!
Hey @timgit, I was just starting to think about this more, and I'm wondering what we can do as a stopgap in this scenario. Do you think the best option is for me to try to catch/decode the type of pg error and shut pg-boss down if it is critical (sort of what you replied in your first comment)? Or, assuming we can even fix the pg error leak, would just letting it keep retrying on this type of error be the ideal state (maybe not)? Do we maybe need to change the internal pg-boss DB status of the pool to something other than opened in this case?
Per the untrapped exception in Timekeeper.getCronTime in the call to this.db.executeSql, just to confirm: given the error handling they propose setting up, we would expect a call to this.pool.query that hits an 'ECONNREFUSED' to emit an error we can handle instead of blowing up? And if it did, then the pg-boss side would be all good? I'm trying to think how I can best distill the problem for them in an example app using only pg, if possible. Thanks so much!
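Something along these lines is the kind of pg-only repro I have in mind (a sketch; it assumes a local postgres on port 5432 that gets stopped while the loop runs, and the question is whether the ECONNREFUSED stays inside the query's rejection and the pool's 'error' event, or escapes as an uncaught exception):

const { Pool } = require('pg')

// Assumed connection details for a local repro; adjust as needed.
const pool = new Pool({ connectionString: 'postgres://postgres@localhost:5432/postgres' })

pool.on('error', error => console.error('pool error:', error.code || error.message))

// Query once a second; stop postgres while this is running and watch
// whether the failure is caught here or crashes the process.
setInterval(async () => {
  try {
    const { rows } = await pool.query('select now() as time')
    console.log('ok:', rows[0].time)
  } catch (error) {
    console.error('query error:', error.code || error.message)
  }
}, 1000)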
BTW, you should hopefully see Wasp as a backer now, as a small thanks for all the great work you are doing for the community with this OSS.
Thanks for the support!
Following is an example of adding retries based on ECONNREFUSED. It's simple, and given the number of maintenance jobs, it fails pretty fast, so adjust as you think appropriate.
let errConnectionRetries = 0

boss.on('error', error => {
  console.error(error)

  if (error.code === 'ECONNREFUSED') {
    errConnectionRetries++
  }

  if (errConnectionRetries > 5) {
    console.log(`Connection lost to postgres after ${errConnectionRetries} retries. Stopping.`)
    boss.stop().catch(console.error)
  }
})
This direction makes sense to me, @timgit. Thanks again! We will try this out in Wasp and see how it goes. Appreciate it!
Hey @timgit, after a bit more testing, I unfortunately do not think the above will do the trick 100% of the time. In particular, if the app loses the DB connection right before the Timekeeper.getCronTime call (so we have fewer than X errors), you can still get an untrapped exception. I tried this by running an every-minute cron and stopping the DB right at 58 seconds. Doing something like process.on('unhandledRejection', ...) feels a bit much here.
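For clarity, the kind of process-level guard I mean (and would rather avoid) is roughly the sketch below; it swallows every unhandled rejection, not just the pg connection failures we care about:

// Blunt process-level guard (sketch). Whether 'unhandledRejection' or
// 'uncaughtException' is the hook that actually fires here depends on how the
// pg error escapes; that part is an assumption.
process.on('unhandledRejection', reason => {
  console.error('Unhandled rejection:', reason)
})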
I wonder if pg-boss could somehow wrap those periodic calls (or Db.executeSql more broadly) and emit an error if it has one?
Just for additional context on the suggestion in the prior comment: after trying the diff below, the sample repro app has been running locally for well over 25 minutes without a DB, and no errors have bubbled up to kill node:
diff --git a/src/boss.js b/src/boss.js
index 899f323..889c94f 100644
--- a/src/boss.js
+++ b/src/boss.js
@@ -224,6 +224,10 @@ class Boss extends EventEmitter {
     if (!this.stopped) {
       const { rows } = await this.db.executeSql(this.getMaintenanceTimeCommand)

+      if (rows.length === 0) {
+        return { secondsAgo: 0 }
+      }
+
       let { maintained_on: maintainedOn, seconds_ago: secondsAgo } = rows[0]

       secondsAgo = secondsAgo !== null ? parseFloat(secondsAgo) : this.maintenanceIntervalSeconds * 10
diff --git a/src/db.js b/src/db.js
index d5ab1ca..5f2c049 100644
--- a/src/db.js
+++ b/src/db.js
@@ -23,9 +23,14 @@ class Db extends EventEmitter {
     }
   }

-  async executeSql (text, values) {
+  async executeSql(text, values) {
     if (this.opened) {
-      return await this.pool.query(text, values)
+      try {
+        return await this.pool.query(text, values)
+      } catch (error) {
+        this.emit('error', error)
+        return { rows: [], rowCount: 0 }
+      }
     }
   }
 }
diff --git a/src/timekeeper.js b/src/timekeeper.js
index a416164..afdda76 100644
--- a/src/timekeeper.js
+++ b/src/timekeeper.js
@@ -98,6 +98,10 @@ class Timekeeper extends EventEmitter {
   async cacheClockSkew () {
     const { rows } = await this.db.executeSql(this.getTimeCommand)

+    if (rows.length === 0) {
+      return;
+    }
+
     const local = Date.now()

     const dbTime = parseFloat(rows[0].time)
@@ -213,6 +217,10 @@ class Timekeeper extends EventEmitter {
   async getCronTime () {
     const { rows } = await this.db.executeSql(this.getCronTimeCommand)

+    if (rows.length === 0) {
+      return { secondsAgo: 0 }
+    }
+
     let { cron_on: cronOn, seconds_ago: secondsAgo } = rows[0]

     secondsAgo = secondsAgo !== null ? parseFloat(secondsAgo) : 61
Now, what I did may be breaking some internal design assumptions, unsure, but I just wanted to show that it is possible to prevent errors from leaking if this is a direction you want to go. Just let me know, and thanks for the help in making pg-boss resilient to DB connectivity issues!
@timgit if the diff above looks ok, would you like me to turn it into a PR perhaps? Thanks!
I've started a branch for this
Awesome, thank you, sir! If we can assist in some way, please let us know.
Thanks for the fix and the new release with this included, @timgit! We at Wasp really appreciate it!
Hi there! We use this awesome library as our jobs provider with Wasp. :D We recently had a user notice that when they deleted their DB on Fly.io, the app came crashing down. Note: Wasp does make use of
boss.on('error', error => console.error(error))
as recommended.
Strangely, when I tried to reproduce locally by just shutting Postgres down while the app was running with cron jobs, I got the following error, but it did not die:
So it seems like it noticed and handled it gracefully.
But then, when I deployed it to Fly.io and killed the DB, I got the following (different) error:
This caused the node process to die, and it only happens when pg-boss is used and jobs are enabled (Express-only apps are fine). Is it possible that, even though pg-boss is handling the error event (https://github.com/timgit/pg-boss/blob/master/src/db.js#L15) and we are too, a pg error is somehow escaping and bubbling up to the top when the DB is killed?
Happy to do anything needed to narrow in on this, just let me know! Thanks!