timgit / pg-boss

Queueing jobs in Node.js using PostgreSQL like a boss

Unexpected node shutdown when database is abruptly shutdown #365

Closed: shayneczyzewski closed this issue 1 year ago

shayneczyzewski commented 1 year ago

Hi there! We use this awesome library as our jobs provider with Wasp. :D We recently had a user notice that when they deleted their DB on Fly.io, the app came crashing down. Note: Wasp does make use of boss.on('error', error => console.error(error)) as recommended.
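
Roughly speaking, the setup looks like this sketch (simplified, with illustrative names rather than our exact code; the env var name is just a placeholder):

    const PgBoss = require('pg-boss')

    async function startJobs() {
      // DATABASE_URL is a placeholder for wherever the connection string lives
      const boss = new PgBoss(process.env.DATABASE_URL)

      // The handler recommended by the pg-boss docs; without an 'error'
      // listener, an emitted 'error' event would crash the process
      boss.on('error', error => console.error(error))

      await boss.start()
      return boss
    }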

Strangely, when I tried to reproduce locally by just shutting Postgres down while the app was running with cron jobs, I got the following error, but the app did not die:

Server(stderr): error: terminating connection due to administrator command

So it seems like it noticed and handled it gracefully.

But then when I deployed it to Fly.io and killed the DB, I got the following (different) error:

2023-02-01T19:58:32Z app[f07d5e6d] mia [info]Error: Connection terminated unexpectedly
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    at Connection.<anonymous> (/app/server/node_modules/pg/lib/client.js:131:73)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    at Object.onceWrapper (node:events:627:28)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    at Connection.emit (node:events:513:28)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    at Socket.<anonymous> (/app/server/node_modules/pg/lib/connection.js:112:12)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    at Socket.emit (node:events:525:35)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    at endReadableNT (node:internal/streams/readable:1359:12)
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    at process.processTicksAndRejections (node:internal/process/task_queues:82:21) {
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]  client: Client {
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    _events: [Object: null prototype] { error: [Function (anonymous)] },
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    _eventsCount: 1,
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    _maxListeners: undefined,
2023-02-01T19:58:32Z app[f07d5e6d] mia [info]    connectionParameters: 
...

This caused the node process to die, and it only happens when pg-boss is used and jobs are enabled (Express-only apps are fine). Is it possible that, even though pg-boss handles the error event (https://github.com/timgit/pg-boss/blob/master/src/db.js#L15) and we do as well, a pg error is somehow escaping and bubbling up to the top when the DB is killed?
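
For what it's worth, my rough reading of that db.js line (paraphrasing from memory, not copying the repo) is that it just forwards pool-level errors to pg-boss's own 'error' event, which is why I'd expect everything to land in our boss.on('error') handler:

    this.pool.on('error', error => this.emit('error', error))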

Happy to try to do anything needed to narrow in on this, just let me know! Thanks!

shayneczyzewski commented 1 year ago

The relevant pg code is here:

    con.once('end', () => {
      const error = this._ending ? new Error('Connection terminated') : new Error('Connection terminated unexpectedly')

      clearTimeout(this.connectionTimeoutHandle)
      this._errorAllQueries(error)

      if (!this._ending) {
        // if the connection is ended without us calling .end()
        // on this client then we have an unexpected disconnection
        // treat this as an error unless we've already emitted an error
        // during connection.
        if (this._connecting && !this._connectionError) {
          if (this._connectionCallback) {
            this._connectionCallback(error)
          } else {
            this._handleErrorEvent(error)
          }
        } else if (!this._connectionError) {
          this._handleErrorEvent(error)
        }
      }

      process.nextTick(() => {
        this.emit('end')
      })
    })

I can also ask them directly, but was mainly curious if you have seen this before or anything jumps out at you for me to try and narrow in on. Thanks again!
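
If I'm reading Node's EventEmitter semantics right, the crash would come from an 'error' event being emitted on something with no 'error' listener attached, e.g. an individual pg Client inside the pool rather than the pool itself. A contrived sketch of that failure mode (not pg-boss or pg code, just the emitter behavior):

    const { EventEmitter } = require('events')

    const client = new EventEmitter()

    // No client.on('error', ...) registered, so this throws an
    // "Unhandled 'error' event" and takes the process down
    client.emit('error', new Error('Connection terminated unexpectedly'))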

shayneczyzewski commented 1 year ago

Lastly, I'll get a super minimal example app set up for reproduction tomorrow. šŸ‘šŸ»

shayneczyzewski commented 1 year ago

Here is a minimally reproducible example (including logs) using the latest pg-boss and Node 19: https://github.com/shayneczyzewski/pgboss-fly

In Wasp, we use Node 18 and pg-boss 8.0.0 with the same results. Thanks for taking a quick look to see if it's something pg-boss can handle, or if I should head over to pg or Fly.io to ask! šŸ‘šŸ»

timgit commented 1 year ago

Thanks for the sample project. I'll try to repro.

This looks related: https://github.com/brianc/node-postgres/issues/2439

timgit commented 1 year ago

I've been able to repro node crashing just by stopping Postgres locally. However, I can see pg-boss logging errors for a while before the crash, since the workers haven't been told to stop trying. Eventually it does crash, but from an unhandled error internal to pg.

For example, if you change your example to the following, node doesn't crash because all workers have been stopped.

    boss.on('error', error => {
        console.error(error)
        boss.stop({ graceful: false }).catch(err => console.error(err))
    });

However, I wouldn't recommend an error handler this severe, since not all error events have the same severity. If you add a condition that looks for these fatal errors from pg, you could conditionally exit, which may resolve this issue on Fly.
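
For example, something along these lines (the set of codes here is just illustrative; pick whatever you consider fatal in your environment):

    const FATAL_CODES = new Set(['ECONNREFUSED', 'ECONNRESET', '57P01'])

    boss.on('error', async error => {
      console.error(error)

      if (FATAL_CODES.has(error.code)) {
        // Stop polling, then exit so the platform (e.g. fly) restarts the app
        // once the database is reachable again
        try {
          await boss.stop({ graceful: false })
        } finally {
          process.exit(1)
        }
      }
    })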

timgit commented 1 year ago

Regarding https://github.com/timgit/pg-boss/issues/365#issuecomment-1414692855, I added the following but node still crashed, so that's not the magic solution. :(

    pool.on('connect', client => client.on('error', error => this.emit('error', error)))

shayneczyzewski commented 1 year ago

Hey @timgit, thanks so much for jumping in on this. I will review the linked pg issue (it does feel related) and try to get it to repro locally (I guess I didn't stop Postgres abruptly enough :D). Once I can repro locally, I will try some things out too to see if we can catch that error. Appreciate the attention to this, and have a great weekend!

shayneczyzewski commented 1 year ago

I was able to repro locally and confirm what you noted: that solution does not work. With that code in place, I tried the following.

All the fetch calls seem to produce outputs like:

{
  errno: -61,
  code: 'ECONNREFUSED',
  syscall: 'connect',
  address: '::1',
  port: 5432,
  message: 'connect ECONNREFUSED ::1:5432 (Queue: __pgboss__maintenance, Worker: 8cf3d2e9-d85d-40ef-b156-e5b94c506458)',
  stack: 'Error: connect ECONNREFUSED ::1:5432\n' +
    '    at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1300:16)',
  queue: '__pgboss__maintenance',
  worker: '8cf3d2e9-d85d-40ef-b156-e5b94c506458'
}

It seems to blow up for me when Timekeeper.getCronTime is called. I wonder why that call to this.db.executeSql isn't getting trapped like the ones for the workers. šŸ¤” Anyways, will keep poking around! šŸ‘šŸ»

shayneczyzewski commented 1 year ago

Hey @timgit, I was just starting to think about this more, and I'm wondering what we can do as a stopgap in this scenario. Do you think the best option is for me to try to catch/decode the type of pg error and shut pg-boss down if it's critical (sort of what you suggested in your first reply)? Or, assuming we can even fix the pg error leak, would letting it keep retrying on this type of error be the ideal state (maybe not)? Do we need to change pg-boss's internal DB status for the pool to something other than opened in this case, maybe?

Per the untrapped exception in Timekeeper.getCronTime in the call to this.db.executeSql, just to confirm: given the error handling they propose setting up, would we expect a call to this.pool.query that hits an 'ECONNREFUSED' to emit an error we can handle instead of blowing up? And if it did, would the pg-boss side then be all good? I'm trying to think how best to distill the problem for them in an example app using only pg, if possible. Thanks so much!
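
For the pg-only distillation, I'm thinking of something like this minimal script (untested sketch; DATABASE_URL is just a placeholder):

    const { Pool } = require('pg')

    const pool = new Pool({ connectionString: process.env.DATABASE_URL })

    // Errors on idle clients should surface here...
    pool.on('error', error => console.error('pool error:', error))

    setInterval(async () => {
      try {
        // ...while query failures should reject and land in this catch
        const { rows } = await pool.query('select now()')
        console.log(rows[0])
      } catch (error) {
        console.error('query error:', error.code, error.message)
      }
    }, 1000)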

BTW, you should hopefully see Wasp as a backer now, as a small thanks for all the great work you are doing for the community with this OSS. šŸš€

timgit commented 1 year ago

Thanks for the support!

Following is an example of adding retries based on ECONNREFUSED. It's simple, and given the number of maintenance jobs, it fails pretty fast, so adjust as you think appropriate.

    let errConnectionRetries = 0

    boss.on('error', error => {
      console.error(error)

      if (error.code === 'ECONNREFUSED') {
        errConnectionRetries++
      }

      if (errConnectionRetries > 5) {
        console.log(`Connection lost to postgres after ${errConnectionRetries} retries. Stopping.`)
        boss.stop().catch(console.error)
      }
    })

shayneczyzewski commented 1 year ago

This direction makes sense to me, @timgit. Thanks again! We will try this out in Wasp and see how it goes. šŸ‘šŸ» Appreciate it

shayneczyzewski commented 1 year ago

Hey @timgit, after a bit more testing, I unfortunately do not think the above will do the trick 100% of the time. In particular, if the app loses the DB connection right before the Timekeeper.getCronTime call (so we have fewer than X errors), you can still get an untrapped exception. I tried this by running an every-minute cron and stopping the DB right at 58 seconds. šŸ¤” Doing something like process.on('unhandledRejection', ...) feels a bit much here.
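
(That stopgap would look roughly like the following, just for illustration; it catches rejections that escape pg-boss entirely instead of letting node crash.)

    process.on('unhandledRejection', (reason) => {
      console.error('Unhandled rejection that escaped pg-boss/pg:', reason)
    })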

I wonder if pg-boss could somehow wrap those periodic calls (or Db.executeSql more broadly) and emit an error if it has one?

shayneczyzewski commented 1 year ago

Just for additional context on the suggestion in the prior comment: after trying the diff below, the sample repro app has been running locally for well over 25 minutes without a DB, and no errors have bubbled up and killed node šŸ„³:

diff --git a/src/boss.js b/src/boss.js
index 899f323..889c94f 100644
--- a/src/boss.js
+++ b/src/boss.js
@@ -224,6 +224,10 @@ class Boss extends EventEmitter {
     if (!this.stopped) {
       const { rows } = await this.db.executeSql(this.getMaintenanceTimeCommand)

+      if (rows.length === 0) {
+        return { secondsAgo: 0 }
+      }
+
       let { maintained_on: maintainedOn, seconds_ago: secondsAgo } = rows[0]

       secondsAgo = secondsAgo !== null ? parseFloat(secondsAgo) : this.maintenanceIntervalSeconds * 10
diff --git a/src/db.js b/src/db.js
index d5ab1ca..5f2c049 100644
--- a/src/db.js
+++ b/src/db.js
@@ -23,9 +23,14 @@ class Db extends EventEmitter {
     }
   }

-  async executeSql (text, values) {
+  async executeSql(text, values) {
     if (this.opened) {
-      return await this.pool.query(text, values)
+      try {
+        return await this.pool.query(text, values)
+      } catch (error) {
+        this.emit('error', error)
+        return { rows: [], rowCount: 0 }
+      }
     }
   }
 }
diff --git a/src/timekeeper.js b/src/timekeeper.js
index a416164..afdda76 100644
--- a/src/timekeeper.js
+++ b/src/timekeeper.js
@@ -98,6 +98,10 @@ class Timekeeper extends EventEmitter {
   async cacheClockSkew () {
     const { rows } = await this.db.executeSql(this.getTimeCommand)

+    if (rows.length === 0) {
+      return;
+    }
+
     const local = Date.now()

     const dbTime = parseFloat(rows[0].time)
@@ -213,6 +217,10 @@ class Timekeeper extends EventEmitter {
   async getCronTime () {
     const { rows } = await this.db.executeSql(this.getCronTimeCommand)

+    if (rows.length === 0) {
+      return { secondsAgo: 0 }
+    }
+
     let { cron_on: cronOn, seconds_ago: secondsAgo } = rows[0]

     secondsAgo = secondsAgo !== null ? parseFloat(secondsAgo) : 61

Now, what I did may break some internal design assumptions (unsure šŸ˜…), but I just wanted to show that it is possible to prevent errors from leaking, if this is a direction you want to go. Just let me know, and thanks for the help in making pg-boss resilient to DB connectivity issues!

shayneczyzewski commented 1 year ago

@timgit if the diff above looks ok, would you like me to turn it into a PR perhaps? Thanks!

timgit commented 1 year ago

I've started a branch for this

shayneczyzewski commented 1 year ago

Awesome, thank you, sir! If we can assist in some way, please let us know.

shayneczyzewski commented 1 year ago

Thanks for the fix and new release with this included, @timgit! We at Wasp really appreciate it! šŸ šŸš€ šŸ‘šŸ»