timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License
2.15k stars 160 forks source link

Connection terminated unexpectedly with serverless db #381

Closed mepc36 closed 3 months ago

mepc36 commented 1 year ago

I keep getting this error when the connection is kept open for a longtime to a serverless db I'm running pg-boss with:

node:internal/process/promises:279
            triggerUncaughtException(err, true /* fromPromise */);
            ^
{
  message: 'Connection terminated unexpectedly (Queue: __pgboss__cron, Worker: 5a50af52-55ca-4b9b-aa55-5c994a1b473d)',
  stack: 'Error: Connection terminated unexpectedly (Queue: __pgboss__cron, Worker: 5a50af52-55ca-4b9b-aa55-5c994a1b473d)\n' +
    '    at Connection.<anonymous> (/Users/martinconnor/Desktop/rapbot-mobile/node_modules/pg/lib/client.js:132:73)\n' +
    '    at Object.onceWrapper (node:events:641:28)\n' +
    '    at Connection.emit (node:events:527:28)\n' +
    '    at Connection.emit (node:domain:475:12)\n' +
    '    at Socket.<anonymous> (/Users/martinconnor/Desktop/rapbot-mobile/node_modules/pg/lib/connection.js:57:12)\n' +
    '    at Socket.emit (node:events:527:28)\n' +
    '    at Socket.emit (node:domain:475:12)\n' +
    '    at TCP.<anonymous> (node:net:709:12)',
  queue: '__pgboss__cron',
  worker: '5a50af52-55ca-4b9b-aa55-5c994a1b473d'
}

Here's how I start the server:

app.listen(3039, async () => {
  console.log('listening on 3039')
  await boss.start();
  await boss.subscribe(`demucs-${process.env.APP_ENVIRONMENT}`, separateAudio);
})

Here's how I create the job:

const scheduleJob = async (queue, data, startAfter = 0, retryAttempts = 0, runJustOnce = true) => {
  const jobData = {
    retryAttempts,
    runJustOnce,
    data
  }

  boss.onComplete(queue, async job => {
    try {
      console.log('queue:', queue)
      const triggerResponse = await pusher.trigger(queue, 'job-event', {
        type: 'completed',
        job
      });
      console.log('.onComplete!!!!')

      boss.complete(job.id)
    } catch (error) {
      throw error
    }
  })

  const jobOptions = {
    startAfter,
    expireInMinutes: 10,
    retentionMinutes: 10,
    onComplete: true
  }

  const jobId = await boss.publish(queue, jobData, jobOptions)
  return jobId
}

It may be related to #365. I will try the solution in that thread mentioned here now. Also I'm running pg-boss version 6.2.2, and apparently a fix might have been shipped in the newer 9.0.0 version. Will try that and report back. Thanks!

mepc36 commented 1 year ago

Also I'm running this with a serverless db in AWS, so maybe when the server on which this db is running gets closed down then pg-boss throws this error?

timgit commented 1 year ago

Yes, I think you should update to latest and re-validate

mepc36 commented 1 year ago

Will do, thank you @timgit!

Love the lib btw

Edit: The shim suggested in #365 did not work, but I also think their problem was slightly different from mine.

mepc36 commented 1 year ago

Looks like it's happening in 9.0.0 too:

node:internal/errors:465
    ErrorCaptureStackTrace(err);
    ^

Error [ERR_UNHANDLED_ERROR]: Unhandled error. ({
  message: 'Connection terminated unexpectedly (Queue: __state__completed__onCompleteFtw, Worker: e5e396aa-7aaa-47b2-8c4c-200fa791ca22)',
  stack: 'Error: Connection terminated unexpectedly (Queue: __state__completed__onCompleteFtw, Worker: e5e396aa-7aaa-47b2-8c4c-200fa791ca22)\n' +
    '    at Connection.<anonymous> (/Users/martinconnor/Desktop/server-side-comm/node_modules/pg/lib/client.js:132:73)\n' +
    '    at Object.onceWrapper (node:events:641:28)\n' +
    '    at Connection.emit (node:events:527:28)\n' +
    '    at Socket.<anonymous> (/Users/martinconnor/Desktop/server-side-comm/node_modules/pg/lib/connection.js:62:12)\n' +
    '    at Socket.emit (node:events:527:28)\n' +
    '    at TCP.<anonymous> (node:net:709:12)',
  queue: '__state__completed__onCompleteFtw',
  worker: 'e5e396aa-7aaa-47b2-8c4c-200fa791ca22'
})
    at new NodeError (node:internal/errors:372:5)
    at PgBoss.emit (node:events:516:17)
    at Manager.<anonymous> (/Users/martinconnor/Desktop/server-side-comm/node_modules/pg-boss/src/index.js:88:37)
    at Manager.emit (node:events:527:28)
    at Worker.onError (/Users/martinconnor/Desktop/server-side-comm/node_modules/pg-boss/src/manager.js:256:12)
    at Worker.start (/Users/martinconnor/Desktop/server-side-comm/node_modules/pg-boss/src/worker.js:70:14)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  code: 'ERR_UNHANDLED_ERROR',
  context: {
    message: 'Connection terminated unexpectedly (Queue: __state__completed__onCompleteFtw, Worker: e5e396aa-7aaa-47b2-8c4c-200fa791ca22)',
    stack: 'Error: Connection terminated unexpectedly (Queue: __state__completed__onCompleteFtw, Worker: e5e396aa-7aaa-47b2-8c4c-200fa791ca22)\n' +
      '    at Connection.<anonymous> (/Users/martinconnor/Desktop/server-side-comm/node_modules/pg/lib/client.js:132:73)\n' +
      '    at Object.onceWrapper (node:events:641:28)\n' +
      '    at Connection.emit (node:events:527:28)\n' +
      '    at Socket.<anonymous> (/Users/martinconnor/Desktop/server-side-comm/node_modules/pg/lib/connection.js:62:12)\n' +
      '    at Socket.emit (node:events:527:28)\n' +
      '    at TCP.<anonymous> (node:net:709:12)',
    queue: '__state__completed__onCompleteFtw',
    worker: 'e5e396aa-7aaa-47b2-8c4c-200fa791ca22'
  }
}

That error happened I left this script running:

require("dotenv").config();
const PgBoss = require('pg-boss');

const pgBossOptions = {
  host: process.env.PGHOST,
  database: process.env.PGDATABASE,
  user: process.env.PGUSER,
  password: process.env.PGPASSWORD,
  deleteAfterDays: 365000
};

const boss = new PgBoss(pgBossOptions);

;
(async () => {
  // 9.0.0:
  await boss.start();
  const jobName = 'onCompleteFtw'

  boss.onComplete(jobName, job => {
    console.log('onComplete > job:', job)

    boss.complete(job.id)
  })

  const jobId = await boss.send(jobName, {}, { onComplete: true })
  const job = await boss.fetch(jobName)
  await boss.complete(job.id)
})()
mepc36 commented 1 year ago

I've been calling .start and .subscribe in the app.listen callback:

app.listen(3039, async () => {
  console.log('listening on 3039')
  await boss.start();
  await boss.subscribe(`demucs-${process.env.APP_ENVIRONMENT}`, separateAudio);
})

Where do I call boss.stop() then, whenever a job has completed? I think this bug is basically my fault, I'm not calling boss.stop() anywhere.

mepc36 commented 1 year ago

Alright now I'm calling boss.start(0 right before creating and starting a job, and boss.stop at the end of the .onCompelte callback, but I'm now getting this error about destructuring properties:

/Users/martinconnor/Desktop/rapbot-mobile/services/live/demucs-with-db-and-s3/node_modules/pg-boss/src/boss.js:81
      const { secondsAgo } = await this.getMaintenanceTime()
              ^

TypeError: Cannot destructure property 'secondsAgo' of '(intermediate value)' as it is undefined.
    at Timeout._onTimeout (/Users/martinconnor/Desktop/rapbot-mobile/services/live/demucs-with-db-and-s3/node_modules/pg-boss/src/boss.js:81:15)
timgit commented 1 year ago

A couple things

  1. You should add an error event listener: boss.on('error', ...)
  2. You should only use send() + work() and not publish() + subscribe() since these were changed in a recent semver major.
mepc36 commented 1 year ago

Hey Tim, thanks for the help. I really appreciate it. As for your points:

1.) I do have an error handler. You can see it where I am instantiating the pg-boss instance, sorry for not posting this before:

const PgBoss = require('pg-boss');

const pgBossOptions = {
  host: process.env.PGHOST,
  database: process.env.DATABASE,
  user: process.env.PGUSER,
  password: process.env.PGPASSWORD,
  deleteAfterDays: 365000
};

const boss = new PgBoss(pgBossOptions);

boss.on('error', error => {
  throw error
});
  1. I'm still getting this bug while using send + work, after upgrading to 9.0.0.

I used to start pg-boss in the app.listen callback, but now I'm doing it in the route itself:

app.post('/separate', async (req, res) => {
  try {
    await boss.start();
    const jobId = await scheduleJob(`demucs-${process.env.APP_ENVIRONMENT}`, req.body)
    res.status(201).send({ jobId })
  } catch (error) {
    res.status(500).send(error)
  }
})

Here is how I'm scheduling the job. It's also where I also stop pg-boss, inside the .onComplete callback:

const scheduleJob = async (queue, data, startAfter = 0, retryAttempts = 0, runJustOnce = true) => {
  const jobData = {
    retryAttempts,
    runJustOnce,
    data
  }

  boss.onComplete(queue, async job => {
    try {
      const demucsResponse = await separateAudio(job)
      const triggerResponse = await pusher.trigger(queue, 'job-event', {
        type: 'completed',
        demucsResponse
      });

      await boss.stop()
      return demucsResponse
    } catch (error) {
      throw error
    }
  })

  const jobOptions = {
    startAfter,
    expireInMinutes: 10,
    retentionMinutes: 10,
    onComplete: true
  }

  const jobId = await boss.send(queue, jobData, jobOptions)
  const job = await boss.fetch(queue)
  await boss.complete(job.id)
  return job.id
}

Although this arrangement of calling .stop() means I don't get the Connection terminated unexpectedly error anymore, it throws this new Cannot destructure property 'secondsAgo' error.

So I'm in a catch-22: when I remove boss.stop(), I get the connection-terminated error, and when I put it in, I get the cannot-destructure error. I can't help but feel that I"m doing something wrong as far as what I'm calling where.

The question boils down to this: in an Express server app, where does boss.start(), boss.send(), and boss.work() get called when a server is creating a job through an endpoint; running that job asynchronously (i.e., the route returns before the job's work is done); and then emitting a job-is-finished event? Any thoughts?

timgit commented 1 year ago

I treat pg-boss in my API like a connection pool. It's a long-lived object that is created by a service that stays running for the entire uptime of the application

mepc36 commented 1 year ago

Thanks Tim, that's helpful info. I really feel like the problem is with some db config settings then. I have an RDS instance running on AWS. I will investigate the timeouts there and then report back.

mepc36 commented 1 year ago

Turns out this is a known issue with node-postgres, which pg-boss appears to use under the hood:

https://github.com/brianc/node-postgres/issues/2112

Takeaways: probably has to do with timeout settings of infrastructure provider. Might need 3rd party software to fix; see more here. Here's a better comment here.

No easy solutions there, apparently answer might involve a db change. I will report back with fix.

KPouianou commented 1 year ago

I am facing the same destructuring error mentioned above - but I have no way to reproduce it. It just seems to randomly crash my Blitz.js+Postgres app. Any clues on how to reproduce or any workarounds?

mepc36 commented 1 year ago

@KPouianou sorry, not yet. What is your db running on? AWS?

mepc36 commented 1 year ago

HI @timgit, could you please confirm that pg-boss does not allow the keepAlive option to be set when configuring pg-boss' connection to the database via node-postgres?

Docs on pg-boss' connection options: https://github.com/timgit/pg-boss/blob/master/docs/readme.md#newoptions node-postgres maintainer's note about using keepAlive to send TCP keep_alive packets: https://github.com/brianc/node-postgres/issues/2362#issuecomment-704286512 pg-boss' instantiation of DB connex: https://github.com/timgit/pg-boss/blob/master/src/db.js#L14

mepc36 commented 1 year ago

Possible fix from SOF: https://stackoverflow.com/a/65608187

KPouianou commented 1 year ago

@KPouianou sorry, not yet. What is your db running on? AWS?

It is a Postgres running on AWS (RDS)

timgit commented 1 year ago

If this would be resolved by simply passing any additional configuration through to node-pg, that seems like a good solution. Did you already try using keepAlive?

KPouianou commented 1 year ago

Possible fix from SOF: https://stackoverflow.com/a/65608187

This did not seem to fix the issue, but it seems to be occurring less often. Do you think increasing the tcp_keepalives_idle beyond 60 might help?

timgit commented 1 year ago

I don't run pg in RDS so any testing and feedback you can do would be appreciated.

mepc36 commented 1 year ago

I don't run pg in RDS so any testing and feedback you can do would be appreciated.

Will do!

Do you think increasing the tcp_keepalives_idle beyond 60 might help?

Hoping to test that this week, I'll let you know.

mepc36 commented 1 year ago

Going to still keep this open. I do not have console access to the db in question currently, although that's in the process of changing. Will test the new config when that happens. Thanks!

ygrenzinger commented 1 year ago

For information, same problem is happening on a cloud provider named CleverCloud. They are doing network maintenance and each "glitch" cause this exception crashing the app.

mlynch commented 10 months ago

I'm also seeing that destructuring issue when calling stop(). I'm having similar issues using PgBoss in a few serverless and long-lived server process. Is there a proper way to initialize and cleanup PgBoss instances so connections are returned back to the pool and to be as lightweight as possible? In particular, when using PgBoss against a pooled connection manager on the server I tend to see more of these issues come up. I've also had a few issues where the pooler threw "too many clients" errors, indicating to me my PgBoss instances are just stacking up and never cleaning themselves out.

When I send new messages, I generally do:

const boss = new PgBoss(...)
boss.start();
boss.send()

Should I also be calling stop()?

timgit commented 3 months ago

Please try again with v10 that was recently merged and reopen if it's not