timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss

Queue manager just freezes and doesn't pick up any jobs or continue processing jobs #114

Closed · Hossman333 closed this issue 5 years ago

Hossman333 commented 5 years ago

Hi there! Thanks for this great library! This might not be an issue with pg-boss, but I was wondering if you could offer a bit of advice on debugging. Things work great for quite a while, and then suddenly something happens and everything just dies.

Sometimes it dies with a job stuck in the active state, but nothing is happening. Manually switching it back to created doesn't do anything either. When we look at our Heroku metrics, we see no load on the workers. We're also logging the pool stats, and we're not seeing any idle or even active connections.

[Screenshot: Heroku metrics showing no load on the workers]

What we end up doing is restarting the workers, which fixes things: the queue begins picking up jobs and working through them again. The one consistent pattern we're seeing is that when traffic on the queue increases, it eventually stops.

We also aren't seeing any errors or anything obviously wrong. We're catching the promises and recently added boss.on('error', error => logger.error(error)); to see if it would surface anything, but we still aren't getting any insights.
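For context, the wiring is roughly this (a simplified sketch; the queue name, logger, and handler body are placeholders rather than our actual code):

```js
const PgBoss = require('pg-boss');
const logger = console; // stand-in for our real logger

const boss = new PgBoss(process.env.DATABASE_URL);

// Surface any internal pg-boss errors instead of letting them disappear
boss.on('error', error => logger.error(error));

async function startWorkers() {
  await boss.start();

  // 'some-queue' and the handler body are placeholders
  await boss.subscribe('some-queue', async job => {
    await doWork(job.data); // hypothetical work function
  });
}

startWorkers().catch(error => logger.error(error));
```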

Have any advice on debugging strategies?

Is there a way to graph stuff about the manager and see if it's still checking for jobs or if it somehow got disconnected?

Thank you in advance! Anything would be helpful.

timgit commented 5 years ago

Hey there. I don't have anything to go on here, so I can't even speculate on a cause, but it sounds like whatever code is in your job handlers may be crashing your workers. One strategy you could try is bypassing the subscribe() API and building your own loop using the fetch() and complete() APIs. This is a bit more effort on your part, but it takes some of the magic out of the equation and may give you an easier-to-reason-about test setup for identifying the root cause. Hope this helps!
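A rough sketch of what such a manual loop could look like (assuming fetch(name), complete(id), and fail(id) signatures; the queue name, polling interval, and handler are placeholders):

```js
// Minimal manual polling loop using fetch()/complete() instead of subscribe().
async function pollQueue(boss, queueName, handler, intervalMs = 1000) {
  while (true) {
    const job = await boss.fetch(queueName); // resolves to null when the queue is empty

    if (!job) {
      await new Promise(resolve => setTimeout(resolve, intervalMs));
      continue;
    }

    try {
      await handler(job.data);
      await boss.complete(job.id);
    } catch (error) {
      await boss.fail(job.id, error);
    }
  }
}
```

Because every step is explicit, you can log around fetch() and complete() and see exactly where processing stalls.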

philipbjorge commented 5 years ago

We are not using this library, but we have a near-identical implementation of fetchNextJob ourselves. We've found that with multiple consumers they appear to get deadlocked over time when processing large queues of jobs (millions of rows). We've temporarily resolved things by implementing a statement timeout so that our queue processors automatically restart via exception.
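For example, with node-postgres a per-connection statement timeout can be set like this (a sketch of the general approach, not our exact setup; the 30-second value is arbitrary):

```js
const { Pool } = require('pg');

// Any statement running longer than the timeout is aborted by Postgres,
// which surfaces as an exception in the worker and lets a supervisor restart it.
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  statement_timeout: 30000 // milliseconds; value chosen arbitrarily for illustration
});
```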

We're still trying to determine why, and I'll update this thread if we pin down the issue and it turns out to be relevant to this library or your use case.

Hossman333 commented 5 years ago

Interesting idea @philipbjorge! Thanks for adding to this thread, and to you as well, @timgit. We're going to try running only one queue manager and see if that helps. We also upped our teamSize and teamConcurrency to see how that affects things. Before, teamSize and teamConcurrency were both 5, but we've changed them to 25 for size and 50 for concurrency. We have a bigger box, so we'll see how that handles things. Thanks!
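For reference, those options are passed to subscribe(); roughly like this (queue name and handler are placeholders):

```js
// teamSize controls how many jobs are fetched per batch;
// teamConcurrency controls how many of them are processed at the same time.
await boss.subscribe(
  'some-queue',                            // placeholder queue name
  { teamSize: 25, teamConcurrency: 50 },
  async job => handleJob(job.data)         // hypothetical handler
);
```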

Hossman333 commented 5 years ago

Hey! Thanks, we ended up resolving the problem in our application. We neglected to set a timeout on our requests library, so some of our requests were hanging forever and blocking the next job batch from being fetched.
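For anyone hitting the same thing, the fix amounts to making sure outbound HTTP calls inside the job handler can't hang indefinitely, e.g. (a sketch assuming the request package; the URL, handlers, and 10-second value are placeholders):

```js
const request = require('request');

// Without a timeout, a hung request keeps the handler's promise pending forever,
// so the worker never finishes the batch and never fetches the next one.
request(
  { url: 'https://example.com/api', timeout: 10000 }, // timeout in ms, value arbitrary
  (error, response, body) => {
    if (error) {
      // e.g. ETIMEDOUT: fail the job instead of hanging
      return handleError(error); // hypothetical error handler
    }
    handleResponse(body); // hypothetical response handler
  }
);
```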