timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License

Spawn child processes for each job out of one worker #280

Closed phips28 closed 2 years ago

phips28 commented 2 years ago

Hi again, I read in an old comment (https://github.com/timgit/pg-boss/issues/17#issuecomment-287119721) that you use fork to spawn child processes.

We want to do the same: have only one worker that handles all the jobs and then starts child processes to do the work. (This is also related to the other issue we have with concurrency, https://github.com/timgit/pg-boss/issues/274; it works with a multi-worker setup but introduces other issues.) The child process must report back to the worker to complete/fail the job, and when the master worker dies (pm2 restart), the children must be killed as well.

Do you have a working boilerplate or code snippets of this?

timgit commented 2 years ago

We no longer use this approach and let our containers run on a single process. I was able to find some old (pre-async) code, however. You may also be interested in Node's cluster module, btw.

Here's an example of how I was manually starting a forked worker and waiting for it to start pg-boss and report back.

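const child_process = require('child_process');

// Fork the worker script and resolve once it sends the 'ready' message.
function startWorker() {
    const worker = child_process.fork('worker/worker.js', { env: process.env });
    return new Promise(resolve => worker.on('message', message => message === 'ready' ? resolve() : null));
}

phips28 commented 2 years ago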

Ah, you started a worker. We wanted to do it the other way around: start one permanent worker with pm2, and have this worker spawn one child for each job (with good concurrency settings so we don't run out of resources (CPU/RAM) on the machine). But this can also be solved with your code snippet, with some modifications and error handling.

Do you think this is a good/bad approach?
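For illustration, the "one child per job that reports back" idea could look roughly like the sketch below. This is only a sketch under assumptions, not code from pg-boss or from this thread: `runJobInChild`, the `{ ok, result, error }` message shape, and the temp-file child script are all made up for the example, and the child script is written to a temp file purely so the snippet runs on its own.

```javascript
const { fork } = require('child_process');
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical child script. In a real project this would be its own file.
const childSource = `
// If the IPC channel closes (parent disconnected or died), exit as well.
process.on('disconnect', () => process.exit(0));

process.once('message', job => {
  // Placeholder "work": double a number, or fail on request.
  const reply = job.data.shouldFail
    ? { ok: false, error: 'job ' + job.id + ' failed' }
    : { ok: true, result: job.data.value * 2 };
  process.send(reply);
});
`;

const childScriptPath = path.join(os.tmpdir(), `job-child-${process.pid}.js`);
fs.writeFileSync(childScriptPath, childSource);

// Fork one child for one job. Resolves with the child's result, rejects if
// the child reports a failure or exits before reporting anything.
function runJobInChild(scriptPath, job) {
  return new Promise((resolve, reject) => {
    const child = fork(scriptPath);
    let settled = false;

    child.once('message', msg => {
      settled = true;
      child.disconnect(); // closes the IPC channel, which makes the child exit
      if (msg.ok) resolve(msg.result);
      else reject(new Error(msg.error));
    });

    child.once('error', err => {
      if (!settled) { settled = true; reject(err); }
    });

    child.once('exit', code => {
      if (!settled) {
        settled = true;
        reject(new Error(`child exited with code ${code} before reporting`));
      }
    });

    child.send(job);
  });
}
```

Since the function returns a promise, it could be awaited inside an async job handler, so that a resolved promise means "completed" and a rejection means "failed". That mapping is my understanding of how pg-boss treats async handlers, so check it against the version you run.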

timgit commented 2 years ago

What do you think are the advantages and disadvantages of single process vs. temporary child processes?

phips28 commented 2 years ago

It's more related to our specific use case and requirements, which make this more complex.

Currently we have 20 workers running (started via PM2), each subscribing to all queues. But there must be only one job active at a time per queue (as described in https://github.com/timgit/pg-boss/issues/274). We are running on my PR with the advisory lock, and that works pretty well so far, but it adds more complexity, or at least a knowledge bottleneck (for other programmers and maybe devops).

Each worker listens for SIGINT to know when it is being restarted (mostly on new deploys). In the SIGINT handler, we set all jobs back to created and also reset some flags in our database to "zero". But PM2 restarts processes one by one, or two by two, just not all at the same time. That leads to problems: the first restarted worker already picks up new jobs, and workers restarted later reset those jobs again, causing duplicate jobs to run after a pm2 restart. Now we use Redis to notify all workers when the first worker gets the SIGINT signal, so that they all stop working, stop pgboss, and reset jobs and db. Then we also set a Redis key with a TTL of x sec, and every worker waits until this flag clears. Then they start pgboss and start working, making sure there are no race conditions with resetting other jobs. This Redis stuff adds complexity again: new code, new stuff to learn/understand for new devs, another thing that can break.

We previously used Bull Queue, but it had a lot of other bugs and downsides. It did have the one worker, multiple children approach, though (with memory leaks, zombies, and so on, at least in our project).

That's why we think the one worker, multiple children approach suits us best: we get granular concurrency settings, and we don't have to care about distribution (the Redis stuff described above) or use advisory locks. For each job the worker creates a child that needs to report back to finish/fail the job (promise) and must also die if pm2 restarts the worker. So I think that's the next thing I am going to try.
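The "children must die when pm2 restarts the worker" part could be sketched like this. It assumes the worker receives SIGINT or SIGTERM on restart (SIGINT is what this thread mentions); `track`, `killChildren`, and `installShutdownHooks` are hypothetical helper names, not part of pg-boss or pm2.

```javascript
// Registry of live child processes owned by this worker.
const children = new Set();

// Remember a child and forget it again once it has exited.
function track(child) {
  children.add(child);
  child.once('exit', () => children.delete(child));
  return child;
}

// Send a signal to every child that is still registered.
function killChildren(signal = 'SIGTERM') {
  for (const child of children) {
    child.kill(signal);
  }
}

// On SIGINT/SIGTERM, take the children down before the worker exits.
function installShutdownHooks() {
  for (const signal of ['SIGINT', 'SIGTERM']) {
    process.once(signal, () => {
      killChildren();
      process.exit(0);
    });
  }
}
```

Usage would be something like `const child = track(fork('job-runner.js'))` for every forked job (the file name is a placeholder) plus a single `installShutdownHooks()` call at startup. A signal handler cannot run if the worker is killed with SIGKILL, so a `disconnect` listener inside the child is a useful second line of defense.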

We already tried just one worker to process everything, but in that case not all cores on our server (24) can be used, due to the Node.js architecture. Therefore we came up with the advisory lock in the beginning and added more pm2 processes of the worker, and now we are here. 😄
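To keep a single worker from starting more children than the machine can handle, the number of concurrently running children could be capped, for example near the core count. A minimal sketch, assuming a plain in-process promise limiter is enough; `createLimiter` and `limitChildren` are hypothetical names:

```javascript
const os = require('os');

// Returns a function that runs async tasks with at most `maxConcurrent`
// of them in flight; further tasks wait in a FIFO queue.
function createLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];

  const next = () => {
    if (active >= maxConcurrent || waiting.length === 0) return;
    active++;
    const { task, resolve, reject } = waiting.shift();
    Promise.resolve()
      .then(task)
      .then(resolve, reject)
      .finally(() => {
        active--;
        next();
      });
  };

  return task =>
    new Promise((resolve, reject) => {
      waiting.push({ task, resolve, reject });
      next();
    });
}

// Assumption: leave one core for the worker process itself.
const limitChildren = createLimiter(Math.max(1, os.cpus().length - 1));
```

Each job would then be wrapped as `limitChildren(() => startChildFor(job))`, where `startChildFor` stands for whatever function forks the child and returns a promise.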

I hope you could follow our situation and requirements.

timgit commented 2 years ago

I don't have enough information to offer specific advice on this architecture, but that disclaimer aside, I do have a couple thoughts. :grin:

  1. As long as a single process is able to handle your load, I wouldn't add child processes just to keep the server computationally busy unless the nature of the problem is CPU-bound and not IO-bound. In my experience, 99% of the time I have an IO-bound workload in Node.js. As with any other architecture decision, as long as you have a way to profile the benefits of adding child processes and can show that data, you're fine. :+1:
  2. Jobs should be immutable from the consumer/worker perspective and not have their state manually changed (for example, resetting it back to created). If a job needs to be able to be restarted, I would recommend using retry configuration instead.
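As a rough illustration of the retry suggestion, a job can be created with retry options instead of being reset to created by hand. This is a sketch under assumptions: the option names `retryLimit`, `retryDelay` (seconds), and `retryBackoff` are the ones I know from the pg-boss docs, the values are arbitrary, and `enqueueWithRetry` is a hypothetical helper. The publishing method is `send` in newer pg-boss versions and `publish` in older ones, so the helper accepts either.

```javascript
// Example values only; tune them for your workload.
const retryOptions = {
  retryLimit: 3,      // retry a failed job up to 3 times
  retryDelay: 30,     // wait 30 seconds between attempts
  retryBackoff: true  // increase the delay on each retry
};

// `boss` is expected to be a started PgBoss instance.
async function enqueueWithRetry(boss, queue, data) {
  const enqueue = typeof boss.send === 'function' ? boss.send : boss.publish;
  return enqueue.call(boss, queue, data, retryOptions);
}
```

With this in place the worker only has to let a job fail (for example by throwing from its handler), and the retry is scheduled by pg-boss rather than by changing job state manually.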