grantcarthew / node-rethinkdb-job-queue

A persistent job or task queue backed by RethinkDB.
https://github.com/grantcarthew/node-rethinkdb-job-queue/wiki
MIT License
156 stars 16 forks source link

Delayed jobs documentation isn't clear #14

Closed alexgorbatchev closed 8 years ago

alexgorbatchev commented 8 years ago

I'm having an issue with setting up my workers and queue master. Reading the docs (btw, amazing docs, wow!) I have a few questions:

  1. should there be only one queue master in the cluster?
  2. does queue master "re-ping" delayed jobs?
alexgorbatchev commented 8 years ago

With the masterInterval=false fix (#15), it appears to me that delayed jobs won't be executed on that node. Does this sound correct?

If so, I would expect queue master node to fire off delayed jobs as well, otherwise every node that wants to run delayed jobs need to queue master, which seems wasteful.

grantcarthew commented 8 years ago

Hi @alexgorbatchev,

I don't have a heap of time right now. I will look at #15 when I can. Busy building a balcony.

  1. Yes, there should only be one Queue Master in the cluster.
  2. Yes, the Queue Master should keep working on the delayed jobs once found. I will confirm this. The Queue Master should initiate queue-process.restart().

You may have found a hole in my logic. If a job is waiting and a Queue object is not busy, it will not process delayed jobs, I think. I will have to confirm. I might have to get the Queue Master to cause a change of some sort on the delayed job so the change feed will initiate the processing.

This is just off the top of my head Alex. I'll get back to you.

alexgorbatchev commented 8 years ago

I might have to get the Queue Master to cause a change of some sort on the delayed job so the change feed will initiate the processing.

I think that might be the issue here, even though db-review is running. Also, I noticed that queue-process.restart() might be blocked by if (!q._handler) { return }. This makes me think that if queue master doesn't have any q.process() calls then queue-process.restart() essentially is a null-op.

In other words, a Queue Master that only calls q.addJob and never calls q.process will not process delayed jobs.

I'm very early in the application development process, so running queue master on all nodes isn't an issue for me and won't be for a while (until i get closer to production). Making all nodes Queue Master is sufficient for my needs at the moment.

Thanks for looking at this and good luck with that balcony :)

grantcarthew commented 8 years ago

@alexgorbatchev, Just thinking about solutions here. There is a state document I am using for the global pause feature. I could add a field to that document upon review which would cause the worker nodes to restart if they are not running max concurrency.

This means after a review has been completed, possibly all workers would query the database.

The Queue Master should not process delayed jobs unless you add a handler function, in which case it is not only a Queue Master but also a worker node.

alexgorbatchev commented 8 years ago

I could add a field to that document upon review which would cause the worker nodes to restart if they are not running max concurrency.

Would that affect delayed job as well?

The Queue Master should not process delayed jobs unless you add a handler function, in which case it is not only a Queue Master but also a worker node.

I find this very odd. I want to have a Queue Master that schedules jobs and lots of workers that process jobs. I don't want Queue Master to be a worker as well.

grantcarthew commented 8 years ago

Would that affect delayed job as well?

The idea would be to prompt worker nodes to queue-get-next-job which would find jobs that were delayed and are now available for processing.

I think an update to the state document is probably the best way to go. The logic would go something like this:

Queue Master Review => Update State Document from Queue Master => Change feed sent to all Queue objects => Queue object (worker nodes) call queue-get-next-job => Valid jobs get processed

I don't want Queue Master to be a worker as well.

The Queue Master (Publisher) does not need the handler function. Read here: https://github.com/grantcarthew/node-rethinkdb-job-queue/wiki/Queue-PubSub

A Queue object will always have the ability to add jobs. It will only get the ability to process jobs if you add a handler. Don't add a hander, it is a publisher.

TomKaltz commented 8 years ago

@grantcarthew I was wondering this same thing about remote workers and delayed jobs. Your solution sounds good.

The idea would be to prompt worker nodes to queue-get-next-job which would find jobs that were delayed and are now available for processing.

This would also negate having to poll for runnable jobs in the review process. The queue master would just be concerned with marking orphaned jobs failed and deleting old jobs, correct?

TomKaltz commented 8 years ago

Actually now that I think about it more. Will this modification trigger remote workers to process delayed jobs as well? Seems that if there are no orphaned jobs then the status document will not get updated. If there are no busy workers then a delayed job will never get processed. Should the queue master also be concerned with sending out notification when a delayed job is due to be processed?

grantcarthew commented 8 years ago

@TomKaltz, @alexgorbatchev I have mostly finished this now. Just fixing the tests. I decided to update the state document every time the Queue Master reviews the database, this will cause the Queue objects that have a process handler to query the database. In turn working on any jobs now valid.

Unless you have many queue worker nodes it shouldn't be a bit hit on the database. It is randomly offset within a second also.

TomKaltz commented 8 years ago

Yeah I saw the simple load balancing implementation. This is awesome stuff!

alexgorbatchev commented 8 years ago

...this will cause the Queue objects that have a process handler to query the database. In turn working on any jobs now valid.

I'm curious how efficient is this. If I have 20 workers and one Queue Master, will all 20 worker will query the DB at the same time when the state document gets updated? Would it be possible and/or make more sense to update the job objects so that the right change feed gets triggered for all the workers instead?

grantcarthew commented 8 years ago

I thought about it for a while Alex. I think this is the simplest solution. Best solution...?

I have added two features to prevent unnecessary work.

Firstly, a worker will only restart processing (query the db) if its q.running job count is under q.concurrency.


if (q.running < q.concurrency) {
   return restartProcessing(q)
}

Secondly, I am adding a random delay on the worker restart


function restartProcessing (q) {
  logger('restartProcessing')
  setTimeout(function randomRestart () {
    queueProcess.restart(q)
  }, Math.floor(Math.random() * 1000))
}

So if the queue is really busy and the workers are flat out processing jobs, there will be no extra db calls.

If some of your workers are not running at concurrency, they will within a random 1sec interval, do a simple indexed query to get-next-job.

I have not done any performance testing against the queue however because we are using the index, using your example, 20 workers all running a get-next-job query at random times within a second should not cause any trouble.

I could bump the random delay up to a higher value? I decided 1 second because I imagine users of the queue would want delayed jobs to start as soon as can be. An extra 10 seconds would not hurt. I could set the random period to 10000.

Would it be possible and/or make more sense to update the job objects so that the right change feed gets triggered for all the workers instead?

Even if there was only one job in the queue that had gone past the dateEnable date and was now available to process, the change feed would cause the worker nodes to query the database anyway.

TomKaltz commented 8 years ago

IMHO, I don't think the randomized pull period should be any longer than 1000ms. It would nice to have it default to 1000ms but make it configurable on the queue object.

You'd probably have to have a pretty big networked architecture with a LOT of worker processes to degrade database performance.

grantcarthew commented 8 years ago

Agree Tom. Guys the v0.4.2 has been published with the state document review implemented as above. It can be changed. I will leave this issue open until I fix the documentation.

grantcarthew commented 8 years ago

OK @alexgorbatchev, I have updated the Queue Master document: https://github.com/grantcarthew/node-rethinkdb-job-queue/wiki/Queue-Master

I also created a new document defining the State Document: https://github.com/grantcarthew/node-rethinkdb-job-queue/wiki/State-Document

That should clear up any confusion.

This was a big hole in my logic Alex. Thanks for finding it. Feel free to reopen this or make new issues if anything is unclear.

Thanks again.