plivo / sharq

The core SHARQ Library that powers SHARQ Server
http://sharq.io
MIT License

Maximum Number of Attempts (requeues) #1

Closed · anemitz closed this issue 9 years ago

anemitz commented 9 years ago

One key component of the dequeue/requeue flow seems to be the ability to stop requeuing a job after some number of failed attempts. This is often a requirement so that the queue doesn't blow up in cases where jobs can safely be dropped past a certain point.

The only way this seems possible today is to dequeue => (fail) => finish => enqueue, incrementing a counter within the payload (a rough sketch of this workaround follows). This isn't a great solution, so I propose adding a read-only 'attempt' key within the SharQ message, as shown in the Dequeue example below.
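For context, a rough sketch of that workaround, assuming the SharQ Python library's enqueue/dequeue/finish calls; the config path, process function, and MAX_ATTEMPTS are illustrative, not part of SharQ:

from sharq import SharQ

sq = SharQ('/etc/sharq.conf')  # illustrative config path
MAX_ATTEMPTS = 3  # illustrative: the caller has to track this today

response = sq.dequeue(queue_type='sms')
if response['status'] == 'success':
    payload = response['payload']
    try:
        process(payload)  # hypothetical worker function
        failed = False
    except Exception:
        failed = True
    # finish in both cases so SharQ doesn't requeue the job automatically
    sq.finish(queue_type='sms', queue_id=response['queue_id'],
              job_id=response['job_id'])
    attempt = payload.get('attempt', 0) + 1
    if failed and attempt <= MAX_ATTEMPTS:
        # re-enqueue a copy, bumping the counter inside the payload
        payload['attempt'] = attempt
        sq.enqueue(job_id=response['job_id'], interval=1000,
                   payload=payload, queue_type='sms',
                   queue_id=response['queue_id'])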

Dequeue

{
  "job_id": "b81c07a7-5bba-4790-ab40-a061994088c1",
  "payload": {
    "hello": "world"
  },
  "queue_id": "1",
  "attempt": 0,
  "status": "success"
}

Enqueue

On enqueue the job's requeue attempt count would be set to 0.

Each enqueue operation can optionally set a maximum requeue limit, much like the interval, which overrides the global job_requeue_limit setting; a sketch of this bookkeeping follows the example below.

{
  "job_id": "b81c07a7-5bba-4790-ab40-a061994088c1",
  "interval": 1000,
  "requeue_limit": 2,
  "payload": {"hello": "world"}
}
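For illustration only, the enqueue-side bookkeeping could look something like this sketch; build_job and the configparser-style config object are assumptions, not actual SharQ internals:

def build_job(request_body, config):
    # illustrative: start the requeue counter at zero and resolve the
    # effective limit, falling back to the global setting
    return {
        'job_id': request_body['job_id'],
        'interval': request_body['interval'],
        'payload': request_body['payload'],
        'attempt': 0,
        'requeue_limit': request_body.get(
            'requeue_limit',
            config.getint('sharq', 'job_requeue_limit')),
    }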

Requeue

For each job that needs to be requeued:
    if attempt >= config.get('sharq', 'job_requeue_limit'):
        delete the job from the queue
    ....

Configuration

[sharq]
job_expire_interval  : 1000 ; in milliseconds
job_requeue_interval : 1000 ; in milliseconds

; Number of times a job can be requeued (optional). This means it's only possible to 
; dequeue the job job_requeue_limit + 1 times. By default a job will be requeued
; indefinitely.
job_requeue_limit : 2 ; a job can only be dequeued 3 times (or requeued twice) 
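For reference, the setting can be read with Python's standard configparser; a minimal sketch (Python 3, illustrative path), assuming a missing job_requeue_limit means "requeue indefinitely":

from configparser import ConfigParser

config = ConfigParser(inline_comment_prefixes=(';',))
config.read('/etc/sharq.conf')

# the setting is optional; None here means "requeue indefinitely"
limit = config.getint('sharq', 'job_requeue_limit', fallback=None)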
anemitz commented 9 years ago

After a bit more thought, it might make sense to only set the limit at the enqueue operation level. That would keep things a bit simpler from an implementation perspective, since you want to be able to specify a per-queue retry limit (e.g. for sending SMSs you might retry 10 times, but for webhooks only 3).

sandeepraju commented 9 years ago

Thanks for the feedback and your thoughts on this feature. After reading through your suggestion, here is what I think we can do.

There will be a configuration parameter called default_job_requeue_limit, set to the number of requeues a job will undergo before being discarded from SHARQ. Its default value will be -1 (which means requeue infinitely, so that users don't have to change anything in their code when they upgrade). The configuration file will look like this (a sketch of the -1 check follows it):

[sharq]
job_expire_interval       : 1000 ; in milliseconds
job_requeue_interval      : 1000 ; in milliseconds
default_job_requeue_limit : -1 ; retries infinitely

[sharq-server]
host                      : 127.0.0.1
port                      : 8080
workers                   : 1 ; optional
accesslog                 : /tmp/sharq.log ; optional

[redis]
db                        : 0
key_prefix                : sharq_server
conn_type                 : tcp_sock ; or unix_sock
;; unix connection settings
unix_socket_path          : /tmp/redis.sock
;; tcp connection settings
port                      : 6379
host                      : 127.0.0.1
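A minimal sketch of the requeue check implied by the -1 default; should_requeue is illustrative, not the actual SharQ internals:

def should_requeue(requeues_done, limit):
    # -1 is the sentinel for "requeue indefinitely"
    if limit == -1:
        return True
    return requeues_done < limit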

Then, the requeue_limit can be specified per queue (sms/1, sms/2, webhooks/1) in the enqueue request. This parameter is optional (the value from the config is used if it is not specified). A sample enqueue request with this feature turned on looks like this:

curl -H "Accept: application/json" \
-H "Content-type: application/json" \
-X POST -d ' {"job_id": "b81c07a7-5bba-4790-ab40-a061994088c1", "interval": 1000, "requeue_limit": 5, "payload": {"message": "hello, world"}}' \
http://localhost:8080/enqueue/sms/1/

The response to a dequeue request includes the number of retries left (pending_attempts) upon each dequeue, as follows:

{
  "job_id": "b81c07a7-5bba-4790-ab40-a061994088c1",
  "payload": {
    "message": "hello, world"
  },
  "queue_id": "1",
  "pending_attempts": 4,
  "status": "success"
}

pending_attempts shows how many more dequeue attempts can be made before the job is removed from SharQ. It is more helpful than just the attempt number, as the worker can tell when a job is about to die (regardless of the limit set on the queue) and, say, trigger an alert to the admin, as sketched below.
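For example, a worker could watch pending_attempts and alert before a job dies; a sketch assuming the response shape above, with sq as in the earlier sketch and alert_admin as a hypothetical hook:

ALERT_THRESHOLD = 1  # illustrative: alert when only one requeue remains

response = sq.dequeue(queue_type='sms')
if response.get('status') == 'success' and \
        response['pending_attempts'] <= ALERT_THRESHOLD:
    # hypothetical alerting hook; after pending_attempts more failed
    # dequeues the job is removed from SharQ
    alert_admin(response['job_id'], response['pending_attempts'])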

Once a job exceeds its requeue limit, it will be deleted from SHARQ and cannot be recovered.

anemitz commented 9 years ago

Overall LGTM.

I like the pending_attempts -- great to be able to alert.

Minor -- maybe something like requeues_remaining instead of pending_attempts? Not sure if you want to mix attempt and requeue language?

sandeepraju commented 9 years ago

requeues_remaining :thumbsup:

sandeepraju commented 9 years ago

The code with test cases is here:

sharq: https://github.com/plivo/sharq/compare/requeue-limit-feature
sharq-server: https://github.com/plivo/sharq-server/compare/requeue-limit-feature

@anemitz you can try these out. I'll run some more manual tests and merge it into master tomorrow :smiley:

anemitz commented 9 years ago

:+1:

anemitz commented 9 years ago

merge time?

sandeepraju commented 9 years ago

@anemitz I just pushed it to PyPI now. It should be available via pip.