seomoz / qless-core

Core Lua Scripts for qless
MIT License

Cleanup failed jobs. #84

Closed moteus closed 5 years ago

moteus commented 5 years ago

What is the correct way to keep only the last N failed jobs? For now I am trying to figure out whether qless fits my use case. In my use case I can just throw a job away and forget about it. We have separate logging infrastructure and can check the logs there. I just need to write metrics about the number of failures to Graphite. Each worker will simply try to complete the job or call retry with some delay. If the number of retries is exhausted, qless marks the job as failed and never removes it. (I have over 10M messages per day, and around 40% will be marked as failed because they cannot be completed.) The only solution I see is to keep my own counter and mark all jobs as completed. But maybe there is some efficient way to remove all failed jobs from the queue?

dlecocq commented 5 years ago

Thanks for the question!

The philosophy of qless is generally that if a job fails, it may need attention, which is why successfully-completed jobs eventually expire out of the system but failed jobs stay around indefinitely.

If you truly don't want to hold on to failed jobs, then part of your strategy might be to try / catch everything in the job code and increment your failure counter in the catch. However, that won't help the retries-exhausted type failures.
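A minimal sketch of that try / catch approach, in Python. The `FakeJob` class, its `complete()` method, and the `run_and_complete` helper are hypothetical stand-ins for illustration, not the API of any particular qless client:

```python
from collections import Counter


class FakeJob:
    """Hypothetical stand-in for a job object popped from a queue."""

    def __init__(self, jid):
        self.jid = jid
        self.completed = False

    def complete(self):
        # A real client would tell qless the job is done here.
        self.completed = True


def run_and_complete(job, handler, failure_counts):
    """Run handler(job); count any exception, then complete the job anyway."""
    succeeded = True
    try:
        handler(job)
    except Exception as exc:
        # Count the failure by exception type instead of failing the job.
        failure_counts[type(exc).__name__] += 1
        succeeded = False
    job.complete()
    return succeeded


def always_fails(job):
    raise ValueError("cannot process " + job.jid)


failure_counts = Counter()
job = FakeJob("job-id-1")
ok = run_and_complete(job, always_fails, failure_counts)
```

The counter can then be flushed to a metrics system on whatever schedule suits. As noted, this only covers exceptions raised inside the job code, not jobs that fail because their retries ran out.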

Two APIs can help: failed and cancel. The failed API reports which failure groups exist and how many jobs are in each, returning something like:

{
  'failure-type-1': 17,
  'failure-type-2': 83,
  ...
}

In practice, a group is usually named after the uncaught exception, or is failed-retries-<queue-name> for jobs that exhausted their retries.

That same API can also accept a type to get the actual jobs. Using the example above, we could call it with failure-type-1 to get a response something like this:

{
  'total': 17,
  'jobs': ['job-id-1', 'job-id-2', ...]
}

With a list of all the job IDs you want to cancel, you can use the cancel API. It accepts an arbitrary number of job IDs, so you can cancel jobs in large batches as well.
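The list-then-cancel flow described above could be sketched roughly as follows. The client interface here is an assumption: `failed()` and `cancel()` mirror the two APIs by name, the `start` / `limit` paging arguments are assumed, and `FakeClient` is a hypothetical in-memory stand-in so the sketch runs without Redis:

```python
class FakeClient:
    """Hypothetical in-memory stand-in for a qless client."""

    def __init__(self, failed_jobs):
        self._failed = {g: list(jids) for g, jids in failed_jobs.items()}

    def failed(self, group=None, start=0, limit=25):
        # Without a group: map each non-empty failure group to its job count.
        if group is None:
            return {g: len(j) for g, j in self._failed.items() if j}
        jids = self._failed.get(group, [])
        return {"total": len(jids), "jobs": jids[start:start + limit]}

    def cancel(self, *jids):
        # Cancelling a job removes it from whichever group holds it.
        doomed = set(jids)
        for group, members in self._failed.items():
            self._failed[group] = [j for j in members if j not in doomed]


def purge_failed(client, batch_size=100):
    """Cancel every failed job; return the number cancelled per group."""
    cancelled = {}
    for group in list(client.failed()):
        count = 0
        while True:
            # Always read from offset 0: cancelled jobs leave the group.
            jids = client.failed(group, 0, batch_size)["jobs"]
            if not jids:
                break
            client.cancel(*jids)
            count += len(jids)
        cancelled[group] = count
    return cancelled


client = FakeClient({
    "failure-type-1": ["a-%d" % i for i in range(17)],
    "failure-type-2": ["b-%d" % i for i in range(83)],
})
counts = purge_failed(client, batch_size=25)
```

The per-group counts returned by `purge_failed` could double as the failure metrics before the jobs are discarded. The loop assumes a cancelled job no longer shows up in its failure group; if that does not hold for a given client, it would need a guard against looping forever.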

moteus commented 5 years ago

Thank you for the answer. My plan is:

  1. Do not fail jobs; just mark all of them as complete.
  2. Create a simple cron job to clean up failed tasks (e.g. if a worker crashed).

Do you think it is worth extending the qless-core API to clean up failed tasks?

dlecocq commented 5 years ago

There are a couple wrinkles with the possibility of extending the core API to clean up failed jobs:

All that said, I wouldn't object to such an API - I think others would use it. I don't personally have the bandwidth for it, though.