Cleanup failed jobs.

moteus commented 5 years ago

What is the correct way to keep only the last N failed jobs? For now I am trying to figure out whether qless fits my use case. In my use case I can just throw a job away and forget about it: we have a separate logging infrastructure and can check the logs there. I only need to write metrics about the number of failures to Graphite. Each worker simply tries to complete a job or calls retry with some delay. Once the retries are exhausted, qless marks the job as failed and never removes it. (I have over 10M messages per day, and around 40% will be marked as failed because they cannot be completed.) The only solution I see is to keep my own counter and mark all jobs as complete. But maybe there is some efficient way to remove all failed jobs from the queue?

dlecocq commented 5 years ago

Thanks for the question!

The philosophy of qless is generally that if a job fails, it may need attention, which is why successfully-completed jobs eventually expire out of the system but failed jobs stay around indefinitely.

If you truly don't want to hold on to failed jobs, then part of your strategy might be to try / catch everything in the job code and increment your failure counter in the catch. However, that won't help the retries-exhausted type failures.
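As a rough sketch of that try / catch idea (not qless's actual bindings): the wrapper below runs the job body, counts any exception by its class name, and completes the job either way. `FakeJob`, its `complete()` method, and `run_and_swallow_failures` are hypothetical names used only for illustration; a real worker would call whatever completion method its qless binding provides.

```python
from typing import Any, Callable, Dict


class FakeJob:
    """Stand-in for a real job object, used only for this illustration."""

    def __init__(self) -> None:
        self.completed = False

    def complete(self) -> None:
        self.completed = True


def run_and_swallow_failures(
    job: Any,
    work: Callable[[Any], None],
    failure_counts: Dict[str, int],
) -> bool:
    """Run work(job); on any exception, count it and still complete the job.

    Returns True if the work succeeded, False if an exception was swallowed.
    """
    try:
        work(job)
        succeeded = True
    except Exception as exc:
        # Count failures per exception class so they can be reported as metrics.
        group = type(exc).__name__
        failure_counts[group] = failure_counts.get(group, 0) + 1
        succeeded = False
    # Assumed method name: marks the job as done so it never enters the failed state.
    job.complete()
    return succeeded
```

The counter is a plain dict here; in practice it would be flushed to whatever metrics system is in use. As noted above, this does not cover jobs that fail because their retries ran out, since that happens outside the job code.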

There are two APIs that can help. The first is the failed API, which indicates what failure groups exist and how many jobs are in each, returning something like:

{
  'failure-type-1': 17,
  'failure-type-2': 83,
  ...
}

Realistically, the groups are generally reflective of the uncaught exception, or <queue-name>-failed-retries.
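If the goal is just to report failure counts to Graphite, a mapping shaped like the one above could be turned into Graphite plaintext lines (`path value timestamp`). This is only a sketch: the `qless.failed` prefix and the `graphite_lines` helper are made-up names, and how the counts are fetched depends on the binding in use.

```python
import time
from typing import Dict, List, Optional


def graphite_lines(
    counts: Dict[str, int],
    prefix: str = "qless.failed",  # hypothetical metric prefix
    timestamp: Optional[int] = None,
) -> List[str]:
    """Format failure-group counts as Graphite plaintext protocol lines."""
    ts = int(time.time()) if timestamp is None else timestamp
    lines = []
    for group, count in sorted(counts.items()):
        # Dots separate path components in Graphite and spaces separate
        # fields, so replace both inside the group name.
        safe_group = group.replace(".", "_").replace(" ", "_")
        lines.append(f"{prefix}.{safe_group} {count} {ts}")
    return lines
```

Each line, terminated by a newline, can then be written to the Carbon plaintext listener.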

That same API can also accept a type to get the actual jobs. Using the example above, we could call it with failure-type-1 to get a response something like this:

{
  'total': 17,
  'jobs': ['job-id-1', 'job-id-2', ...]
}

With a list of all the job IDs you want to cancel, you can use the cancel API. It accepts an arbitrary number of job IDs, so you can cancel jobs in large batches as well.
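Putting the two calls together, a cleanup loop might look roughly like the sketch below. The `FailedJobsClient` interface (`failed_groups`, `failed_jobs`, `cancel`) is a hypothetical wrapper that mirrors the two APIs described above, not the method names of any particular qless binding, and `InMemoryClient` is a fake used only to make the example runnable.

```python
from typing import Any, Dict, List, Optional, Protocol


class FailedJobsClient(Protocol):
    """Hypothetical wrapper around the failed and cancel APIs."""

    def failed_groups(self) -> Dict[str, int]: ...
    def failed_jobs(self, group: str, start: int, limit: int) -> Dict[str, Any]: ...
    def cancel(self, *jids: str) -> None: ...


class InMemoryClient:
    """Fake client for illustration; keeps failed job IDs in a dict."""

    def __init__(self, failed: Dict[str, List[str]]) -> None:
        self._failed = {group: list(jids) for group, jids in failed.items()}

    def failed_groups(self) -> Dict[str, int]:
        return {g: len(j) for g, j in self._failed.items() if j}

    def failed_jobs(self, group: str, start: int, limit: int) -> Dict[str, Any]:
        jobs = self._failed.get(group, [])
        return {"total": len(jobs), "jobs": jobs[start:start + limit]}

    def cancel(self, *jids: str) -> None:
        doomed = set(jids)
        for group in self._failed:
            self._failed[group] = [j for j in self._failed[group] if j not in doomed]


def cancel_failed(client: FailedJobsClient, group: str, batch_size: int = 100) -> int:
    """Cancel every failed job in a group, one batch at a time.

    Returns the number of job IDs submitted for cancellation.
    """
    submitted = 0
    last_batch: Optional[List[str]] = None
    while True:
        # Always read from offset 0: cancelled jobs drop out of the group,
        # so the next batch moves to the front.
        page = client.failed_jobs(group, 0, batch_size)
        jids = list(page["jobs"])
        # Stop when the group is empty, or when the same batch comes back,
        # which means the previous cancel made no progress.
        if not jids or jids == last_batch:
            return submitted
        client.cancel(*jids)
        submitted += len(jids)
        last_batch = jids
```

For example, with the fake client:

```python
client = InMemoryClient({"failure-type-1": [f"job-id-{i}" for i in range(17)]})
cancel_failed(client, "failure-type-1", batch_size=5)
```

The no-progress check is there so the loop terminates even if a batch cannot be cancelled for some reason.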

moteus commented 5 years ago

Thank you for the answer. My plan is:

1. Do not fail jobs; just mark all of them as complete.
2. Create a simple cron job to clean up failed tasks (e.g. if a worker crashed).

Do you think it is worth extending the qless-core API to clean up failed tasks?

dlecocq commented 5 years ago

There are a couple wrinkles with the possibility of extending the core API to clean up failed jobs:

- Jobs that have dependent jobs cannot be canceled directly. This was a design decision, as it forces the user to explicitly either remove the dependencies or ensure that the job completes.
- It's a little counter to the original intention, though that intention might have outlived its usefulness. Originally we figured that most jobs are important enough that, once enqueued, we want to make sure they eventually succeed (rather than making a best effort and clearing them out). With the benefit of hindsight, though, that's not a mode that everyone uses or needs to use.

All that said, I wouldn't object to such an API - I think others would use it. I don't personally have the bandwidth for it, though.

seomoz / qless-core

Cleanup failed jobs. #84