taoensso / carmine

Redis client + message queue for Clojure
https://www.taoensso.com/carmine
Eclipse Public License 1.0

After retrying tasks many times, they disappear #176

Closed uaalto closed 7 years ago

uaalto commented 7 years ago

After a couple of days without checking the task queues, I found that over 500 tasks had failed temporarily and returned {:status :retry}. However, none of those tasks remain in the queues, so I cannot fix the issue and let them succeed. How can this situation be reached? Am I missing some configuration?

ptaoussanis commented 7 years ago

Hi there,

I'm afraid this report is going to need more details.

> How can this situation be reached? Am I missing some configuration?

Not sure I understand the question. You're asking if it's normal for tasks to be dropped after a certain number of retries? Not on Carmine's end, but I'm not sure what your application/handler logic is.

If your handler continually returns {:status :retry} when given a task, it should keep retrying the task forever.

If your handler returns {:status :success} or {:status :error}, the task is considered completed and will be garbage collected.
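As a rough illustration of the rules above, here is a minimal sketch of a handler that only ever reports :success or :retry and logs every retry, so a disappearing task can be traced afterwards. The `process!` function and the backoff values are hypothetical; the `:message` and `:attempt` keys and the optional `:backoff-ms` follow the Carmine message queue docs as I understand them for this version.

```clojure
(ns example.mq-handler)

(defn process!
  "Hypothetical task logic: returns truthy on success, falsy on a
  temporary failure, and throws on unexpected input."
  [message]
  (if (map? message)
    (:ok? message)
    (throw (ex-info "Unexpected message" {:message message}))))

(defn handler
  "Maps the outcome of `process!` to a handler status map.
  Never returns :error, so a task is only completed via :success."
  [{:keys [message attempt]}]
  (try
    (if (process! message)
      {:status :success}                       ; completed, eligible for GC
      (do (println "Temporary failure, attempt" attempt)
          {:status :retry :backoff-ms 1000}))  ; hand the task back to the queue
    (catch Exception e
      ;; Keep unexpected exceptions inside the handler and retry instead.
      (println "Handler threw on attempt" attempt "-" (.getMessage e))
      {:status :retry :backoff-ms 5000})))

(comment
  ;; Wiring into a worker needs the carmine dependency and a running Redis.
  ;; The queue name and empty connection opts are placeholders.
  (require '[taoensso.carmine.message-queue :as car-mq])
  (def my-worker
    (car-mq/worker {:pool {} :spec {}} "my-queue" {:handler handler})))
```

Because the handler is a plain function of a map, it can be called directly at the REPL, e.g. `(handler {:message {:ok? false} :attempt 3})`, to check which status a given input produces before attaching it to a worker.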

uaalto commented 7 years ago

Thanks for your reply @ptaoussanis. I've been busy trying to fix some issues. We are currently using this task system for very sensitive processes that need to be reliable.

We've also experienced another issue where tasks disappeared spontaneously. That bug seemed to stop happening once we stopped setting the :threads parameter. Unfortunately, this is very hard to replicate, and we have the system in production. I intended to set up a full test suite to stress the task system and try to trigger the bug, but I haven't found the time for that yet.

Answering your questions:

What version of Carmine? [com.taoensso/carmine "2.13.3-uaalto"]

What do you mean by a "couple" of days? Tasks have been accumulating and disappearing for a couple of days. That's the time they've had to disappear.

What do you mean by "many" times? The number of times a task is retried before disappearing is not consistent. It might be 10-50, and it varies, IIRC.

Have you confirmed that the tasks didn't in fact eventually successfully execute after retrying? The tasks were only triggering the retry. I can't demonstrate 100% that this is the case, but tasks have also disappeared, as I mentioned, under other circumstances.

Have you confirmed that your Redis instance hasn't been pruning keys because of memory limitations, etc.? I haven't confirmed that. We don't have memory limitations at the moment, but how is this even triggered in Redis, and why would it select these keys?
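For context on the pruning question: Redis only evicts keys when a `maxmemory` limit is set and reached, and which keys it picks depends on `maxmemory-policy` (the `allkeys-*` policies consider every key, the `volatile-*` policies only keys with a TTL). One way to check would be to read the `evicted_keys` counter from `INFO stats`. The sketch below is only an illustration: `parse-info` and `eviction-counts` are hypothetical helpers, and the Carmine calls in the comment assume a default local connection.

```clojure
(require '[clojure.string :as str])

(defn parse-info
  "Parses raw Redis INFO text into a map of string keys to string values.
  Section headers (lines starting with #) and blank lines are skipped."
  [info-text]
  (into {}
        (for [line (str/split-lines info-text)
              :let [line (str/trim line)]
              :when (and (seq line) (not (str/starts-with? line "#")))
              :let [[k v] (str/split line #":" 2)]
              :when v]
          [k v])))

(defn eviction-counts
  "Extracts the eviction and expiry counters from INFO text,
  defaulting to 0 when a counter is absent."
  [info-text]
  (let [m (parse-info info-text)]
    {:evicted-keys (Long/parseLong (get m "evicted_keys" "0"))
     :expired-keys (Long/parseLong (get m "expired_keys" "0"))}))

(comment
  ;; Needs the carmine dependency and a running Redis; {} means default conn opts.
  (require '[taoensso.carmine :as car])
  (eviction-counts (car/wcar {} (car/info "stats")))
  (car/wcar {} (car/config-get "maxmemory-policy")))
```

A non-zero `:evicted-keys` value would suggest that Redis itself has been dropping keys, while a value of 0 would point back at the application or queue logic.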

Since these are two bugs that I can hardly show evidence for, and that seem to be the same or closely related, I think the best way to find out is to stress-test the system. I planned to do that, but probably won't get to it any time soon since I have many important tasks ATM.

ptaoussanis commented 7 years ago

Hi Ulysses, sorry for the delay replying.

Think the best way of proceeding on this if you're still having problems (?) would be trying to produce some kind of reproducible example that I could look at and debug from my end.

uaalto commented 7 years ago

I haven't been able to reproduce this in a long time. Closing. Thanks for your help @ptaoussanis!

ptaoussanis commented 7 years ago

No problem, thanks for the update :-)