marcoCasamento / Hangfire.Redis.StackExchange

HangFire Redis storage based on original (and now unsupported) Hangfire.Redis but using lovely StackExchange.Redis client
Other
448 stars 108 forks source link

Requeue dead jobs properly #151

Closed Lexy2 closed 3 weeks ago

Lexy2 commented 2 months ago

Fixes #135

This PR includes several changes:

  1. When a RedisFetchedJob is created / fetched, it also records the Fetched time not only in its Fetched hash entry, but also as a property in its object. As such, when the job is requeued or removed from queue, its Fetched hash entry is only removed if the values match (i.e. if there was no other process that updated this value before the current process). The same logic applies to the dequeued list.

    • The two concurrent processes trying to remove and requeue the jobs are ServerJobCancellationWatcher and FetchedJobsWatcher.
  2. When FetchedJobsWatcher checks whether the job should be requeued, it also checks if this job is being executed on a live server (i.e. the server that has not been removed from the list of servers by ServerWatchdog). Note that it will not requeue the job that is being executed on a possibly aborted server that is still present in the list, judging by last heartbeat time. The reason is that server heartbeat and server timeout are values configurable per server, and the current FetchedJobsWatcher process has no insight into how these settings may be configured on other servers. Thus, it can't make decisions whether a particular server is aborted or not.

  3. RedisStorage transactional feature flags are now reported to the core library depending on how the user configured them. If the user specified UseTransactions = false, the storage reports to the core library that it does not support transactions.

  4. Deduplicated RedisStorage constructors. The way to determine what parameter is set is questionable.

  5. When a RedisFetchedJob is requeued, it now publishes its id to the subscription channel for immediate execution.

Lexy2 commented 1 month ago

@marcoCasamento how's the testing going?

Lexy2 commented 3 weeks ago

@marcoCasamento can I help you with any concerns related to this PR?

marcoCasamento commented 3 weeks ago

Absolutely no problem, but very few tests too. I usually put big changes under tests in some Test environments where I run a few hundreds of thousands of jobs daily, but there are no plans at the moment to release anything there, and at this point is totally possible that the first window will be after the summer. I don't believe this is a time that can fit a useful and apparently perfectly working PR, neither it's the level of testing a small community project is supposed to have. At this point I believe the best path is to simply push the release and deal with issues, should it ever appears. What do you think ?

Lexy2 commented 3 weeks ago

Given that the release with the previous fix is unlisted, I think it's worth publishing it now and dealing with regressions when they arrive.

I am running a forked build of this release in my environment and I'm seeing no issues.

So I'd go for it, wait for issues, and fix them forward. 🤝

Lexy2 commented 3 weeks ago

And when you approve & merge, please also publish an updated release to NuGet.

marcoCasamento commented 3 weeks ago

And when you approve & merge, please also publish an updated release to NuGet.

Of course. Thanks for your work and support

FixRM commented 3 weeks ago

Thank you guys!