jabrown85 closed this 6 years ago
I hope this is OK. I was trying to understand all the pieces and I think I got it. The problem, as I saw it, was that the expiry of the KEY used for locking was set from the same timeout parameter as the DisableConcurrentExecution attribute. That doesn't seem right. The attribute's parameter is the maximum time a subsequent job execution will wait for the lock (which RedisLock implements correctly). Reusing it as the key's TTL means the key can expire right as a waiting job re-checks for the lock; if that job then obtains the lock, you have 2 jobs running at the same time. The version in master works sometimes, but I think that's mostly by chance and depends heavily on the timeout/intervals between Redis checks.
I haven't had any time to look at the tests, but I can take a stab at writing some if you want.
Also, I think issue #65 is the same issue as this one.
I think we need to do something better here (perhaps more like https://github.com/frankhommers/Hangfire.PostgreSql/pull/44/).
The problem with my quick fix is that, since the lock never expires, a hard crash means nothing will ever run until the lock is manually removed. I think exposing a lock timeout in the options would let application builders specify the longest timespan they intend to let a lock live. A different option would be having the lock renew itself periodically, but I don't know whether a ServerFilter is the right place for that.
In the meantime, I worked around this issue by removing the attribute entirely and using a second queue with a single worker (https://discuss.hangfire.io/t/different-queues-having-different-worker-counts/114). It's less clean, but it guarantees that only one runs at a time.
@jabrown85 please check out #69, I think it should resolve all the problems here.
That looks good! That’s what I figured needed to happen based on other implementations. I hope that gets merged soon!
It worked fine for a day on my Linux server but still needs some more testing.
The key being used for the lock was given the same expiry as the timeout used for subsequent lock waits. This meant a lock would expire at essentially the same moment the job waiting for it stopped waiting. More often than not, the waiting job would actually acquire the lock while the first job that obtained it was still executing.