imranmomin / Hangfire.AzureCosmosDb

Azure Cosmos DB storage provider for Hangfire
https://www.hangfire.io/
MIT License
17 stars 16 forks source link

Locks don't get removed, when hangfire doesn't shutdown application gracefully #45

Closed rlekni closed 2 years ago

rlekni commented 2 years ago

Whenever Hangfire fails to gracefully shutdown, locks don't get removed from the database, meaning, once the application restarts, or if there's another Hangfire server, they go into a failed state, as the locks already exist (even if they are stale).

When locks exist, exception is thrown due eventual timeout. Due to this causing Failed state for the Hangfire servers. It causes Jobs not getting processed Related issue.

Workaround: Manually remove the locks from the database.

Example logs: {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [locks:schedulepoller]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [locks:schedulepoller]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to acquire lock [locks:schedulepoller]. Will try after [2] seconds","State":{"Message":"Unable to acquire lock [locks:schedulepoller]. Will try after [2] seconds"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [recurring-jobs:lock]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [recurring-jobs:lock]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to acquire lock [recurring-jobs:lock]. Will try after [2] seconds","State":{"Message":"Unable to acquire lock [recurring-jobs:lock]. Will try after [2] seconds"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [locks:counters:aggregator]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [locks:counters:aggregator]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to acquire lock [locks:counters:aggregator]. Will try after [2] seconds","State":{"Message":"Unable to acquire lock [locks:counters:aggregator]. Will try after [2] seconds"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [locks:schedulepoller]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [locks:schedulepoller]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to acquire lock [locks:schedulepoller]. Will try after [2] seconds","State":{"Message":"Unable to acquire lock [locks:schedulepoller]. Will try after [2] seconds"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [recurring-jobs:lock]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [recurring-jobs:lock]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to acquire lock [recurring-jobs:lock]. Will try after [2] seconds","State":{"Message":"Unable to acquire lock [recurring-jobs:lock]. Will try after [2] seconds"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [locks:counters:aggregator]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [locks:counters:aggregator]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to acquire lock [locks:counters:aggregator]. Will try after [2] seconds","State":{"Message":"Unable to acquire lock [locks:counters:aggregator]. Will try after [2] seconds"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [locks:schedulepoller]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [locks:schedulepoller]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to acquire lock [locks:schedulepoller]. Will try after [2] seconds","State":{"Message":"Unable to acquire lock [locks:schedulepoller]. Will try after [2] seconds"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [recurring-jobs:lock]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [recurring-jobs:lock]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to acquire lock [recurring-jobs:lock]. Will try after [2] seconds","State":{"Message":"Unable to acquire lock [recurring-jobs:lock]. Will try after [2] seconds"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Server.ServerJobCancellationWatcher","Message":"Checking for aborted jobs...","State":{"Message":"Checking for aborted jobs..."}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Server.ServerJobCancellationWatcher","Message":"No newly aborted jobs found.","State":{"Message":"No newly aborted jobs found."}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [locks:counters:aggregator]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [locks:counters:aggregator]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to acquire lock [locks:counters:aggregator]. Will try after [2] seconds","State":{"Message":"Unable to acquire lock [locks:counters:aggregator]. Will try after [2] seconds"}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [locks:schedulepoller]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [locks:schedulepoller]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Error","Category":"Hangfire.Processing.BackgroundExecution","Message":"Execution DelayedJobScheduler is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04","Exception":"Hangfire.Azure.CosmosDbDistributedLockException: Could not place a lock [locks:schedulepoller]: Lock timeout reached [60] seconds. at Hangfire.Azure.CosmosDbDistributedLock.Acquire(TimeSpan timeout) at Hangfire.Azure.CosmosDbDistributedLock..ctor(String resource, TimeSpan timeout, CosmosDbStorage storage) at Hangfire.Azure.CosmosDbConnection.AcquireDistributedLock(String resource, TimeSpan timeout) at Hangfire.Server.DelayedJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func\u00602 action) at Hangfire.Server.DelayedJobScheduler.EnqueueNextScheduledJobs(BackgroundProcessContext context) at Hangfire.Server.DelayedJobScheduler.Execute(BackgroundProcessContext context) at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state) at Hangfire.Processing.BackgroundExecution.Run(Action\u00602 callback, Object state)","State":{"Message":"Execution DelayedJobScheduler is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04"}} {"EventId":0,"LogLevel":"Debug","Category":"Hangfire.Processing.BackgroundExecution","Message":"Execution loop DelayedJobScheduler:c0c3aa61 caught an exception and will be retried in 00:00:04","Exception":"Hangfire.Azure.CosmosDbDistributedLockException: Could not place a lock [locks:schedulepoller]: Lock timeout reached [60] seconds. at Hangfire.Azure.CosmosDbDistributedLock.Acquire(TimeSpan timeout) at Hangfire.Azure.CosmosDbDistributedLock..ctor(String resource, TimeSpan timeout, CosmosDbStorage storage) at Hangfire.Azure.CosmosDbConnection.AcquireDistributedLock(String resource, TimeSpan timeout) at Hangfire.Server.DelayedJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func\u00602 action) at Hangfire.Server.DelayedJobScheduler.EnqueueNextScheduledJobs(BackgroundProcessContext context) at Hangfire.Server.DelayedJobScheduler.Execute(BackgroundProcessContext context) at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state) at Hangfire.Processing.BackgroundExecution.Run(Action\u00602 callback, Object state)","State":{"Message":"Execution loop DelayedJobScheduler:c0c3aa61 caught an exception and will be retried in 00:00:04"}} {"EventId":0,"LogLevel":"Debug","Category":"Hangfire.Processing.BackgroundExecution","Message":"Execution loop DelayedJobScheduler:c0c3aa61 will be retried in 00:00:04...","State":{"Message":"Execution loop DelayedJobScheduler:c0c3aa61 will be retried in 00:00:04..."}} {"EventId":0,"LogLevel":"Trace","Category":"Hangfire.Azure.CosmosDbDistributedLock","Message":"Unable to create a lock [recurring-jobs:lock]. Status - 409 Conflict. Lock already exists","State":{"Message":"Unable to create a lock [recurring-jobs:lock]. Status - 409 Conflict. Lock already exists"}} {"EventId":0,"LogLevel":"Error","Category":"Hangfire.Processing.BackgroundExecution","Message":"Execution RecurringJobScheduler is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04","Exception":"Hangfire.Azure.CosmosDbDistributedLockException: Could not place a lock [recurring-jobs:lock]: Lock timeout reached [60] seconds. at Hangfire.Azure.CosmosDbDistributedLock.Acquire(TimeSpan timeout) at Hangfire.Azure.CosmosDbDistributedLock..ctor(String resource, TimeSpan timeout, CosmosDbStorage storage) at Hangfire.Azure.CosmosDbConnection.AcquireDistributedLock(String resource, TimeSpan timeout) at Hangfire.Server.RecurringJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func\u00602 action) at Hangfire.Server.RecurringJobScheduler.EnqueueNextRecurringJobs(BackgroundProcessContext context) at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context) at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state) at Hangfire.Processing.BackgroundExecution.Run(Action\u00602 callback, Object state)","State":{"Message":"Execution RecurringJobScheduler is in the Failed state now due to an exception, execution will be retried no more than in 00:00:04"}} {"EventId":0,"LogLevel":"Debug","Category":"Hangfire.Processing.BackgroundExecution","Message":"Execution loop RecurringJobScheduler:7151e5fa caught an exception and will be retried in 00:00:04","Exception":"Hangfire.Azure.CosmosDbDistributedLockException: Could not place a lock [recurring-jobs:lock]: Lock timeout reached [60] seconds. at Hangfire.Azure.CosmosDbDistributedLock.Acquire(TimeSpan timeout) at Hangfire.Azure.CosmosDbDistributedLock..ctor(String resource, TimeSpan timeout, CosmosDbStorage storage) at Hangfire.Azure.CosmosDbConnection.AcquireDistributedLock(String resource, TimeSpan timeout) at Hangfire.Server.RecurringJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func\u00602 action) at Hangfire.Server.RecurringJobScheduler.EnqueueNextRecurringJobs(BackgroundProcessContext context) at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context) at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state) at Hangfire.Processing.BackgroundExecution.Run(Action\u00602 callback, Object state)","State":{"Message":"Execution loop RecurringJobScheduler:7151e5fa caught an exception and will be retried in 00:00:04"}} {"EventId":0,"LogLevel":"Debug","Category":"Hangfire.Processing.BackgroundExecution","Message":"Execution loop RecurringJobScheduler:7151e5fa will be retried in 00:00:04...","State":{"Message":"Execution loop RecurringJobScheduler:7151e5fa will be retried in 00:00:04..."}}

imranmomin commented 2 years ago

@rlekni - The lock document have TTL defined. CosmosDB should remove them eventually. Please check the TTL setting for the container

https://github.com/imranmomin/Hangfire.AzureCosmosDb/blob/362a97ca46c5aa24907d8321394758a8eb85a04f/src/CosmosDbDistributedLock.cs#L93-L102

rlekni commented 2 years ago

@rlekni - The lock document have TTL defined. CosmosDB should remove them eventually. Please check the TTL setting for the container

https://github.com/imranmomin/Hangfire.AzureCosmosDb/blob/362a97ca46c5aa24907d8321394758a8eb85a04f/src/CosmosDbDistributedLock.cs#L93-L102

Ah I see, had ttl disabled on a collection. Closing this issue