Duplicate job runs after 15 mins of execution because `fetchedAt` is not getting updated

sanisoclem commented 1 year ago

When executing long running jobs, the job re-runs again after ~15 mins. After reading through the code a bit, it seems it is because it picks up the job again after the invisibility timeout of 15 minutes. ~~I don't see anyway to override this, is there any way around this?~~

Update: The jobs are being run again because the fetched_at field is not getting updated because the latest etag value is not used. see: https://github.com/imranmomin/Hangfire.AzureCosmosDb/issues/54#issuecomment-1386891221

Thanks

imranmomin commented 1 year ago

@sanisoclem when a job is fetched, a callback will keep updating the fetched_at property until the job is completed.

The invisibility timeout is for use case when the job stop due to some exception and so it can re-try after the timeout

You can further investigate, if the callback is doing its task https://github.com/imranmomin/Hangfire.AzureCosmosDb/blob/0a11016f6b21428d4e95bc4657b429ce5c25d1b4/src/Queue/Entities/FetchedJob.cs#L115-L154

sanisoclem commented 1 year ago

Hi thanks for looking into this. I have checked directly in cosmos db, while fetched_at a few seconds after the 'processing' document has been created but is not updated after that (it will be updated when it starts processing again).

I suspect its because the timer for the callback has an infinite interval.

https://github.com/imranmomin/Hangfire.AzureCosmosDb/blob/0a11016f6b21428d4e95bc4657b429ce5c25d1b4/src/Queue/Entities/FetchedJob.cs#L135-L136

https://github.com/imranmomin/Hangfire.AzureCosmosDb/blob/0a11016f6b21428d4e95bc4657b429ce5c25d1b4/src/Queue/Entities/FetchedJob.cs#L32

if I try to emulate this locally using:

var s = DateTime.UtcNow;
var timer = new System.Threading.Timer(t =>
{
    var n = DateTime.UtcNow - s;
    Console.WriteLine($"+{Math.Round(n.TotalSeconds, 2)} - Timer invoked");

}, null, 1000, 5000);

await Task.Delay(100000);

I get the following output:

+1 - Timer invoked
+6.01 - Timer invoked
+11 - Timer invoked
+16 - Timer invoked
+21.01 - Timer invoked

And setting the last parameter to infinite only invokes the timer once.

imranmomin commented 1 year ago

@sanisoclem yes that is the idea to invoke once and schedule the next. This avoid multiple threads if one got delay to finish the callback

sanisoclem commented 1 year ago

Apologies, I did not read that correctly. I'll have a look again - the fetched_at field is not getting updated for me (from what I understand, this should be updated every 15 seconds by default, while the job is running).

sanisoclem commented 1 year ago

I finally had some time to investigate this. It turns out that only the first keep alive call succeeds, the subsequence ones fail with the error below:

dbug: Hangfire.Azure.Queue.FetchedJob[0]
      Unable to execute keep-alive query for job [3b4adeac-1600-4e17-88b4-c702d14d50ab]
      System.AggregateException: One or more errors occurred. (Response status code does not indicate success: PreconditionFailed (412); Substatus: 0; ActivityId: 588e58bb-2fad-4d43-acdb-116a0aab26a8; Reason: (
      Errors : [
        "One of the specified pre-condition is not met. Learn more: https://aka.ms/CosmosDB/sql/errors/precondition-failed"
      ]
      );)
       ---> Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: PreconditionFailed (412); Substatus: 0; ActivityId: 588e58bb-2fad-4d43-acdb-116a0aab26a8; Reason: (
      Errors : [
        "One of the specified pre-condition is not met. Learn more: https://aka.ms/CosmosDB/sql/errors/precondition-failed"
      ]
      );
         at Microsoft.Azure.Cosmos.ResponseMessage.EnsureSuccessStatusCode()
         at Microsoft.Azure.Cosmos.CosmosResponseFactoryCore.ProcessMessage[T](ResponseMessage responseMessage, Func`2 createResponse)
         at Microsoft.Azure.Cosmos.ContainerCore.PatchItemAsync[T](String id, PartitionKey partitionKey, IReadOnlyList`1 patchOperations, ITrace trace, PatchItemRequestOptions requestOptions, CancellationToken cancellationToken)
         at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func`2 task, Func`2 openTelemetry, String operationName, RequestOptions requestOptions)
         at Microsoft.Azure.Cosmos.ClientContextCore.OperationHelperWithRootTraceAsync[TResult](String operationName, RequestOptions requestOptions, Func`2 task, Func`2 openTelemetry, TraceComponent traceComponent, TraceLevel traceLevel)
         at Hangfire.Azure.Helper.ClientHelper.ExecuteWithRetriesAsync[T](Task`1 task)
         at Hangfire.Azure.Helper.ClientHelper.ExecuteWithRetriesAsync[T](Task`1 task)
         at Hangfire.Azure.Helper.AsyncHelper.<>c__DisplayClass1_0`1.<<ExecuteSynchronously>b__0>d.MoveNext()
      --- Cosmos Diagnostics ---REDACTED
         --- End of inner exception stack trace ---
         at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
         at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
         at Hangfire.Azure.Helper.AsyncHelper.ExecuteSynchronously[T](Task`1 task)
         at Hangfire.Azure.Helper.ClientHelper.PatchItemWithRetries[T](Container container, String id, PartitionKey partitionKey, IReadOnlyList`1 patchOperations, PatchItemRequestOptions patchItemRequestOptions)         at Hangfire.Azure.Queue.FetchedJob.KeepAliveJobCallback(Object obj)

I think the etag does not get updated between invocations of the timer because it is taken from obj which as far as I can tell is not updated between invocations:

https://github.com/imranmomin/Hangfire.AzureCosmosDb/blob/0a11016f6b21428d4e95bc4657b429ce5c25d1b4/src/Queue/Entities/FetchedJob.cs#L122-L132

Not sure if there is any problem with getting the etag from data (and probably ignore obj all together), but that seems like it should fix this issue. I'm willing to submit a PR if you are open to it.

Cheers

imranmomin commented 1 year ago

When the last update happens the obj does not have the latest Etag. Yes we can ignore the obj and use the data directly.

Please open the PR

deanparkr commented 1 year ago

Experiencing this issue as well. Wondered when this is planned to go into a release?

imranmomin commented 1 year ago

Sorry, the package is now released

https://www.nuget.org/packages/Hangfire.AzureCosmosDB/2.0.6

imranmomin / Hangfire.AzureCosmosDb

Duplicate job runs after 15 mins of execution because `fetchedAt` is not getting updated #54