Closed sanisoclem closed 1 year ago
@sanisoclem when a job is fetched, a callback will keep updating the fetched_at
property until the job is completed.
The invisibility timeout is for use case when the job stop due to some exception and so it can re-try after the timeout
You can further investigate, if the callback is doing its task https://github.com/imranmomin/Hangfire.AzureCosmosDb/blob/0a11016f6b21428d4e95bc4657b429ce5c25d1b4/src/Queue/Entities/FetchedJob.cs#L115-L154
Hi thanks for looking into this. I have checked directly in cosmos db, while fetched_at
a few seconds after the 'processing' document has been created but is not updated after that (it will be updated when it starts processing again).
I suspect its because the timer for the callback has an infinite interval.
if I try to emulate this locally using:
var s = DateTime.UtcNow;
var timer = new System.Threading.Timer(t =>
{
var n = DateTime.UtcNow - s;
Console.WriteLine($"+{Math.Round(n.TotalSeconds, 2)} - Timer invoked");
}, null, 1000, 5000);
await Task.Delay(100000);
I get the following output:
+1 - Timer invoked
+6.01 - Timer invoked
+11 - Timer invoked
+16 - Timer invoked
+21.01 - Timer invoked
And setting the last parameter to infinite only invokes the timer once.
@sanisoclem yes that is the idea to invoke once and schedule the next. This avoid multiple threads if one got delay to finish the callback
Apologies, I did not read that correctly. I'll have a look again - the fetched_at
field is not getting updated for me (from what I understand, this should be updated every 15 seconds by default, while the job is running).
I finally had some time to investigate this. It turns out that only the first keep alive call succeeds, the subsequence ones fail with the error below:
dbug: Hangfire.Azure.Queue.FetchedJob[0]
Unable to execute keep-alive query for job [3b4adeac-1600-4e17-88b4-c702d14d50ab]
System.AggregateException: One or more errors occurred. (Response status code does not indicate success: PreconditionFailed (412); Substatus: 0; ActivityId: 588e58bb-2fad-4d43-acdb-116a0aab26a8; Reason: (
Errors : [
"One of the specified pre-condition is not met. Learn more: https://aka.ms/CosmosDB/sql/errors/precondition-failed"
]
);)
---> Microsoft.Azure.Cosmos.CosmosException : Response status code does not indicate success: PreconditionFailed (412); Substatus: 0; ActivityId: 588e58bb-2fad-4d43-acdb-116a0aab26a8; Reason: (
Errors : [
"One of the specified pre-condition is not met. Learn more: https://aka.ms/CosmosDB/sql/errors/precondition-failed"
]
);
at Microsoft.Azure.Cosmos.ResponseMessage.EnsureSuccessStatusCode()
at Microsoft.Azure.Cosmos.CosmosResponseFactoryCore.ProcessMessage[T](ResponseMessage responseMessage, Func`2 createResponse)
at Microsoft.Azure.Cosmos.ContainerCore.PatchItemAsync[T](String id, PartitionKey partitionKey, IReadOnlyList`1 patchOperations, ITrace trace, PatchItemRequestOptions requestOptions, CancellationToken cancellationToken)
at Microsoft.Azure.Cosmos.ClientContextCore.RunWithDiagnosticsHelperAsync[TResult](ITrace trace, Func`2 task, Func`2 openTelemetry, String operationName, RequestOptions requestOptions)
at Microsoft.Azure.Cosmos.ClientContextCore.OperationHelperWithRootTraceAsync[TResult](String operationName, RequestOptions requestOptions, Func`2 task, Func`2 openTelemetry, TraceComponent traceComponent, TraceLevel traceLevel)
at Hangfire.Azure.Helper.ClientHelper.ExecuteWithRetriesAsync[T](Task`1 task)
at Hangfire.Azure.Helper.ClientHelper.ExecuteWithRetriesAsync[T](Task`1 task)
at Hangfire.Azure.Helper.AsyncHelper.<>c__DisplayClass1_0`1.<<ExecuteSynchronously>b__0>d.MoveNext()
--- Cosmos Diagnostics ---REDACTED
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at Hangfire.Azure.Helper.AsyncHelper.ExecuteSynchronously[T](Task`1 task)
at Hangfire.Azure.Helper.ClientHelper.PatchItemWithRetries[T](Container container, String id, PartitionKey partitionKey, IReadOnlyList`1 patchOperations, PatchItemRequestOptions patchItemRequestOptions) at Hangfire.Azure.Queue.FetchedJob.KeepAliveJobCallback(Object obj)
I think the etag does not get updated between invocations of the timer because it is taken from obj
which as far as I can tell is not updated between invocations:
Not sure if there is any problem with getting the etag from data
(and probably ignore obj
all together), but that seems like it should fix this issue. I'm willing to submit a PR if you are open to it.
Cheers
When the last update happens the obj
does not have the latest Etag
. Yes we can ignore the obj
and use the data
directly.
Please open the PR
Experiencing this issue as well. Wondered when this is planned to go into a release?
Sorry, the package is now released
When executing long running jobs, the job re-runs again after ~15 mins. After reading through the code a bit, it seems it is because it picks up the job again after the invisibility timeout of 15 minutes.
I don't see anyway to override this, is there any way around this?Update: The jobs are being run again because the
fetched_at
field is not getting updated because the latest etag value is not used. see: https://github.com/imranmomin/Hangfire.AzureCosmosDb/issues/54#issuecomment-1386891221Thanks