Jobs may get stuck in Enqueued state after app crash/restart

f1nzer commented 2 years ago

In an unstable environment where an application may crash or restart because of external issues, some jobs can hang and never move to the Processing state.

In my case there are 6 jobs in the Enqueued state, but I can't see them in the dashboard (only the count is displayed).

It looks like an item of type DocumentTypes.Queue was fetched by the JobQueue class and then the application crashed, or something similar. Here is the data from CosmosDb for one of the affected jobs:

SELECT * FROM doc WHERE doc.job_id = 'e713eaed-5529-4dac-bcda-6452879ed1eb' or doc.id = 'e713eaed-5529-4dac-bcda-6452879ed1eb'

[
    {
        "data": {
            "type": "#, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null",
            "method": "RunAsync",
            "parameterTypes": "[\"#, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null\"]",
            "arguments": "#"
        },
        "arguments": "#",
        "parameters": [
            {
                "name": "CurrentCulture",
                "value": "\"\""
            },
            {
                "name": "CurrentUICulture",
                "value": "\"\""
            }
        ],
        "created_on": 1655817700,
        "type": 2,
        "id": "e713eaed-5529-4dac-bcda-6452879ed1eb",
        "_rid": "qcwkAMDjJzg14gMAAAAAAA==",
        "_self": "dbs/qcwkAA==/colls/qcwkAMDjJzg=/docs/qcwkAMDjJzg14gMAAAAAAA==/",
        "_etag": "\"fa003a2b-0000-0d00-0000-62b1c5e40000\"",
        "_attachments": "attachments/",
        "state_id": "75103e38-6bd9-42a2-b7ae-254a121728b5",
        "state_name": "Enqueued",
        "_ts": 1655817700
    },
    {
        "job_id": "e713eaed-5529-4dac-bcda-6452879ed1eb",
        "name": "Enqueued",
        "created_on": 1655817700,
        "data": {
            "EnqueuedAt": "2022-06-21T13:21:40.1279900Z",
            "Queue": "default"
        },
        "type": 8,
        "id": "75103e38-6bd9-42a2-b7ae-254a121728b5",
        "_rid": "qcwkAMDjJzg44gMAAAAAAA==",
        "_self": "dbs/qcwkAA==/colls/qcwkAMDjJzg=/docs/qcwkAMDjJzg44gMAAAAAAA==/",
        "_etag": "\"fa00292b-0000-0d00-0000-62b1c5e40000\"",
        "_attachments": "attachments/",
        "_ts": 1655817700
    }
]

imranmomin commented 2 years ago

When a job is dequeued, its fetched_at is updated to the current UTC time. The queue document is removed only when the job completes and the RemoveFromQueue method is invoked.
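The lifecycle described above can be sketched roughly as follows. This is an in-memory illustration only, not the library's JobQueue: the class, method names and the "name" field are assumptions made for the example; only fetched_at and RemoveFromQueue come from the comment.

```python
import time
from typing import Dict, Optional


class QueueStore:
    """In-memory stand-in for queue documents (hypothetical, for illustration)."""

    def __init__(self) -> None:
        # One queue document per job id.
        self.docs: Dict[str, dict] = {}

    def enqueue(self, job_id: str, queue: str) -> None:
        self.docs[job_id] = {"job_id": job_id, "name": queue, "fetched_at": None}

    def dequeue(self, queue: str) -> Optional[str]:
        # Hand out the first unfetched document and stamp it; do NOT delete it.
        for doc in self.docs.values():
            if doc["name"] == queue and doc["fetched_at"] is None:
                doc["fetched_at"] = int(time.time())
                return doc["job_id"]
        return None

    def remove_from_queue(self, job_id: str) -> None:
        # Called only once the job has completed.
        self.docs.pop(job_id, None)
```

In this model, a worker that crashes between dequeue and remove_from_queue leaves the queue document behind with fetched_at set, so there would still be a queue document to find for such a job.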

My guess is that the job completed but then most likely failed while the other housekeeping tasks were being called, i.e. updating state, counters and so on.

If you can provide logs, we can look into it further.

f1nzer commented 2 years ago

Unfortunately, there are no Hangfire-related warnings or errors.

I have enabled additional logging to catch such problems in the future.

f1nzer commented 2 years ago

> My guess is that the job completed but then most likely failed while the other housekeeping tasks were being called, i.e. updating state, counters and so on.

Most likely the job (and its state) was stored, but the Queue entity was never created, probably because it was scheduled in CosmosDbWriteOnlyTransaction but, due to the app crash, was never executed (committed).

https://github.com/imranmomin/Hangfire.AzureCosmosDb/blob/6e6b0400f02b4cf490edd2e29f278892ccaf8afa/src/CosmosDbWriteOnlyTransaction.cs#L71-L84
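To make the hypothesis concrete, here is a minimal sketch of a transaction that buffers commands and runs them one by one on commit. It is not the code of CosmosDbWriteOnlyTransaction; BufferedTransaction, queue_command, simulate and the crash_after parameter are all hypothetical names, and it assumes the job/state write and the queue write are separate commands with no atomicity between them.

```python
from typing import Callable, Dict, List, Optional


class BufferedTransaction:
    """Hypothetical write-only transaction: commands run only on commit()."""

    def __init__(self) -> None:
        self._commands: List[Callable[[], None]] = []

    def queue_command(self, command: Callable[[], None]) -> None:
        self._commands.append(command)

    def commit(self, crash_after: Optional[int] = None) -> None:
        # Commands are applied sequentially; a crash part-way through
        # leaves the earlier writes in place and skips the rest.
        for index, command in enumerate(self._commands):
            if crash_after is not None and index >= crash_after:
                raise RuntimeError("simulated crash")
            command()


def simulate(crash_after: Optional[int]) -> Dict[str, dict]:
    db: Dict[str, dict] = {}
    tx = BufferedTransaction()
    tx.queue_command(lambda: db.update({"job": {"state_name": "Enqueued"}}))
    tx.queue_command(lambda: db.update({"queue": {"name": "default"}}))
    try:
        tx.commit(crash_after=crash_after)
    except RuntimeError:
        pass  # the process died; whatever was written stays written
    return db
```

Under these assumptions, simulate(1) ends with a job marked Enqueued and no queue entry, which matches the documents shown in the dump.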

I think the only thing I can do here (at least in my unstable environment) is to check for those "hung" jobs on app startup and then manually create Queue entities for them, but there is no queue name in those jobs to do that.
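A possible shape for that startup check, as a sketch only. The job document itself carries no queue name, but in the dump above the state document (type 8) has data.Queue set to "default", so the sketch assumes the queue name can be recovered by following state_id. The function name and the queued_job_ids argument are hypothetical; how queue documents are actually loaded is left out.

```python
from typing import Iterable, List, Optional, Set, Tuple


def find_orphaned_jobs(
    jobs: Iterable[dict],
    states: Iterable[dict],
    queued_job_ids: Set[str],
) -> List[Tuple[str, Optional[str]]]:
    """Return (job_id, queue_name) for Enqueued jobs that have no queue document."""
    states_by_id = {state["id"]: state for state in states}
    orphans: List[Tuple[str, Optional[str]]] = []
    for job in jobs:
        if job.get("state_name") != "Enqueued":
            continue
        if job["id"] in queued_job_ids:
            continue  # a queue document exists, nothing to repair
        # Assumption: the current state document records the target queue.
        state = states_by_id.get(job.get("state_id"), {})
        queue_name = state.get("data", {}).get("Queue")
        orphans.append((job["id"], queue_name))
    return orphans
```

Fed the two documents from the dump and an empty set of queued job ids, this returns the job id paired with "default". A queue name of None would mean the state document did not record one, in which case the job could not be re-queued automatically.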

imranmomin / Hangfire.AzureCosmosDb

Jobs may get stuck in Enqueued state after app crash/restart #49