imranmomin / Hangfire.AzureCosmosDb

Azure Cosmos DB storage provider for Hangfire
https://www.hangfire.io/
MIT License

Old jobs aren't deleted #37

Closed dioptre closed 2 years ago

dioptre commented 2 years ago

We have a 36 hour TTL for the jobs but they aren't being deleted.

Is there anything we should do that's undocumented?

We are using .WithJobExpirationTimeout(TimeSpan.FromHours(36));

Thanks

dioptre commented 2 years ago

Just checking, @imranmomin: do you have any insight into this?

imranmomin commented 2 years ago

@dioptre - I think for some reason the expire_on is not being set on the job document.

SELECT * FROM c WHERE c.type = 2 AND NOT IS_DEFINED(doc.expire_on)

dpachla commented 2 years ago

@imranmomin

I'm working with @dioptre on this. Just wanted to mention that running this query as-is gives an error. I needed to change doc.expire_on to c.expire_on. When I run that, I get 0 results returned.

Let me provide more insight. Let's look at which items still exist where the c._ts date is '2022-04-01'. This is now past the 36 hours.

SELECT c.type, count(1) as cnt
FROM c
WHERE LEFT(TimestampToDateTime((c._ts - (420 * 60)) * 1000), 10) = '2022-04-01'
GROUP BY c.type
[
    {
        "type": 8,
        "cnt": 20
    },
    {
        "type": 2,
        "cnt": 10
    },
    {
        "type": 4,
        "cnt": 244719
    },
    {
        "type": 6,
        "cnt": 5
    }
]
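
For anyone reading the date expression in the query above: `_ts` is the Cosmos DB last-modified time in epoch seconds, `420 * 60` is a seven-hour (UTC-7, PDT) shift in seconds, and multiplying by 1000 gives the milliseconds that `TimestampToDateTime` expects. A minimal Python sketch of the same arithmetic follows; the function name `pdt_day` is purely illustrative and not part of any library.

```python
from datetime import datetime, timezone

PDT_OFFSET_SECONDS = 420 * 60  # 7 hours, matching the query's fixed UTC-7 shift


def pdt_day(ts_seconds: int) -> str:
    """Return the YYYY-MM-DD day for an epoch-seconds `_ts`, shifted to UTC-7."""
    shifted = ts_seconds - PDT_OFFSET_SECONDS
    # Interpret the shifted value as UTC so the machine's local time zone is not applied.
    return datetime.fromtimestamp(shifted, tz=timezone.utc).strftime("%Y-%m-%d")


# `_ts` taken from a stats document in this thread
print(pdt_day(1648810706))  # 2022-04-01
```

Like the query, this uses a fixed offset, so it is only right while PDT is in effect; during PST the shift would be 480 minutes.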

Right away, I can see that type 4 is the major offender. These appear to be stats. Is this not controlled by the same mechanism? I don't see an attribute/column for 'expire_on' for these. Do we need to write our own clean up of the stats? How do we turn it off completely? I'm not sure we are using this at all or want to, especially if it's leaving ~250k entries every day. Ideally, stats should end up in a different location than the items for the application for performance reasons.

{
        "key": "stats:succeeded",
        "value": 1,
        "counterType": 1,
        "type": 4,
        "id": "719a0938-3e1c-4c29-be50-a6986a07b9cd",
        "_rid": "DnBJAIR2F7VDMQEAAAAAAA==",
        "_self": "dbs/DnBJAA==/colls/DnBJAIR2F7U=/docs/DnBJAIR2F7VDMQEAAAAAAA==/",
        "_etag": "\"d201cb32-0000-0300-0000-6246dad20000\"",
        "_attachments": "attachments/",
        "_ts": 1648810706
}

Type 6 items are ok. Those refer to a recurring job that was created on this date.

Type 2 items look like the root message/job to execute. You'll notice the expire_on for these is set for 30 days past the created_on.

    {
        "data": {
            "type": "Sourcetable.Domain.Mediation.MediatedMessageConsumer, Sourcetable.Domain.Mediation",
            "method": "ProcessAsync",
            "parameterTypes": "[\"Sourcetable.Messaging.Abstractions.Message, Sourcetable.Messaging.Abstractions\"]",
            "arguments": "[\"{\\\"Id\\\":\\\"8ae429b6d3c840298ce4511a16a0bc0b\\\",\\\"Body\\\":{\\\"$type\\\":\\\"Sourcetable.Domain.DataApi.FivetranSyncTable, Sourcetable.Domain.DataApi\\\",\\\"TableId\\\":\\\"e93c5e4b9e3a4c42aeddd6cc9d5c6583\\\",\\\"OrganizationId\\\":\\\"dfdfbd7bb12a49ef924c3e8614037b3c\\\"},\\\"MaxAttempts\\\":3,\\\"State\\\":{\\\"$type\\\":\\\"Sourcetable.Identity.Abstractions.SourcetableUserAuthState, Sourcetable.Identity.Abstractions\\\",\\\"RequestId\\\":\\\"1d102c8425f64cb49f312dabb224377a\\\",\\\"OriginatingRequestId\\\":\\\"ad81f647db26416f9bb7b426c0e8fb6a\\\",\\\"WorkspaceIds\\\":[],\\\"OrganizationIds\\\":[],\\\"Roles\\\":[]}}\"]"
        },
        "arguments": "[\"{\\\"Id\\\":\\\"8ae429b6d3c840298ce4511a16a0bc0b\\\",\\\"Body\\\":{\\\"$type\\\":\\\"Sourcetable.Domain.DataApi.FivetranSyncTable, Sourcetable.Domain.DataApi\\\",\\\"TableId\\\":\\\"e93c5e4b9e3a4c42aeddd6cc9d5c6583\\\",\\\"OrganizationId\\\":\\\"dfdfbd7bb12a49ef924c3e8614037b3c\\\"},\\\"MaxAttempts\\\":3,\\\"State\\\":{\\\"$type\\\":\\\"Sourcetable.Identity.Abstractions.SourcetableUserAuthState, Sourcetable.Identity.Abstractions\\\",\\\"RequestId\\\":\\\"1d102c8425f64cb49f312dabb224377a\\\",\\\"OriginatingRequestId\\\":\\\"ad81f647db26416f9bb7b426c0e8fb6a\\\",\\\"WorkspaceIds\\\":[],\\\"OrganizationIds\\\":[],\\\"Roles\\\":[]}}\"]",
        "parameters": [
            {
                "name": "CurrentCulture",
                "value": "\"\""
            },
            {
                "name": "CurrentUICulture",
                "value": "\"\""
            }
        ],
        "created_on": 1648815509,
        "type": 2,
        "id": "35259346-b060-4a13-83a0-7b345f676667",
        "expire_on": 1651407509,
        "_rid": "DnBJAIR2F7WsSwIAAAAAAA==",
        "_self": "dbs/DnBJAA==/colls/DnBJAIR2F7U=/docs/DnBJAIR2F7WsSwIAAAAAAA==/",
        "_etag": "\"d501821f-0000-0300-0000-6246ed950000\"",
        "_attachments": "attachments/",
        "_ts": 1648815509
    }
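
The 30-day figure can be checked directly from the two timestamps in the document above. This is only a small arithmetic sketch in Python; the variable names are illustrative.

```python
from datetime import timedelta

created_on = 1648815509  # "created_on" from the job document above
expire_on = 1651407509   # "expire_on" from the job document above

lifetime = timedelta(seconds=expire_on - created_on)
configured = timedelta(hours=36)  # the value passed to WithJobExpirationTimeout

print(lifetime.days)           # 30
print(lifetime == configured)  # False
```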

Type 8 items are related to the actions taken against the type 2 items. Those will be deleted when the type 2 items are deleted. However, there are a couple that don't have a corresponding existing type 2 item.

The questions that need answering are:

  1. Why is the expire_on value set to 30 days when we have an overall value set to 36 hours? The rest of the items for the same type of jobs are being cleaned up correctly after the 36 hours.
  2. How are there dangling type 8 jobs with no corresponding type 2?

Thanks for the help!!!

imranmomin commented 2 years ago

@dpachla @dioptre

Thank you for the data.

Whenever a job is created, its expiration defaults to 30 days.

https://github.com/HangfireIO/Hangfire/blob/master/src/Hangfire.Core/Client/CoreBackgroundJobFactory.cs#L65-L76

Once the job completes or changes state, Hangfire.Core sets a new expiration based on the configuration. It looks like the issue is in my library, which does not apply the new expiry.

Good news: I have been working on a couple of fixes, and this issue is among them. I will soon be releasing the new package, v2.0.0. However, it will only fix new jobs that are created. For old data, I think you will have to run a query to fix the expire_on.
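
As a rough illustration of what such a one-off fix for old data might compute, here is a Python sketch that works on plain dictionaries. It is not the library's code and it makes several assumptions: that job documents carry a `state_name` field holding Hangfire's state name (the field appears in a query later in this thread), that only finished jobs should be shortened, and that the new expiry is counted from the time the fix runs. The function name `corrected_expire_on` is hypothetical, and reading or writing the documents through a Cosmos DB SDK is left out.

```python
from typing import Optional

JOB_TYPE = 2                      # job documents, per the queries in this thread
RETENTION_SECONDS = 36 * 60 * 60  # the 36-hour timeout discussed in this thread
# Hangfire's final states; how the provider stores the name is an assumption.
FINAL_STATES = {"Succeeded", "Deleted"}


def corrected_expire_on(doc: dict, now: int) -> Optional[int]:
    """Return a shorter expire_on for a finished job, or None to leave it alone."""
    if doc.get("type") != JOB_TYPE or doc.get("state_name") not in FINAL_STATES:
        return None
    target = now + RETENTION_SECONDS
    current = doc.get("expire_on")
    # Only shorten an expiry; never push back one that is already sooner.
    if current is not None and current <= target:
        return None
    return target
```

Jobs that are still enqueued, scheduled, or processing are skipped on purpose, so a fix along these lines would not expire work that has not run yet.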

Regarding the stats: they should all be summarized, deleted, and moved into counterType: 2.


{
        "key": "stats:succeeded",
        "value": 1,
        "counterType": 2,
        "type": 4,
        "id": "stats:succeeded",
        "_rid": "DnBJAIR2F7VXMQEAAAAAAA==",
        "_self": "dbs/DnBJAA==/colls/DnBJAIR2F7U=/docs/DnBJAIR2F7VXMQEAAAAAAA==/",
        "_etag": "\"d201cb32-0000-0300-0000-6246dad2000X\"",
        "_attachments": "attachments/",
        "_ts": 1648810707
}
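
To make the summarizing step concrete, here is a minimal Python sketch that folds counter documents into one aggregate per key. It is not the library's actual implementation; the document shape is inferred only from the two counter documents shown in this thread, and the function name `summarize` is illustrative.

```python
from collections import defaultdict

COUNTER_TYPE = 4  # "type" of counter documents in this thread
AGGREGATE = 2     # "counterType" of the summarized document


def summarize(counters: list) -> list:
    """Collapse counter documents into one aggregate document per key."""
    totals = defaultdict(int)
    for doc in counters:
        if doc.get("type") == COUNTER_TYPE:
            # Raw entries and any existing aggregate both feed the running total.
            totals[doc["key"]] += doc["value"]
    return [
        {"id": key, "key": key, "value": total,
         "counterType": AGGREGATE, "type": COUNTER_TYPE}
        for key, total in totals.items()
    ]


raw = [{"key": "stats:succeeded", "value": 1, "counterType": 1, "type": 4}] * 3
print(summarize(raw)[0]["value"])  # 3
```

After a pass like this the raw entries would be deleted, leaving a single document per key.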

imranmomin commented 2 years ago

Version 2.0.0 has been released

dotnet add package Hangfire.AzureCosmosDB --version 2.0.0

imranmomin commented 2 years ago

I hope this new version will fix the issue and bring more stability

dpachla commented 2 years ago

Thank you!

dpachla commented 2 years ago

Here is the query to check the daily counts for PDT.

SELECT c.state_name, LEFT(TimestampToDateTime((c._ts - (420 * 60)) * 1000), 10) as day, count(1) as cnt
FROM c
GROUP BY c.state_name, LEFT(TimestampToDateTime((c._ts - (420 * 60)) * 1000), 10)

dpachla commented 2 years ago

After clearing the prod Hangfires container on 4/27, everything is working as expected.