On environment with multiple hangfire servers. Recurring job sometimes failed to run

hquangthinh commented 3 years ago

I have 2 hangfire servers with many jobs backed by cosmosdb storage. The recurring job sometimes failed to run and the error trace is

2021-08-01 00:00:04.204 +00:00 [DBG] [RecurringJobScheduler #1] Execution loop RecurringJobScheduler:6610b164 caught an exception and will be retried in 00:05:00 System.ArgumentException: An item with the same key has already been added. Key: LastExecution at System.Collections.Generic.Dictionary2.TryInsert(TKey key, TValue value, InsertionBehavior behavior) at System.Linq.Enumerable.ToDictionary[TSource,TKey,TElement](IEnumerable1 source, Func2 keySelector, Func2 elementSelector, IEqualityComparer1 comparer) at Hangfire.RecurringJobExtensions.GetRecurringJob(IStorageConnection connection, String recurringJobId, ITimeZoneResolver timeZoneResolver, DateTime now) at Hangfire.Server.RecurringJobScheduler.TryEnqueueBackgroundJob(BackgroundProcessContext context, IStorageConnection connection, String recurringJobId, DateTime now) at Hangfire.Server.RecurringJobScheduler.<>c__DisplayClass18_0.b__0(IStorageConnection connection) at Hangfire.Server.RecurringJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func2 action) at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context) at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state) at Hangfire.Processing.BackgroundExecution.Run(Action2 callback, Object state) `

When this happened in the container of hangfire there're 2 items of same recurring job has the same field LastExecution

Running this query

SELECT COUNT(1) AS groupCount, c.key FROM c WHERE c.field = 'LastExecution' GROUP BY c.key

Will has the result

{ "groupCount": 2, "key": "recurring-job:Job_Name" }

The work around is to delete 1 record after running this query

select * from c where c.key = 'recurring-job:Job_Name' and c.field = 'LastExecution'

hquangthinh commented 3 years ago

Hi @imranmomin it seems a proper fix should be at CosmosDbDistributedLock but I have no idea yet another fix that I can make a pull request is at this method

`public override Dictionary<string, string> GetAllEntriesFromHash(string key) { if (key == null) throw new ArgumentNullException(nameof(key));

        return Storage.Container.GetItemLinqQueryable<Hash>(requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey((int)DocumentTypes.Hash) })
            .Where(h => h.DocumentType == DocumentTypes.Hash && h.Key == key)
            .Select(h => new { h.Field, h.Value })
            .ToQueryResult()
            .GroupBy(h => h.Field, StringComparer.OrdinalIgnoreCase)
            .ToDictionary(hg => hg.Key, hg => hg.First().Value);
    }`

do you think this fix is eligible?

imranmomin commented 3 years ago

@hquangthinh - is this the only instance you have where the same key existed. If so can you do the work around and delete the extra key.

I'm not sure if it is the case where multiple job server trying to add the same key.

hquangthinh commented 3 years ago

hi @imranmomin in my production I scale 2-5 hangfire servers each server hosted in console app in aks pod and this issue happen repeatedly I have a code that can reproduce this issue locally by creating 2-3 hangfire servers during startup

` public class Startup { public Startup(IConfiguration configuration) { Configuration = configuration; }

    public IConfiguration Configuration { get; }

    // This method gets called by the runtime. Use this method to add services to the container.
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddControllers();
        services.AddHangfireServer(optionsAction => 
        {
            optionsAction.WorkerCount = 20;
        });

        services.AddHangfireServer(optionsAction =>
        {
            optionsAction.ServerName = "Server2";
            optionsAction.WorkerCount = 25;
        });

        //services.AddHangfireServer(optionsAction =>
        //{
        //    optionsAction.ServerName = "Server3";
        //    optionsAction.WorkerCount = 25;
        //});

        services.AddScoped<ToDoService>();

        // cosmos url and key
        string url = "", secretkey = "", database = "", collection = "";

        var storageOptions = new CosmosDbStorageOptions
        {
            ExpirationCheckInterval = TimeSpan.FromMinutes(2),
            CountersAggregateInterval = TimeSpan.FromMinutes(2),
            QueuePollInterval = TimeSpan.FromSeconds(15)
        };

        var cosmoClientOptions = new CosmosClientOptions
        {
            RequestTimeout = TimeSpan.FromSeconds(60),
            ConnectionMode = ConnectionMode.Direct,
            MaxRetryAttemptsOnRateLimitedRequests = 3,
            // wait 30s before retry n times. n is set in MaxRetryAttemptsOnRateLimitedRequests which is only 3 in this case
            // Default value is 9 which can cause high usage in RU for cosmos
            MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
        };

        services.AddHangfire(o => {
            o.UseAzureCosmosDbStorage(url, secretkey, database, collection, cosmoClientOptions, storageOptions);
        });

        JobStorage.Current = new CosmosDbStorage(url, secretkey, database, collection, cosmoClientOptions, storageOptions);

    }

    // This method gets called by the runtime. Use this method to configure the HTTP request pipeline.
    public void Configure(IApplicationBuilder app, IWebHostEnvironment env, IServiceProvider serviceProvider)
    {
        if (env.IsDevelopment())
        {
            app.UseDeveloperExceptionPage();
        }

        app.UseHttpsRedirection();

        app.UseStaticFiles();

        app.UseHangfireDashboard();

        app.UseRouting();

        app.UseAuthorization();

        app.UseEndpoints(endpoints =>
        {
            endpoints.MapControllers();
        });

        RecurringJob.AddOrUpdate("TO_DO_TASK_JOB"
            , () => serviceProvider.GetRequiredService<ToDoService>().DoTask()
            , Cron.Minutely()
        );
    }
}

`

imranmomin commented 3 years ago

@hquangthinh - I have a build from develop. Let me know if this fixes the issue

https://www.nuget.org/packages/Hangfire.AzureCosmosDB/1.3.0-develop

dotnet add package Hangfire.AzureCosmosDB --version 1.3.0-develop

hquangthinh commented 3 years ago

@imranmomin I've tried the version 1.3.0-develop however the issue still occurs I have a repo here to reproduce with 1.3.0-develop https://github.com/hquangthinh/HangfireCosmosDB.WebAppHostTest

imranmomin commented 3 years ago

@hquangthinh - thank you. I will check the repo and try to reproduce the issue locally.

imranmomin commented 2 years ago

@hquangthinh - I did some fixes and improvement and I think it fixes the issue. I tried on your repo and it was running successfully without any error.

If you can pull /patch_api_update branch and test on side to confirm the fix

imranmomin / Hangfire.AzureCosmosDb

On environment with multiple hangfire servers. Recurring job sometimes failed to run #21