Hi @imranmomin, it seems the proper fix should be in CosmosDbDistributedLock, but I don't have an idea for that yet. Another fix that I could submit as a pull request is in this method:
```csharp
public override Dictionary<string, string> GetAllEntriesFromHash(string key)
{
    if (key == null) throw new ArgumentNullException(nameof(key));

    return Storage.Container.GetItemLinqQueryable<Hash>(requestOptions: new QueryRequestOptions { PartitionKey = new PartitionKey((int)DocumentTypes.Hash) })
        .Where(h => h.DocumentType == DocumentTypes.Hash && h.Key == key)
        .Select(h => new { h.Field, h.Value })
        .ToQueryResult()
        .GroupBy(h => h.Field, StringComparer.OrdinalIgnoreCase)
        .ToDictionary(hg => hg.Key, hg => hg.First().Value);
}
```
Do you think this fix is acceptable?
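For what it's worth, here is a minimal, self-contained sketch of the behaviour difference, using made-up field/value pairs instead of the library's Hash documents:

```csharp
using System;
using System.Linq;

// Two hash entries for the same recurring job, both carrying the "LastExecution" field
// (values are made up for illustration).
var entries = new[]
{
    new { Field = "LastExecution", Value = "2021-07-31T23:00:02Z" },
    new { Field = "LastExecution", Value = "2021-08-01T00:00:04Z" }
};

// A plain ToDictionary throws on the duplicate:
// System.ArgumentException: An item with the same key has already been added. Key: LastExecution
// var broken = entries.ToDictionary(e => e.Field, e => e.Value);

// The GroupBy form in the method above collapses duplicates and keeps the first value per field.
var deduped = entries
    .GroupBy(e => e.Field, StringComparer.OrdinalIgnoreCase)
    .ToDictionary(g => g.Key, g => g.First().Value);

Console.WriteLine(deduped["LastExecution"]); // 2021-07-31T23:00:02Z
```

One caveat: without an OrderBy, which duplicate First() keeps is arbitrary, so this hides the duplicate rather than deciding which write was the right one; that is probably why the root-cause fix belongs in CosmosDbDistributedLock.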
@hquangthinh - is this the only instance you have where the same key exists? If so, can you apply the workaround and delete the extra key?
I'm not sure whether this is a case of multiple job servers trying to add the same key.
Hi @imranmomin, in production I scale out 2-5 Hangfire servers, each hosted in a console app in an AKS pod, and this issue happens repeatedly. I have code that reproduces the issue locally by creating 2-3 Hangfire servers during startup:
```csharp
public class Startup
{
    public Startup(IConfiguration configuration)
    {
        Configuration = configuration;
    }

    public IConfiguration Configuration { get; }

    // This method gets called by the runtime. Use this method to add services to the container.
    public void ConfigureServices(IServiceCollection services)
    {
        services.AddControllers();

        services.AddHangfireServer(optionsAction =>
        {
            optionsAction.WorkerCount = 20;
        });

        services.AddHangfireServer(optionsAction =>
        {
            optionsAction.ServerName = "Server2";
            optionsAction.WorkerCount = 25;
        });

        //services.AddHangfireServer(optionsAction =>
        //{
        //    optionsAction.ServerName = "Server3";
        //    optionsAction.WorkerCount = 25;
        //});

        services.AddScoped<ToDoService>();

        // cosmos url and key
        string url = "", secretkey = "", database = "", collection = "";

        var storageOptions = new CosmosDbStorageOptions
        {
            ExpirationCheckInterval = TimeSpan.FromMinutes(2),
            CountersAggregateInterval = TimeSpan.FromMinutes(2),
            QueuePollInterval = TimeSpan.FromSeconds(15)
        };

        var cosmoClientOptions = new CosmosClientOptions
        {
            RequestTimeout = TimeSpan.FromSeconds(60),
            ConnectionMode = ConnectionMode.Direct,
            MaxRetryAttemptsOnRateLimitedRequests = 3,
            // wait 30s before retry n times. n is set in MaxRetryAttemptsOnRateLimitedRequests which is only 3 in this case
            // Default value is 9 which can cause high usage in RU for cosmos
            MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
        };

        services.AddHangfire(o =>
        {
            o.UseAzureCosmosDbStorage(url, secretkey, database, collection, cosmoClientOptions, storageOptions);
        });

        JobStorage.Current = new CosmosDbStorage(url, secretkey, database, collection, cosmoClientOptions, storageOptions);
    }

    // This method gets called by the runtime. Use this method to configure the HTTP request pipeline.
    public void Configure(IApplicationBuilder app, IWebHostEnvironment env, IServiceProvider serviceProvider)
    {
        if (env.IsDevelopment())
        {
            app.UseDeveloperExceptionPage();
        }

        app.UseHttpsRedirection();
        app.UseStaticFiles();
        app.UseHangfireDashboard();
        app.UseRouting();
        app.UseAuthorization();

        app.UseEndpoints(endpoints =>
        {
            endpoints.MapControllers();
        });

        RecurringJob.AddOrUpdate("TO_DO_TASK_JOB",
            () => serviceProvider.GetRequiredService<ToDoService>().DoTask(),
            Cron.Minutely());
    }
}
```
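The ToDoService registered above is just a simple scoped service with a DoTask method. A minimal stand-in along these lines (hypothetical; the real class is whatever the application registers) is enough to get the recurring job firing:

```csharp
using System;

// Hypothetical stand-in for the ToDoService referenced in the Startup above.
public class ToDoService
{
    public void DoTask()
    {
        Console.WriteLine($"TO_DO_TASK_JOB executed at {DateTime.UtcNow:O}");
    }
}
```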
@hquangthinh - I have a build from develop. Let me know if this fixes the issue.

https://www.nuget.org/packages/Hangfire.AzureCosmosDB/1.3.0-develop

```
dotnet add package Hangfire.AzureCosmosDB --version 1.3.0-develop
```
@imranmomin I've tried version 1.3.0-develop, but the issue still occurs. I have a repo here that reproduces it with 1.3.0-develop: https://github.com/hquangthinh/HangfireCosmosDB.WebAppHostTest
@hquangthinh - thank you. I will check the repo and try to reproduce the issue locally.
@hquangthinh - I did some fixes and improvements, and I think they resolve the issue. I tried your repo and it ran successfully without any errors.
If you can, pull the /patch_api_update branch and test on your side to confirm the fix.
I have 2 Hangfire servers with many jobs backed by Cosmos DB storage. A recurring job sometimes fails to run, and the error trace is:
```
2021-08-01 00:00:04.204 +00:00 [DBG] [RecurringJobScheduler #1] Execution loop RecurringJobScheduler:6610b164 caught an exception and will be retried in 00:05:00
System.ArgumentException: An item with the same key has already been added. Key: LastExecution
   at System.Collections.Generic.Dictionary`2.TryInsert(TKey key, TValue value, InsertionBehavior behavior)
   at System.Linq.Enumerable.ToDictionary[TSource,TKey,TElement](IEnumerable`1 source, Func`2 keySelector, Func`2 elementSelector, IEqualityComparer`1 comparer)
   at Hangfire.RecurringJobExtensions.GetRecurringJob(IStorageConnection connection, String recurringJobId, ITimeZoneResolver timeZoneResolver, DateTime now)
   at Hangfire.Server.RecurringJobScheduler.TryEnqueueBackgroundJob(BackgroundProcessContext context, IStorageConnection connection, String recurringJobId, DateTime now)
   at Hangfire.Server.RecurringJobScheduler.<>c__DisplayClass18_0.<...>(Func`2 action)
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context)
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state)
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state)
```

When this happened, the Hangfire container had 2 documents for the same recurring job, both with the field LastExecution.
Running this query

```sql
SELECT COUNT(1) AS groupCount, c.key FROM c WHERE c.field = 'LastExecution' GROUP BY c.key
```

will return the result

```json
{ "groupCount": 2, "key": "recurring-job:Job_Name" }
```
The workaround is to delete one of the two records after finding them with this query:

```sql
select * from c where c.key = 'recurring-job:Job_Name' and c.field = 'LastExecution'
```
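If it helps, here is a rough sketch of automating that cleanup with the Cosmos SDK. It assumes the Hash documents expose "id", "key" and "field" properties and live in the Hash partition, as in the queries above; treat it as a starting point, not the library's supported API:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

public static class DuplicateHashCleanup
{
    // Deletes all but one "LastExecution" hash document for the given recurring job key.
    public static async Task RemoveDuplicateLastExecutionAsync(
        Container container, string recurringJobKey, PartitionKey hashPartitionKey)
    {
        var query = new QueryDefinition(
                "SELECT c.id FROM c WHERE c.key = @key AND c.field = 'LastExecution'")
            .WithParameter("@key", recurringJobKey);

        var ids = new List<string>();
        var iterator = container.GetItemQueryIterator<IdOnly>(
            query, requestOptions: new QueryRequestOptions { PartitionKey = hashPartitionKey });

        while (iterator.HasMoreResults)
        {
            foreach (var item in await iterator.ReadNextAsync())
                ids.Add(item.id);
        }

        // Keep the first document returned (the choice is arbitrary) and delete the rest.
        foreach (var id in ids.Skip(1))
        {
            await container.DeleteItemAsync<IdOnly>(id, hashPartitionKey);
            Console.WriteLine($"Deleted duplicate LastExecution document {id}");
        }
    }

    // Minimal projection type; the document's "id" property is lowercase in Cosmos.
    private class IdOnly
    {
        public string id { get; set; }
    }
}
```

Usage would be something like `await DuplicateHashCleanup.RemoveDuplicateLastExecutionAsync(container, "recurring-job:Job_Name", new PartitionKey((int)DocumentTypes.Hash))`, with the partition key value matching whatever the GetAllEntriesFromHash snippet earlier in the thread uses.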