Closed: vadim-kupovykh closed this issue 7 months ago
Hi,
Please share additional info:
Hi, Sure
services.AddHangfire(globalConfiguration =>
{
    globalConfiguration.UsePostgreSqlStorage(Configuration.GetConnectionString(HangfireExtensions.HangfireDatabase));
    globalConfiguration.UseSerializerSettings(HangfireExtensions.HangfireJsonSerializerSettings);
    globalConfiguration.UseTagsWithPostgreSql();
});
services.AddHangfireServer(options =>
{
    options.WorkerCount = agentSettings.GetValue("WorkerCount", 15);
    options.Queues = targetQueues;
});
We have a few servers in GCP, and each server can have its own list of queues. Hangfire has a dedicated managed database in GCP. Database CPU is at 100%, memory around 50%.
We can add additional instances quickly. However, we hit this issue even with a single server.
Execution plan of that query
13 seconds of latency for an index scan, and it's slowly degrading. We have 60k jobs in the queue and Hangfire cannot start processing more than 5-10 jobs. Right now we are not even creating jobs, just trying to process the existing ones. So technically we cannot add servers for horizontal scaling, because the DB is dying. Do you have any ideas about what we are doing wrong?
It seems like the database feels better when the queue is smaller: latency improves as the queue shrinks.
Any suggestions? :)
Try to increase the polling interval for workers. There must be a setting for this in the AddHangfireServer or UsePostgreSqlStorage calls.
If you have lots of workers that constantly poll the db, but never get any jobs (e.g because their own queue is empty), this can create unnecessary stress.
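For reference, a hedged sketch of where that setting could live. This assumes the installed Hangfire.PostgreSql version exposes QueuePollInterval on PostgreSqlStorageOptions and an overload of UsePostgreSqlStorage that accepts it; verify both against your package version:

```csharp
// Sketch only: assumes PostgreSqlStorageOptions.QueuePollInterval is available
// in the installed Hangfire.PostgreSql version. A longer interval means each
// worker issues fewer polling queries against the jobqueue table.
services.AddHangfire(globalConfiguration =>
{
    globalConfiguration.UsePostgreSqlStorage(
        Configuration.GetConnectionString(HangfireExtensions.HangfireDatabase),
        new PostgreSqlStorageOptions
        {
            QueuePollInterval = TimeSpan.FromSeconds(30)
        });
});
```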
That is not the desired behavior. We have small jobs and all servers can process them. So technically, increasing the polling interval (15 seconds now) would mean workers are sometimes just doing nothing. Or am I wrong?
And here are transactions per second for the database. Take a look at the performance: the number of jobs in the queue decreases and performance grows, and in the end there are no jobs left. We are talking about 10k-50k jobs in the queue. Hangfire takes very long to process them because of the issues with fetching jobs; job durations are 5-50 ms.
That is not the desired behavior. We have small jobs and all servers can process them. So technically, increasing the polling interval (15 seconds now) would mean workers are sometimes just doing nothing. Or am I wrong?
Technically, yes; in practice, depending on which server enqueues the job, it can be picked up immediately by a worker in the same server instance because of an AutoResetEvent.
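Illustrative sketch of that signal-on-enqueue pattern (not Hangfire's actual code; all names are made up):

```csharp
using System;
using System.Threading;

// Illustrative only: shows why a locally-enqueued job can be picked up
// immediately even with a long poll interval. Not Hangfire's real code.
class SignalledQueue
{
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);

    public void Enqueue(/* job */)
    {
        // ... INSERT the job row into jobqueue ...
        _signal.Set(); // wake one waiting worker on this server instance
    }

    public void WorkerLoop()
    {
        while (true)
        {
            // Wakes early on a local Set(); otherwise falls back to the
            // regular poll interval (15 s here).
            _signal.WaitOne(TimeSpan.FromSeconds(15));
            // ... try to fetch a job from the queue table ...
        }
    }
}
```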
Partial screenshot of the execution plan has little value. Can you provide the actual query plan from PG in JSON format instead? I can't open any of these images in full size, personally (404).
EDIT: Also, what is behind UseTagsWithPostgreSql()?
Generated a million rows in the table. Here are the results. Even with a million rows, the query runs in under 1 s.
That is my expectation too. :) I will try to share the JSON plan later; I can't extract it from GCP that easily. UseTagsWithPostgreSql enables the Tags functionality by creating an independent table. It doesn't affect the initial table or selects from jobqueue. OK, we will try increasing the polling interval.
I'd suggest trying manual vacuum on the table, if GCP allows that in some way. We've had some issues with autovacuum not doing the job so well in our business apps.
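If the managed instance permits it, something along these lines (plain SQL, assuming the default hangfire schema):

```sql
-- Manual vacuum + analyze on the hot table. VERBOSE reports how many dead
-- tuples were removed; a queue table with heavy UPDATE/DELETE churn bloats
-- quickly if autovacuum falls behind.
VACUUM (ANALYZE, VERBOSE) hangfire.jobqueue;
```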
Tried, didn't help :(
So, the reason is 100% related to the number of records in jobqueue. This subquery is the issue:
SELECT "id"
FROM "hangfire"."jobqueue"
WHERE "queue" = ANY (ARRAY['kafka-consumption'])
  AND ("fetchedat" IS NULL OR "fetchedat" < NOW() + INTERVAL '1 YEAR')
ORDER BY "fetchedat" NULLS FIRST, "queue", "jobid"
FOR UPDATE SKIP LOCKED
LIMIT 1;
Here is the plan for the query https://explain.dalibo.com/plan/ga35182eddag2f6h#
As I understand it, we have the problem because of FOR UPDATE SKIP LOCKED. Without FOR UPDATE SKIP LOCKED the plan is: https://explain.dalibo.com/plan/76gad6c1eec5bd4e But honestly, I don't understand why the plan is so different.
https://postgrespro.com/list/thread-id/2505440 Here a similar problem is discussed.
I quickly scanned through, and one recommendation is to introduce jitter into the polling threads to decrease contention.
Maybe there is something better; I need to read to the end.
Checked that thread. Interesting, but I don't understand what we can do. Will increasing the polling interval help us? Honestly, I don't understand how to use Postgres in production now. We are doing everything by best practices, yet additional workers fully destroy job-processing performance. Why does nobody else have this issue? Or is nobody using Postgres in production?
Plenty of production applications use both PostgreSQL and this package. Each environment has different capabilities, bandwidth, hardware and so on, so each environment can behave differently.
I personally have no experience with GCP so I can't really say anything about it.
Yeah, the topic above is not related to Hangfire or the current package, but to Postgres itself. That is what I don't understand: why do we have this issue but nobody else? What are we doing differently? :) I do see issues related to FOR UPDATE SKIP LOCKED processing in many threads. Any ideas what we can do about it?
Oh, yeah, it's not about Hangfire.PostgreSql storage provider, but it is about PostgreSQL itself, which the provider does use.
I have to admit I've not dug into the performance/indexing/etc. mechanisms in PG as I had no need for it. If GCP has a limited set of operations that you can perform on the database directly (again, no idea how GCP works), my suggestion is perhaps to try the Redis provider. At least according to the Hangfire docs it has a huge throughput increase compared to SQL (which makes sense, since Redis is an in-memory cache).
I believe you don't understand the severity of this issue. The topic above is not related to GCP, but yes, to something specific inside Postgres. Until we understand what it is and how to mitigate it, everyone using Hangfire with the current package is at risk of unpredictable performance degradation. It's a time bomb: it could happen once you try to add additional workers, or maybe not. Until we know what it is, the production risks are too high for everyone.
I agree; to an extent.
That could happen with or without this package, as the linked mailing list shows. Without a reproduction there's virtually no way to see what can or cannot happen.
I understand how you feel, being the only one who has reported the issue in this repo (I'm not saying you're the only one impacted; just the only one reporting). Unfortunately, you're the one who has a reproduction and the means to see whether any changes to the queries are valid. Since it's an OSS package, you can submit a pull request if there's a possible fix.
@vadim-kupovykh Can you provide a minimal runnable sample app that reproduces your issue, say within a Docker environment?
I may have some free time this weekend to try to come up with something.
I have only 2 ideas so far:
From what I understand, it's really about many workers hitting the same table: in essence, SKIP LOCKED should simply skip the locked rows (read: locked by other workers), but it feels like PostgreSQL can choke a little with many rows and many concurrent updates.
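A rough way to see the effect is with two plain psql sessions (illustrative sketch; table and columns as in the thread):

```sql
-- Session 1: a "worker" fetches and holds the first available row.
BEGIN;
SELECT "id" FROM "hangfire"."jobqueue"
ORDER BY "fetchedat" NULLS FIRST, "queue", "jobid"
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- transaction intentionally left open, keeping the row lock

-- Session 2: has to skip past every row locked by other "workers" before it
-- finds a free one. With many workers and many rows, each fetch walks
-- further, and without a matching ordered index it also re-sorts the set.
BEGIN;
SELECT "id" FROM "hangfire"."jobqueue"
ORDER BY "fetchedat" NULLS FIRST, "queue", "jobid"
LIMIT 1
FOR UPDATE SKIP LOCKED;
COMMIT;
```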
I'm not convinced there's a good way to just drop the ordering, even if configurable. All jobs must be picked up in the order they were created.
But this gave me an idea - maybe we lack the ordered index for this particular query?
Probably not. At least, I see these 3 indexes that I think should work:
create index ix_hangfire_jobqueue_jobidandqueue
on jobqueue (jobid, queue);
create index ix_hangfire_jobqueue_queueandfetchedat
on jobqueue (queue, fetchedat);
create index jobqueue_queue_fetchat_jobid
on jobqueue (queue, fetchedat, jobid);
All jobs must be picked up in the order they were created
Can you point me where it is documented that Hangfire-compatible storage must ensure ordered job pickup?
I just checked the MSSQL Server implementation, and they do NOT order the jobs on dequeue (see https://github.com/HangfireIO/Hangfire/blob/main/src/Hangfire.SqlServer/SqlServerJobQueue.cs#L191-L201)
Here's their SQL query:
set nocount on;set xact_abort on;set tran isolation level read committed;
update top (1) JQ
set FetchedAt = GETUTCDATE()
output INSERTED.Id, INSERTED.JobId, INSERTED.Queue, INSERTED.FetchedAt
from [{_storage.SchemaName}].JobQueue JQ with (forceseek, readpast, updlock, rowlock)
where Queue in @queues and
(FetchedAt is null or FetchedAt < DATEADD(second, @timeoutSs, GETUTCDATE()));
The ordering question has been discussed here: https://github.com/HangfireIO/Hangfire/issues/398, and it seems that the correct way to maintain job dependencies is using job continuations.
I think it really depends on each consumer's use case whether ordering is important or not. Some users (myself included) suffered from the enforced ordering pattern. We had 4 queues and wanted them processed almost evenly, but it was not possible because workers would pick up jobs ordered by queue name.
Introducing an option to avoid sorting can be considered a feature that gives users even more control over their processing pipeline.
Ensuring ordering really puts a lot of stress on performance. I did a very crude investigation, but the results are easy to see: I generated 4 queues with 30,000 jobs each and ran the Hangfire server two times.
See the container CPU usage chart from Docker.
public async Task Delayed()
{}
@azygis I stand firm on my suggestion of a config flag that removes ordering in exchange for more performance. If you approve, I can try to submit a PR sometime next week.
@vadim-kupovykh do you think this tradeoff would be applicable for your case, at least?
Though, it may require additional tweaks to the Dashboard and other parts of the library. On the dashboard, I saw strange artifacts: the page with queue jobs would occasionally show "No jobs in queue" during active processing.
I also read this article. It basically says that FOR UPDATE SKIP LOCKED is the best way to do queues in Postgres, but performance will suffer with multiple workers.
https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/
using Hangfire;
using Hangfire.PostgreSql;

string[] queues = { "alpha", "beta", "gamma", "test" };
bool generateJobs = true; // Toggle to switch between creating new jobs and processing them

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHangfire(configuration =>
{
    configuration.UsePostgreSqlStorage(options =>
    {
        options.UseNpgsqlConnection("User ID=postgres;Password=mysecretpassword;Host=localhost;Port=5432;Database=hangfire-postgres");
    });
});

if (!generateJobs)
{
    builder.Services.AddHangfireServer(options =>
    {
        options.Queues = queues;
        options.WorkerCount = 60;
    });
}

var app = builder.Build();

if (generateJobs)
{
    var client = app.Services.GetRequiredService<IBackgroundJobClientV2>();
    foreach (string queue in queues)
    {
        Task.Run(() =>
        {
            for (int i = 0; i < 30_000; i++)
            {
                client.Enqueue<Jobs>(queue, jobs => jobs.Delayed());
            }
        });
    }
}

app.UseHangfireDashboard("");
app.Run();

public class Jobs
{
    public async Task Delayed()
    {
    }
}
I compiled Hangfire.PostgreSql locally and basically removed the ORDER BY statement from here: https://github.com/hangfire-postgres/Hangfire.PostgreSql/blob/ffb935257cd006ec9e01df536ef98005904faec0/src/Hangfire.PostgreSql/PostgreSqlJobQueue.cs#L143
What if you specify the index order? Currently it's the default, which is ASC NULLS LAST. The query does a complete 180 with NULLS FIRST.
create index jobqueue_queue_fetchat_jobid
on jobqueue (fetchedat NULLS FIRST, queue, jobid);
Something like that?
Can you point me where it is documented that Hangfire-compatible storage must ensure ordered job pickup?
I cannot. It's a mistake made by many, and since the package has had this ordering for quite some time, removing it could be a breaking change for some people. Of course, with a default of "keep ordering" it's not a breaking change.
What if you specify the index order? Currently it's the default, which is ASC NULLS LAST. The query does a complete 180 with NULLS FIRST.
@azygis brilliant idea!
On my test data it works as fast as w/o ordering.
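For context: a default btree index is ASC NULLS LAST, while the query orders by "fetchedat" NULLS FIRST, so the planner cannot walk a default-ordered index in the requested direction. Whether the new index order is actually used can be checked with a plain EXPLAIN (sketch; run against the real table):

```sql
-- Note: EXPLAIN ANALYZE actually executes the query, so it will lock and
-- return one row; run it on a test copy if that matters.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "id"
FROM "hangfire"."jobqueue"
WHERE "queue" = ANY (ARRAY['kafka-consumption'])
  AND ("fetchedat" IS NULL OR "fetchedat" < NOW() + INTERVAL '1 YEAR')
ORDER BY "fetchedat" NULLS FIRST, "queue", "jobid"
FOR UPDATE SKIP LOCKED
LIMIT 1;
-- A Sort node above the scan means the planner could not use any index in
-- the requested order and had to sort the matching rows itself.
```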
Guys, you are amazing. :) Going to test with the ordered index! Will update you later.
Lovely! Then we don't need any conditionals, I suppose?
Would you be able to look at the other indexes, in case order is needed for any of them?
Edit: @vadim-kupovykh yeah, it's kinda safe to do live. When a new version is deployed, the index will just get recreated.
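For anyone applying the fix by hand before a package update, a hypothetical migration sketch (the old index name is from the thread; the new name is made up; CONCURRENTLY must run outside a transaction):

```sql
-- Build the ordered index first, without blocking the busy jobqueue table.
CREATE INDEX CONCURRENTLY jobqueue_fetchedat_nulls_first
  ON hangfire.jobqueue (fetchedat ASC NULLS FIRST, queue, jobid);

-- Then drop the old default-ordered index.
DROP INDEX CONCURRENTLY IF EXISTS hangfire.jobqueue_queue_fetchat_jobid;
```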
Lovely! Then we don't need any conditionals, I suppose?
Yes, we don't. I tried to test two cases with regards to ordering: 1) reversed order by queue name, where the query has to pick the first jobs from the bottom of the queue; 2) evenly distributed jobs based on queue name.
I didn't see any significant difference between the two. Both cases run faster with the updated index. At least we did not make it worse.
Would you be able to look at the other indexes, in case order is needed for any of them?
I might take a look next weekend.
Right now I'm not even sure the other indexes on jobqueue are needed at all. I had a brief look at past commits, and there were performance improvements creating new indexes on that table, but the old ones were never touched.
Maybe it's time to do some cleanup.
@dmitry-vychikov I've checked on my side with just the index order applied: we no longer see the unexpected degradation. The query executes ten times faster. Please merge into master; we are ready to update ASAP on all our environments.
We are continuing to monitor that the issue does not reproduce, but here is the latency before and after applying the index order:
I hope we will not have the degradation anymore.
@dmitry-vychikov, @azygis thank you guys for the fast solution!
Published as 1.20.8.
Hi guys,
Here is the query we see in the Postgres logs:
We have performance issues on the Postgres side: sometimes up to 8 seconds of execution time for that query. We don't have that many jobs in the queue, up to 20K. Any ideas what it could be?