Closed: vadim-kupovykh closed this issue 7 months ago
Hi,
Please share additional info:
Hi, Sure
services.AddHangfire(globalConfiguration =>
{
    globalConfiguration.UsePostgreSqlStorage(Configuration.GetConnectionString(HangfireExtensions.HangfireDatabase));
    globalConfiguration.UseSerializerSettings(HangfireExtensions.HangfireJsonSerializerSettings);
    globalConfiguration.UseTagsWithPostgreSql();
});
services.AddHangfireServer(options =>
{
    options.WorkerCount = agentSettings.GetValue("WorkerCount", 15);
    options.Queues = targetQueues;
});
We have a few servers in GCP, and each server can have its own list of queues. Hangfire has a dedicated managed database in GCP. Database CPU is at 100%, memory around 50%.
We can add additional instances quickly. However, we hit this issue even with a single server.
Execution plan of that query
13 seconds of latency for an index scan, and it's slowly degrading. We have 60k jobs in the queue and Hangfire cannot start processing more than 5-10 jobs. Right now we are not even creating jobs, just trying to process the existing ones. So technically we cannot add servers for horizontal scaling, because the DB is dying. Do you have any ideas about what we are doing wrong?
It seems like the database feels better when the queue is smaller: latency improves as the queue shrinks.
Any suggestions? :)
Try to increase the polling interval for workers. There must be a setting for this in the AddHangfireServer or UsePostgreSqlStorage calls.
If you have lots of workers that constantly poll the db, but never get any jobs (e.g because their own queue is empty), this can create unnecessary stress.
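For reference, a hedged sketch of where that setting could live. This assumes the installed Hangfire.PostgreSql version exposes QueuePollInterval on PostgreSqlStorageOptions and an overload of UsePostgreSqlStorage that accepts it; verify both against your package version:

```csharp
// Sketch only: assumes PostgreSqlStorageOptions.QueuePollInterval is available
// in the installed Hangfire.PostgreSql version. A longer interval means each
// worker issues fewer polling queries against the jobqueue table.
services.AddHangfire(globalConfiguration =>
{
    globalConfiguration.UsePostgreSqlStorage(
        Configuration.GetConnectionString(HangfireExtensions.HangfireDatabase),
        new PostgreSqlStorageOptions
        {
            QueuePollInterval = TimeSpan.FromSeconds(30)
        });
});
```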
That is not the desired behavior. We have small jobs and all servers can process them. So technically, increasing the polling interval (15 seconds now) would mean workers are sometimes just doing nothing. Or am I wrong?
And here are transactions per second for the database. Take a look at the performance: the number of jobs in the queue decreases and performance grows, and in the end there are no jobs left. We are talking about 10k-50k jobs in the queue. Hangfire takes very long to process them because of the issues with fetching jobs; job durations are 5-50 ms.
That is not the desired behavior. We have small jobs and all servers can process them. So technically, increasing the polling interval (15 seconds now) would mean workers are sometimes just doing nothing. Or am I wrong?
Technically, yes; in practice, depending on which server enqueues the job, it can be picked up immediately by a worker in the same server instance because of an AutoResetEvent.
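Illustrative sketch of that signal-on-enqueue pattern (not Hangfire's actual code; all names are made up):

```csharp
using System;
using System.Threading;

// Illustrative only: shows why a locally-enqueued job can be picked up
// immediately even with a long poll interval. Not Hangfire's real code.
class SignalledQueue
{
    private readonly AutoResetEvent _signal = new AutoResetEvent(false);

    public void Enqueue(/* job */)
    {
        // ... INSERT the job row into jobqueue ...
        _signal.Set(); // wake one waiting worker on this server instance
    }

    public void WorkerLoop()
    {
        while (true)
        {
            // Wakes early on a local Set(); otherwise falls back to the
            // regular poll interval (15 s here).
            _signal.WaitOne(TimeSpan.FromSeconds(15));
            // ... try to fetch a job from the queue table ...
        }
    }
}
```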
Partial screenshot of the execution plan has little value. Can you provide the actual query plan from PG in JSON format instead? I can't open any of these images in full size, personally (404).
EDIT: Also, what is behind UseTagsWithPostgreSql()?
Generated a million rows in the table. Here are the results. Even with a million rows, the query runs in under 1 s.
That is my expectation too. :) I will try to share the JSON plan later; I can't extract it from GCP that easily. UseTagsWithPostgreSql enables the Tags functionality by creating an independent table. It doesn't affect the initial table or selects from jobqueue. OK, we will try increasing the polling interval.
I'd suggest trying manual vacuum on the table, if GCP allows that in some way. We've had some issues with autovacuum not doing the job so well in our business apps.
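If the managed instance permits it, something along these lines (plain SQL, assuming the default hangfire schema):

```sql
-- Manual vacuum + analyze on the hot table. VERBOSE reports how many dead
-- tuples were removed; a queue table with heavy UPDATE/DELETE churn bloats
-- quickly if autovacuum falls behind.
VACUUM (ANALYZE, VERBOSE) hangfire.jobqueue;
```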
Tried, didn't help :(
So, the reason is 100% related to the number of records in jobqueue. This subquery is the issue:
SELECT "id"
FROM "hangfire"."jobqueue"
WHERE "queue" = ANY (ARRAY['kafka-consumption'])
  AND ("fetchedat" IS NULL OR "fetchedat" < NOW() + INTERVAL '1 YEAR')
ORDER BY "fetchedat" NULLS FIRST, "queue", "jobid"
FOR UPDATE SKIP LOCKED
LIMIT 1;
Here is the plan for the query https://explain.dalibo.com/plan/ga35182eddag2f6h#
As I understand it, we have the problem because of FOR UPDATE SKIP LOCKED. Without FOR UPDATE SKIP LOCKED the plan is: https://explain.dalibo.com/plan/76gad6c1eec5bd4e But honestly, I don't understand why the plan is so different.
https://postgrespro.com/list/thread-id/2505440 Here a similar problem is discussed.
I quickly scanned through, and one recommendation is to introduce jitter into the polling threads to decrease contention.
Maybe there is something better; I need to read to the end.
Checked that thread. Interesting, but I don't understand what we can do. Will increasing the polling interval help us? Honestly, I don't understand how to use Postgres in production now. We are doing everything by best practices, yet additional workers fully destroy job-processing performance. Why does nobody else have this issue? Or is nobody using Postgres in production?
Plenty of production applications use both PostgreSQL and this package. Each environment has different capabilities, bandwidth, hardware and so on, so each environment can behave differently.
I personally have no experience with GCP so I can't really say anything about it.
Yeah, the topic above is not related to Hangfire or the current package, but to Postgres itself. That is what I don't understand: why do we have this issue but nobody else? What are we doing differently? :) I do see issues related to FOR UPDATE SKIP LOCKED processing in many threads. Any ideas what we can do about it?
Oh, yeah, it's not about Hangfire.PostgreSql storage provider, but it is about PostgreSQL itself, which the provider does use.
I have to admit I've not dug into the performance/indexing/etc. mechanisms in PG as I had no need for it. If GCP has a limited set of operations that you can perform on the database directly (again, no idea how GCP works), my suggestion is perhaps to try the Redis provider. At least according to the Hangfire docs it has a huge throughput increase compared to SQL (which makes sense, since Redis is an in-memory cache).
I believe you don't understand the severity of this issue. The topic above is not related to GCP, but yes, to something specific inside Postgres. Until we understand what it is and how to mitigate it, everyone using Hangfire with the current package is at risk of unpredictable performance degradation. It's a time bomb: it could happen once you try to add additional workers, or maybe not. Until we know what it is, the production risks are too high for everyone.
I agree; to an extent.
That could happen with or without this package, as the linked mailing list shows. Without a reproduction there's virtually no way to see what can or cannot happen.
I understand how you feel, being the only one who has reported the issue in this repo (I'm not saying you're the only one impacted; just the only one reporting). Unfortunately, you're the one who has a reproduction and the means to see whether any changes to the queries are valid. Since it's an OSS package, you can submit a pull request if there's a possible fix.
@vadim-kupovykh Can you provide a minimal runnable sample app that reproduces your issue, say within a Docker environment?
I may have some free time this weekend to try to come up with something.
I have only 2 ideas so far:
From what I understand, it's really about many workers hitting the same table: in essence, SKIP LOCKED should simply skip the locked rows (read: locked by other workers), but it feels like PostgreSQL can choke a little with many rows and many concurrent updates.
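A rough way to see the effect is with two plain psql sessions (illustrative sketch; table and columns as in the thread):

```sql
-- Session 1: a "worker" fetches and holds the first available row.
BEGIN;
SELECT "id" FROM "hangfire"."jobqueue"
ORDER BY "fetchedat" NULLS FIRST, "queue", "jobid"
LIMIT 1
FOR UPDATE SKIP LOCKED;
-- transaction intentionally left open, keeping the row lock

-- Session 2: has to skip past every row locked by other "workers" before it
-- finds a free one. With many workers and many rows, each fetch walks
-- further, and without a matching ordered index it also re-sorts the set.
BEGIN;
SELECT "id" FROM "hangfire"."jobqueue"
ORDER BY "fetchedat" NULLS FIRST, "queue", "jobid"
LIMIT 1
FOR UPDATE SKIP LOCKED;
COMMIT;
```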
I'm not convinced there's a good way to just drop the ordering, even if configurable. All jobs must be picked up in the order they were created.
But this gave me an idea - maybe we lack the ordered index for this particular query?
Probably not. At least, I see these 3 indexes that I think should work:
create index ix_hangfire_jobqueue_jobidandqueue
on jobqueue (jobid, queue);
create index ix_hangfire_jobqueue_queueandfetchedat
on jobqueue (queue, fetchedat);
create index jobqueue_queue_fetchat_jobid
on jobqueue (queue, fetchedat, jobid);
All jobs must be picked up in the order they were created
Can you point me where it is documented that Hangfire-compatible storage must ensure ordered job pickup?
I just checked the MSSQL Server implementation, and they do NOT order the jobs on dequeue (see https://github.com/HangfireIO/Hangfire/blob/main/src/Hangfire.SqlServer/SqlServerJobQueue.cs#L191-L201)
Here's their SQL query:
set nocount on;set xact_abort on;set tran isolation level read committed;
update top (1) JQ
set FetchedAt = GETUTCDATE()
output INSERTED.Id, INSERTED.JobId, INSERTED.Queue, INSERTED.FetchedAt
from [{_storage.SchemaName}].JobQueue JQ with (forceseek, readpast, updlock, rowlock)
where Queue in @queues and
(FetchedAt is null or FetchedAt < DATEADD(second, @timeoutSs, GETUTCDATE()));
The ordering question has been discussed here: https://github.com/HangfireIO/Hangfire/issues/398, and it seems that the correct way to maintain job dependencies is using job continuations.
I think it really depends on each consumer's use case whether ordering is important or not. Some users (myself included) suffered from the enforced ordering pattern. We had 4 queues and wanted them processed almost evenly, but it was not possible because workers would pick up jobs ordered by queue name.
Introducing an option to avoid sorting can be considered a feature that gives users even more control over their processing pipeline.
Ensuring ordering really puts a lot of stress on performance. I did a very crude investigation, but the results are easy to see: I generated 4 queues with 30,000 jobs each and ran the Hangfire server two times.
See the container CPU usage chart from Docker.
public async Task Delayed()
{}
@azygis I stand firm on my suggestion of a config flag that removes ordering in exchange for more performance. If you approve, I can try to submit a PR sometime next week.
@vadim-kupovykh do you think this tradeoff would be applicable for your case, at least?
Though, it may require additional tweaks to the Dashboard and other parts of the library. On the dashboard, I saw strange artifacts: the page with queue jobs would occasionally show "No jobs in queue" during active processing.
I also read this article. It basically says that FOR UPDATE SKIP LOCKED is the best way to do queues in Postgres, but performance will suffer with multiple workers.
https://www.2ndquadrant.com/en/blog/what-is-select-skip-locked-for-in-postgresql-9-5/
using Hangfire;
using Hangfire.PostgreSql;

string[] queues = { "alpha", "beta", "gamma", "test" };
bool generateJobs = true; // Toggle to switch between creating new jobs and processing them

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHangfire(configuration =>
{
    configuration.UsePostgreSqlStorage(options =>
    {
        options.UseNpgsqlConnection("User ID=postgres;Password=mysecretpassword;Host=localhost;Port=5432;Database=hangfire-postgres");
    });
});

if (!generateJobs)
{
    builder.Services.AddHangfireServer(options =>
    {
        options.Queues = queues;
        options.WorkerCount = 60;
    });
}

var app = builder.Build();

if (generateJobs)
{
    var client = app.Services.GetRequiredService<IBackgroundJobClientV2>();
    foreach (string queue in queues)
    {
        Task.Run(() =>
        {
            for (int i = 0; i < 30_000; i++)
            {
                client.Enqueue<Jobs>(queue, jobs => jobs.Delayed());
            }
        });
    }
}

app.UseHangfireDashboard("");
app.Run();

public class Jobs
{
    public async Task Delayed()
    {
    }
}
I compiled Hangfire.PostgreSql locally and basically removed the ORDER BY statement from here: https://github.com/hangfire-postgres/Hangfire.PostgreSql/blob/ffb935257cd006ec9e01df536ef98005904faec0/src/Hangfire.PostgreSql/PostgreSqlJobQueue.cs#L143
What if you specify the index order? Currently it's the default, which is ASC NULLS LAST. The query does a complete 180 with NULLS FIRST.
create index jobqueue_queue_fetchat_jobid
on jobqueue (fetchedat NULLS FIRST, queue, jobid);
Something like that?
Can you point me where it is documented that Hangfire-compatible storage must ensure ordered job pickup?
I cannot. It's a mistake made by many, and since the package has had this ordering for quite some time, removing it could be a breaking change for some people. Of course, with a default of "keep ordering" it's not a breaking change.
What if you specify the index order? Currently it's the default, which is ASC NULLS LAST. The query does a complete 180 with NULLS FIRST.
@azygis brilliant idea!
On my test data it works as fast as w/o ordering.
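For context: a default btree index is ASC NULLS LAST, while the query orders by "fetchedat" NULLS FIRST, so the planner cannot walk a default-ordered index in the requested direction. Whether the new index order is actually used can be checked with a plain EXPLAIN (sketch; run against the real table):

```sql
-- Note: EXPLAIN ANALYZE actually executes the query, so it will lock and
-- return one row; run it on a test copy if that matters.
EXPLAIN (ANALYZE, BUFFERS)
SELECT "id"
FROM "hangfire"."jobqueue"
WHERE "queue" = ANY (ARRAY['kafka-consumption'])
  AND ("fetchedat" IS NULL OR "fetchedat" < NOW() + INTERVAL '1 YEAR')
ORDER BY "fetchedat" NULLS FIRST, "queue", "jobid"
FOR UPDATE SKIP LOCKED
LIMIT 1;
-- A Sort node above the scan means the planner could not use any index in
-- the requested order and had to sort the matching rows itself.
```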
Guys, you are amazing. :) Going to test with the ordered index! Will update you later.
Lovely! Then we don't need any conditionals, I suppose?
Would you be able to look at the other indexes, in case order is needed for any of them?
Edit: @vadim-kupovykh yeah, it's kinda safe to do live. When a new version is deployed, the index will just get recreated.
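For anyone applying the fix by hand before a package update, a hypothetical migration sketch (the old index name is from the thread; the new name is made up; CONCURRENTLY must run outside a transaction):

```sql
-- Build the ordered index first, without blocking the busy jobqueue table.
CREATE INDEX CONCURRENTLY jobqueue_fetchedat_nulls_first
  ON hangfire.jobqueue (fetchedat ASC NULLS FIRST, queue, jobid);

-- Then drop the old default-ordered index.
DROP INDEX CONCURRENTLY IF EXISTS hangfire.jobqueue_queue_fetchat_jobid;
```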
Lovely! Then we don't need any conditionals, I suppose?
Yes, we don't. I tried to test two cases with regards to ordering: 1) reversed order by queue name, where the query has to pick the first jobs from the bottom of the queue; 2) evenly distributed jobs based on queue name.
I didn't see any significant difference between the two. Both cases run faster with the updated index. At least we did not make it worse.
Would you be able to look at the other indexes, in case order is needed for any of them?
I might take a look next weekend.
Right now I'm not even sure the other indexes on jobqueue are needed at all. I had a brief look at past commits, and there were performance improvements creating new indexes on that table, but the old ones were never touched.
Maybe it's time to do some cleanup.
@dmitry-vychikov I've checked on my side with just the index order applied: we no longer see the unexpected degradation. The query executes ten times faster. Please merge into master; we are ready to update ASAP on all our environments.
We are continuing to monitor that the issue does not reproduce, but here is the latency before and after applying the index order:
I hope we will not have the degradation anymore.
@dmitry-vychikov, @azygis thank you guys for the fast solution!
Published as 1.20.8.
Hi guys,
Here is the query we see in the Postgres logs:
We have performance issues on the Postgres side: sometimes up to 8 seconds of execution time for that query. We don't have that many jobs in the queue, up to 20K. Any ideas what it could be?