Hangfire dashboard - maxed out connection pools

rizwanja commented 4 years ago

When configuring Hangfire in our ASP.NET Core app, we see dozens of idle connections in postgres.

We're using MassTransit with our server worker count set to 1. With four apps with this configuration and MassTransit using 5 connections each, we should expect to see around 20 connections or so for Hangfire. However, we see our connection pool maxed out at 100 connections, mostly idle Hangfire ones.

We had initially thought this might have been happening while configuring the Hangfire dashboard that was not the case. Once the service is run, the idle connections continue to increase until they run out.

Note that we have a Hangfire DB for each service that has a Hangfire server configured.

We tried using both v1.7.0 and v1.7.1 of the Nuget package with the same issue.

Relevant Startup.cs Services config sections:

services.AddHangfire(configuration => configuration
                   .SetDataCompatibilityLevel(CompatibilityLevel.Version_170)
                   .UseSimpleAssemblyNameTypeSerializer()
                   .UseRecommendedSerializerSettings()
                   .UsePostgreSqlStorage(Configuration["Hangfire:ConnectionString"]));

services.AddHangfireServer(options =>
            {
                options.WorkerCount = 1;
            });

services.AddMassTransit(x =>
            {
                x.AddConsumer<SampleUpdateEventConsumer>();

                    x.AddBus(context => Bus.Factory.CreateUsingAzureServiceBus(cfg =>
                    {
                        cfg.Host(Configuration["ServiceBus:Host"], h =>
                        {
                            h.SharedAccessSignature(s =>
                            {
                                s.KeyName = Configuration["ServiceBus:AzureServiceBusKey"];
                                s.SharedAccessKey = Configuration["ServiceBus:AzureServiceBusAccessKey"];
                                s.TokenTimeToLive = TimeSpan.FromDays(1);
                                s.TokenScope = TokenScope.Namespace;
                            });
                        });

                        cfg.UseHangfireScheduler(Configuration["ServiceEndpoint"] + "_hangfire");

                        cfg.ReceiveEndpoint(Configuration["ServiceEndpoint"], ep =>
                        {
                            ep.UseScheduledRedelivery(r => r.Interval(Configuration.GetValue<int>("MessageRedeliveryCount"), TimeSpan.FromMinutes(1)));
                            ep.UseMessageRetry(r => r.Interval(2, TimeSpan.FromSeconds(5)));
                            ep.ConfigureConsumer<PatientProfileUpdateEventConsumer>(context);

                        });
                    }));
            });

services.AddMassTransitHostedService();

Environment: -Azure Postgres SQL (v11) -ASP.NET Core 3.1 -Azure App Service -Azure Service Bus -MassTransit 7.0.3 -Hangfire.AspNetCore 1.7.12 -Hangfire.PostgreSql 1.7.1 (also tried 1.7.0)

jmartins-sh commented 3 years ago

Hi @rizwanja, how are you doing?

Did you solve your problem? If you did may you let us know how do you solve it? I'm facing a similar problem!

Kind regards!

rizwanja commented 3 years ago

@joaoantoniomartinsfilho In the time that we had, we weren't able to find a suitable solution to this so our workaround was to just use Azure SQL instead of Postgres for our Hangfire DBs.

alexrosenfeld10 commented 3 years ago

we're facing the same problem ☹️

bduman commented 3 years ago

You need to dispose after creating each connection. In my case, I used like this.

using var connection = JobStorage.Current.GetConnection();

alexrosenfeld10 commented 3 years ago

I've checked my code numerous times, I do dispose. Also, worth noting, I never actually use the JobStorage.Current.GetConnection() method either.

frankhommers commented 2 years ago

Is this problem still there in 1.9.3 ?

alexrosenfeld10 commented 2 years ago

initial testing seems to indicate so!

strigefleur commented 2 years ago

Same happens for 1.8.6 - would love to have some attention on that fact, since we can't immediately migrate to pg6.

azygis commented 2 years ago

There is a package now for Npgsql5. Same versioning and features, just a different name because of a different dependency - Hangfire.PostgreSql.Npgsql5. Simply switch to that one until you're able to use Npgsql6.

strigefleur commented 2 years ago

Gotcha, thanks. I'll comment back as soon as we try this version.

azygis commented 2 years ago

Please do. The primary reason why I rewrote connection management was to actually release them once done with the jobs, but I'm not sure how the dashboard works (yet). If the issue persists, need to investigate more.

iam-black commented 2 years ago

Same thing after migrating to .NET 6. Disabling dashboard didn't help. Hangfire.PostgreSql 1.9.5, Npgsql.EntityFrameworkCore.PostgreSQL 6.0.2

mantasaudickas commented 2 years ago

I actually noticed interesting behavior: if I disable connection string pooling, my connection count starts growing and connections stays idle, but is never closed..

Normally I have these settings:

Pooling=true;ConnectionIdleLifetime=5;ConnectionPruningInterval=2;Maximum Pool Size=500

but for some tests I wanted to disable pooling at all:

Pooling=false;ConnectionIdleLifetime=5;ConnectionPruningInterval=2;Maximum Pool Size=500

which produced this graphic (while normally my connections are at about 50)..

azygis commented 2 years ago

This and linked issues seem to be related. We don't really do anything ourselves with the connection string/setup.

buddzbuddy commented 2 years ago

same issues

frankhommers commented 2 years ago

I don't know how to approach this... Does anyone has suggestions?

iam-black commented 2 years ago

I don't know how to approach this... Does anyone has suggestions?

It's not quite the fix but we switched to Redis storage for now (Hangfire.Redis.StackExchange).

azygis commented 2 years ago

It would be interesting to see the connection strings (with sensitive details modified, of course, so all the config params are visible), how hangfire is configured (both adding services and any other configuration) and the results from pg_stat_activity similar to how @iam-black showed, but with all columns (especially query_start). I wonder if the connection is actually idle, or just gets reused again and again and there might be issues with Npgsql connection pool management.

I cannot reproduce the issue with 1.9.5, the most idle hangfire-related connections I've had today was 10. They either get released or are reused eventually. Count for me is anything from 2 to 10 with 5 workers active, dashboard online all the time.

iam-black commented 2 years ago

It would be interesting to see the connection strings (with sensitive details modified, of course, so all the config params are visible), how hangfire is configured (both adding services and any other configuration) and the results from pg_stat_activity similar to how @iam-black showed, but with all columns (especially query_start)

$"Host={host};Port={port};Database={dbName};Username={user};Password={password};IntegratedSecurity=false;CommandTimeout=15;Pooling=false";

Output from pg_stat_activity postgres_pg_catalog_pg_stat_activity.xlsx

azygis commented 2 years ago

I will try disabling pooling and checking how it goes that way. With pooling on (default) I am still only using the same amount of connections as previously (2 - 10), haven't seen a leak so far for a few days.

Question - why is the pooling off? Just curious what's the use case of disabling it altogether, considering the act of opening new connections from scratch is quite expensive. Do you use pgbouncer or something like that?

P. S. At least I can't open the attached file, github dies with 500 (at the moment). Can you reattach the file?

mantasaudickas commented 2 years ago

It makes sense to disable connection pooling on multitenant applications, where each tenant has own database. Lets imagine if you have 1000 of these databases and you enable connection pooling - 1 request to each of these databases during pooling period will produce 1000 connections which is not acceptable. Thus aggressive pruning or pooling false are used in these cases.

iam-black commented 2 years ago

Question - why is the pooling off?

I am not aware about roots causes of that decision but I can see how it relates to what @mantasaudickas described. In our case this is a dev env and we have ~15 services using the same DB server. Job can be triggered in a multiple services at once. So I guess the reason is the same - to prevent connection exhaustion by background jobs.

Can you reattach the file?

Reattached as csv. postgres_pg_catalog_pg_stat_activity.csv

azygis commented 2 years ago

Thank you for clarification. Indeed it makes sense then to have pooling off, although I've seen some posts that this might also be needed if any other "middle-man" is used, like pgBouncer, pgpool, etc. Are any of you using that?

Unfortunately, I cannot reproduce the issue even with pooling off. I can barely find even a single row of hangfire connection in pg_stat_activity while being on dashboard this way. You could try checking Npgsql logs using their own logging provider (or making your own). I see the connection closed on every single query hangfire runs.

frankhommers commented 2 years ago

I think we need a PR for this. I am not really using the dashboard much.

evolvedlight commented 2 years ago

Hey,

We had this on our production application too - basically Hangfire would slowly use more and more connections until eventually we'd have port exhaustion and we'd have a production incident.

We had connection pooling turned off, because we are also using pgbouncer and the docs say that it's best to turn connection pooling off (https://www.npgsql.org/doc/compatibility.html#pgbouncer)

We had a job that looks like this:

        public Task Execute()
        {

            foreach (var profile in Enum.GetValues(typeof(ThingProfileEnum)).Cast<ThingProfileEnum>().ToList())
            {
                _logger.LogInformation("Enqueueing fetch job for profile: {Profile}", profile);
                BackgroundJob.Enqueue<ThingContactFetchJob>(j => j.FetchContacts(profile));
            }

            return Task.CompletedTask;
        }

Every call to BackgroundJob.Enqueue was creating a new connection which was never disposed. The graph of outgoing connections:

Hope this helps, if someone is interested I'm relatively sure that I could create a full sample project that could be F5-ed to reproduce it.

azygis commented 2 years ago

Please, if you can, do create the F5-able repro project. It would help a lot, since at least we are not hitting any issues in our project. Although our connection pool is not off (which seems to be a common denominator for the issue at hand), and briefly disabling it never raised anything either, and I do not have enough time to setup dummy pgBouncer or something similar.

evolvedlight commented 2 years ago

Please, if you can, do create the F5-able repro project. It would help a lot, since at least we are not hitting any issues in our project. Although our connection pool is not off (which seems to be a common denominator for the issue at hand), and briefly disabling it never raised anything either, and I do not have enough time to setup dummy pgBouncer or something similar.

Sure, no problem, here it is: https://github.com/evolvedlight/HangFirePostgresIssue We've tested it on PG13 and PG14, both the same, all libraries are the newest versions. As you see, it's a standard Asp.Net Core project that includes Hangfire + Postgres. It has a job that runs every minute that spawns 5 other jobs. I even made a little gif to hopefully show it working. Thanks :)

ercag commented 1 year ago

Does any bug-fix or workaround for this issue?

hasanmanzak commented 1 year ago

Every call to BackgroundJob.Enqueue was creating a new connection which was never disposed.

Not only Enqueue but the Schedule also. And BackgroundJobClient definetely makes use of a using block:

https://github.com/HangfireIO/Hangfire/blob/master/src/Hangfire.Core/BackgroundJobClient.cs#L146

OptimumDev commented 1 year ago

Found the same issue in our project, which schedules new jobs from another job. We upgraded Hangfire.PostgreSql from 1.8.4 to 1.9.9 and connections started leaking (event though we don't use hangfire dashboard) We don't use pooling because of pgbouncer. In another service, which uses recurring jobs just for some calculations (without creating new jobs), everything works fine with latest version - nothing leaks.

After some testing it looks like problem appeared in version 1.9.5

azygis commented 1 year ago

Still waiting for the new version to be pushed to NuGet by @frankhommers. Fixed a few deadlock scenarios in the latest merge, might have been fixed.

frankhommers commented 1 year ago

Done

OptimumDev commented 1 year ago

Unfortunatly nothing changed, connections are still leaking

azygis commented 1 year ago

Fair enough, didn't have high hopes.

Unfortunately I do not have the time to investigate loadbalancer behavior. We'll have to wait until someone has some will to do that. I think we use some sort of loadbalancer at work, but we did not disable the internal pooling in npgsql, hence never noticed such issues.

cubed-it commented 1 year ago

I can also confirm that with 1.9.10 there is no improvement on the topic of pgBouncer and disabled pooling. cl_active is constantly increasing until max_client_conn brings down all services at once. Therefore, we have temporarily enabled pooling for all hangfire connections. This has been running for a few hours without any further connection leak.

hangfire_leak

The question remains whether the leak is simply shifted into the service if one activated pooling.

b-rain-m commented 1 year ago

Since this is still an issue I decided to take a crack and figuring out what is going wrong. I checked out evolvedlight's repo as well as the Hangfire Postgres source code and added a bit of logging to play around.

I also created a query to see active connections:

select pid, xact_start, query from pg_stat_activity
    where datname = 'hangfiredemo'
            and pid <> pg_backend_pid();

The first thing I've observed is that the Dispose call in the following line leaves the connection open: https://github.com/frankhommers/Hangfire.PostgreSql/blob/c5fab5630a237888e16b6f7f5d76e7df6fefdd5c/src/Hangfire.PostgreSql/PostgreSqlStorage.cs#L393

Playing around connection.Close() does not fix the problem. Finally I attempted to force the connection closed with connection.Execute("SELECT pg_terminate_backend(pg_backend_pid());"); to see what would happen and started getting a transaction error.

Finally I reverted the ReleaseConnection code back to what it was, and commented out the transaction code.

With the transaction logic disabled the connections appeared to start closing.

Since it feels like an Npgsql issue, I downgraded to Npgsql 5.0 and the problem also seems to go away. Flipping back to Npgsql to 6.0 and the problem comes back.

b-rain-m commented 1 year ago

This is an Npgsql problem. To replicate run this console app with a breakpoint on the Hello World. Make sure to adjust the connection string as appropriate.

using Npgsql;
using System.Transactions;

using (var transactionScope = new TransactionScope())
{
    var connection = new NpgsqlConnection("User ID=postgres;Password=password;Server=localhost;Port=5432;Database=hangfiredemo;Integrated Security=true;Pooling=false;");

    connection.Open();
    connection.EnlistTransaction(Transaction.Current);

    var cmd = connection.CreateCommand();
    cmd.CommandText = "SELECT 'test-query'";
    cmd.ExecuteNonQuery();

    connection.Dispose();

    transactionScope.Complete();
}

Console.WriteLine("Hello, World!");

After the breakpoint is hit run this against the PostgreSQL server:

select pid, xact_start, query from pg_stat_activity
    where datname = 'hangfiredemo'
            and pid <> pg_backend_pid();

The problem appears to have been introduced in 6.0.0-preview6.

I've submitted the issue to the Npgsql repo: https://github.com/npgsql/npgsql/issues/4963#issue-1606257619

I've also found a workaround. Creating a transaction scope after the connection is open rather than before doesn't result in a connection leak:

using Npgsql;
using System.Transactions;

var connection = new NpgsqlConnection("User ID=postgres;Password=password;Server=localhost;Port=5432;Database=hangfiredemo;Integrated Security=true;Pooling=false;");

connection.Open();

using (var transactionScope = new TransactionScope())
{
    connection.EnlistTransaction(Transaction.Current);

    var cmd = connection.CreateCommand();
    cmd.CommandText = "SELECT 'test-query'";
    cmd.ExecuteNonQuery();

    transactionScope.Complete();
}

connection.Dispose();

Console.WriteLine("Hello, World!");

azygis commented 1 year ago

The new version has been pushed to the feed. I will leave this one active until someone is able to verify whether we still have the issue (hopefully no more).

azygis commented 1 year ago

Since I see it's been fixed in Npgsql (both 6 and 7) as well, it should be good. If it's still happening, feel free to reopen the issue.

dahlbyk commented 1 year ago

Since I see it's been fixed in Npgsql (both 6 and 7) as well, it should be good.

FYI for anyone not able to upgrade to Npgsql 7, the fix backported to 6.0.10 hasn't been released yet.

hangfire-postgres / Hangfire.PostgreSql

Hangfire dashboard - maxed out connection pools #163