Closed: rizwanja closed this issue 1 year ago.
Hi @rizwanja, how are you doing?
Did you solve your problem? If you did, could you let us know how you solved it? I'm facing a similar problem!
Kind regards!
@joaoantoniomartinsfilho In the time that we had, we weren't able to find a suitable solution to this so our workaround was to just use Azure SQL instead of Postgres for our Hangfire DBs.
we're facing the same problem ☹️
You need to dispose after creating each connection. In my case, I used it like this:
using var connection = JobStorage.Current.GetConnection();
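The advice above is just the standard IDisposable using pattern. As a minimal, self-contained sketch of what "dispose after creating each connection" means (using a hypothetical stand-in class, since the real Hangfire JobStorage connection needs live storage behind it):

```csharp
using System;

// Hypothetical stand-in for a storage connection, to illustrate disposal only.
class FakeStorageConnection : IDisposable
{
    public bool Disposed { get; private set; }
    public void Dispose() => Disposed = true;
}

class Program
{
    static void Main()
    {
        FakeStorageConnection connection;
        using (connection = new FakeStorageConnection())
        {
            // ... use the connection here ...
        } // Dispose() is guaranteed to run when the block exits.
        Console.WriteLine(connection.Disposed); // True
    }
}
```

With `using var`, as in the snippet above, disposal happens at the end of the enclosing scope instead of an explicit block, but the effect is the same.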
I've checked my code numerous times, I do dispose. Also, worth noting, I never actually use the JobStorage.Current.GetConnection() method either.
Is this problem still present in 1.9.3?
initial testing seems to indicate so!
Same happens for 1.8.6 - would love to have some attention on that fact, since we can't immediately migrate to pg6.
There is a package now for Npgsql5. Same versioning and features, just a different name because of a different dependency: Hangfire.PostgreSql.Npgsql5. Simply switch to that one until you're able to use Npgsql6.
Gotcha, thanks. I'll comment back as soon as we try this version.
Please do. The primary reason why I rewrote connection management was to actually release them once done with the jobs, but I'm not sure how the dashboard works (yet). If the issue persists, need to investigate more.
Same thing after migrating to .NET 6. Disabling dashboard didn't help. Hangfire.PostgreSql 1.9.5, Npgsql.EntityFrameworkCore.PostgreSQL 6.0.2
I actually noticed interesting behavior: if I disable connection string pooling, my connection count starts growing and connections stay idle but are never closed.
Normally I have these settings:
Pooling=true;ConnectionIdleLifetime=5;ConnectionPruningInterval=2;Maximum Pool Size=500
but for some tests I wanted to disable pooling at all:
Pooling=false;ConnectionIdleLifetime=5;ConnectionPruningInterval=2;Maximum Pool Size=500
which produced this graph (while normally my connections sit at about 50).
This and linked issues seem to be related. We don't really do anything ourselves with the connection string/setup.
same issues
I don't know how to approach this... Does anyone has suggestions?
It's not quite the fix but we switched to Redis storage for now (Hangfire.Redis.StackExchange).
It would be interesting to see the connection strings (with sensitive details redacted, of course, so all the config params are visible), how Hangfire is configured (both adding services and any other configuration), and the results from pg_stat_activity, similar to how @iam-black showed, but with all columns (especially query_start). I wonder if the connection is actually idle, or just gets reused again and again, and whether there might be issues with Npgsql connection pool management.
I cannot reproduce the issue with 1.9.5, the most idle hangfire-related connections I've had today was 10. They either get released or are reused eventually. Count for me is anything from 2 to 10 with 5 workers active, dashboard online all the time.
It would be interesting to see the connection strings (with sensitive details redacted, of course, so all the config params are visible), how Hangfire is configured (both adding services and any other configuration), and the results from pg_stat_activity, similar to how @iam-black showed, but with all columns (especially query_start).
$"Host={host};Port={port};Database={dbName};Username={user};Password={password};IntegratedSecurity=false;CommandTimeout=15;Pooling=false";
Output from pg_stat_activity postgres_pg_catalog_pg_stat_activity.xlsx
I will try disabling pooling and checking how it goes that way. With pooling on (default) I am still only using the same amount of connections as previously (2 - 10), haven't seen a leak so far for a few days.
Question - why is the pooling off? Just curious what's the use case of disabling it altogether, considering the act of opening new connections from scratch is quite expensive. Do you use pgbouncer or something like that?
P. S. At least I can't open the attached file, github dies with 500 (at the moment). Can you reattach the file?
It makes sense to disable connection pooling in multitenant applications, where each tenant has its own database. Let's imagine you have 1000 of these databases and you enable connection pooling: one request to each of these databases during the pooling period will produce 1000 connections, which is not acceptable. Thus aggressive pruning or Pooling=false is used in these cases.
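The arithmetic behind that concern can be sketched as follows (the numbers are illustrative, not from any real deployment):

```csharp
using System;

class Program
{
    static void Main()
    {
        // Illustrative numbers only: one database per tenant.
        int tenantDatabases = 1000;
        // With pooling on, even a single request per tenant can keep at
        // least one pooled connection alive per database during the
        // idle/pruning window.
        int idlePooledPerDatabase = 1;
        int withPooling = tenantDatabases * idlePooledPerDatabase;

        // With Pooling=false, each connection is physically closed on
        // Dispose, so the idle count between requests is zero.
        int withoutPooling = 0;

        Console.WriteLine($"{withPooling} vs {withoutPooling}");
    }
}
```

So the server-side idle-connection cost of pooling scales with the number of tenant databases, which is why aggressive pruning or disabling pooling is attractive in that scenario.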
Question - why is the pooling off?
I am not aware of the root causes of that decision, but I can see how it relates to what @mantasaudickas described. In our case this is a dev environment and we have ~15 services using the same DB server. A job can be triggered in multiple services at once. So I guess the reason is the same: to prevent connection exhaustion by background jobs.
Can you reattach the file?
Reattached as csv. postgres_pg_catalog_pg_stat_activity.csv
Thank you for the clarification. Indeed, it makes sense then to have pooling off, although I've seen some posts that this might also be needed if some other "middle-man" is used, like pgBouncer, pgpool, etc. Are any of you using that?
Unfortunately, I cannot reproduce the issue even with pooling off. I can barely find even a single row of a Hangfire connection in pg_stat_activity while on the dashboard this way. You could try checking the Npgsql logs using their own logging provider (or making your own). I see the connection closed after every single query Hangfire runs.
I think we need a PR for this. I am not really using the dashboard much.
Hey,
We had this on our production application too - basically Hangfire would slowly use more and more connections until eventually we'd have port exhaustion and we'd have a production incident.
We had connection pooling turned off, because we are also using pgbouncer and the docs say that it's best to turn connection pooling off (https://www.npgsql.org/doc/compatibility.html#pgbouncer)
We had a job that looks like this:
public Task Execute()
{
foreach (var profile in Enum.GetValues(typeof(ThingProfileEnum)).Cast<ThingProfileEnum>().ToList())
{
_logger.LogInformation("Enqueueing fetch job for profile: {Profile}", profile);
BackgroundJob.Enqueue<ThingContactFetchJob>(j => j.FetchContacts(profile));
}
return Task.CompletedTask;
}
Every call to BackgroundJob.Enqueue was creating a new connection which was never disposed. The graph of outgoing connections:
Hope this helps, if someone is interested I'm relatively sure that I could create a full sample project that could be F5-ed to reproduce it.
Please, if you can, do create the F5-able repro project. It would help a lot, since at least we are not hitting any issues in our project. Although our connection pool is not off (which seems to be a common denominator for the issue at hand), and briefly disabling it never raised anything either, and I do not have enough time to setup dummy pgBouncer or something similar.
Sure, no problem, here it is: https://github.com/evolvedlight/HangFirePostgresIssue We've tested it on PG13 and PG14, both the same, all libraries are the newest versions. As you see, it's a standard Asp.Net Core project that includes Hangfire + Postgres. It has a job that runs every minute that spawns 5 other jobs. I even made a little gif to hopefully show it working. Thanks :)
Is there any bug-fix or workaround for this issue?
Every call to BackgroundJob.Enqueue was creating a new connection which was never disposed.
Not only Enqueue, but also Schedule. And BackgroundJobClient definitely makes use of a using block:
https://github.com/HangfireIO/Hangfire/blob/master/src/Hangfire.Core/BackgroundJobClient.cs#L146
Found the same issue in our project, which schedules new jobs from another job. We upgraded Hangfire.PostgreSql from 1.8.4 to 1.9.9 and connections started leaking (even though we don't use the Hangfire dashboard). We don't use pooling because of pgbouncer. In another service, which uses recurring jobs just for some calculations (without creating new jobs), everything works fine with the latest version: nothing leaks.
After some testing, it looks like the problem appeared in version 1.9.5.
Still waiting for the new version to be pushed to NuGet by @frankhommers. Fixed a few deadlock scenarios in the latest merge, might have been fixed.
Done
Unfortunately nothing changed; connections are still leaking.
Fair enough, didn't have high hopes.
Unfortunately I do not have the time to investigate loadbalancer behavior. We'll have to wait until someone has some will to do that. I think we use some sort of loadbalancer at work, but we did not disable the internal pooling in npgsql, hence never noticed such issues.
I can also confirm that with 1.9.10 there is no improvement on the topic of pgBouncer and disabled pooling. cl_active is constantly increasing until max_client_conn brings down all services at once.
Therefore, we have temporarily enabled pooling for all hangfire connections. This has been running for a few hours without any further connection leak.
The question remains whether the leak is simply shifted into the service if pooling is enabled.
Since this is still an issue, I decided to take a crack at figuring out what is going wrong. I checked out evolvedlight's repo as well as the Hangfire Postgres source code and added a bit of logging to play around.
I also created a query to see active connections:
select pid, xact_start, query from pg_stat_activity
where datname = 'hangfiredemo'
and pid <> pg_backend_pid();
The first thing I've observed is that the Dispose call in the following line leaves the connection open: https://github.com/frankhommers/Hangfire.PostgreSql/blob/c5fab5630a237888e16b6f7f5d76e7df6fefdd5c/src/Hangfire.PostgreSql/PostgreSqlStorage.cs#L393
Playing around with connection.Close() does not fix the problem either.
Next, I attempted to force the connection closed with connection.Execute("SELECT pg_terminate_backend(pg_backend_pid());") to see what would happen, and started getting a transaction error.
Finally I reverted the ReleaseConnection code back to what it was, and commented out the transaction code.
With the transaction logic disabled the connections appeared to start closing.
Since it feels like an Npgsql issue, I downgraded to Npgsql 5.0 and the problem also seems to go away. Flipping back to Npgsql 6.0, the problem comes back.
This is an Npgsql problem. To replicate run this console app with a breakpoint on the Hello World. Make sure to adjust the connection string as appropriate.
using Npgsql;
using System.Transactions;
using (var transactionScope = new TransactionScope())
{
var connection = new NpgsqlConnection("User ID=postgres;Password=password;Server=localhost;Port=5432;Database=hangfiredemo;Integrated Security=true;Pooling=false;");
connection.Open();
connection.EnlistTransaction(Transaction.Current);
var cmd = connection.CreateCommand();
cmd.CommandText = "SELECT 'test-query'";
cmd.ExecuteNonQuery();
connection.Dispose();
transactionScope.Complete();
}
Console.WriteLine("Hello, World!");
After the breakpoint is hit run this against the PostgreSQL server:
select pid, xact_start, query from pg_stat_activity
where datname = 'hangfiredemo'
and pid <> pg_backend_pid();
The problem appears to have been introduced in 6.0.0-preview6.
I've submitted the issue to the Npgsql repo: https://github.com/npgsql/npgsql/issues/4963#issue-1606257619
I've also found a workaround. Creating a transaction scope after the connection is open rather than before doesn't result in a connection leak:
using Npgsql;
using System.Transactions;
var connection = new NpgsqlConnection("User ID=postgres;Password=password;Server=localhost;Port=5432;Database=hangfiredemo;Integrated Security=true;Pooling=false;");
connection.Open();
using (var transactionScope = new TransactionScope())
{
connection.EnlistTransaction(Transaction.Current);
var cmd = connection.CreateCommand();
cmd.CommandText = "SELECT 'test-query'";
cmd.ExecuteNonQuery();
transactionScope.Complete();
}
connection.Dispose();
Console.WriteLine("Hello, World!");
The new version has been pushed to the feed. I will leave this one active until someone is able to verify whether we still have the issue (hopefully no more).
Since I see it's been fixed in Npgsql (both 6 and 7) as well, it should be good. If it's still happening, feel free to reopen the issue.
FYI for anyone not able to upgrade to Npgsql 7, the fix backported to 6.0.10 hasn't been released yet.
When configuring Hangfire in our ASP.NET Core app, we see dozens of idle connections in postgres.
We're using MassTransit with our server worker count set to 1. With four apps with this configuration and MassTransit using 5 connections each, we should expect to see around 20 connections or so for Hangfire. However, we see our connection pool maxed out at 100 connections, mostly idle Hangfire ones.
We had initially thought this might have been happening while configuring the Hangfire dashboard, but that was not the case. Once the service is running, the idle connections continue to increase until they run out.
Note that we have a Hangfire DB for each service that has a Hangfire server configured.
We tried using both v1.7.0 and v1.7.1 of the NuGet package with the same issue.
Relevant Startup.cs Services config sections:
Environment:
- Azure Postgres SQL (v11)
- ASP.NET Core 3.1
- Azure App Service
- Azure Service Bus
- MassTransit 7.0.3
- Hangfire.AspNetCore 1.7.12
- Hangfire.PostgreSql 1.7.1 (also tried 1.7.0)