hangfire-postgres / Hangfire.PostgreSql

PostgreSql Storage Provider for Hangfire

Exception when Using NpgsqlDataSource with ConnectionFactory #372

Open bweber opened 2 weeks ago

bweber commented 2 weeks ago

Is there a recommended way to implement the ConnectionFactory? Here is what I have done:

public class ConnectionFactory(IServiceProvider serviceProvider) : IConnectionFactory
{
    public NpgsqlConnection GetOrCreateConnection() =>
        serviceProvider.GetRequiredService<NpgsqlDataSource>().CreateConnection();
}

Both ConnectionFactory and NpgsqlDataSource are registered as singletons: services.AddSingleton<IConnectionFactory, ConnectionFactory>();

I am seeing a ton of connection exceptions in the logs:

Npgsql.NpgsqlException (0x80004005): The operation has timed out
 ---> System.TimeoutException: The operation has timed out.
   at Npgsql.ThrowHelper.ThrowNpgsqlExceptionWithInnerTimeoutException(String message)
   at Npgsql.Util.NpgsqlTimeout.Check()
   at Npgsql.Util.NpgsqlTimeout.CheckAndGetTimeLeft()
   at Npgsql.Util.NpgsqlTimeout.CheckAndApply(NpgsqlConnector connector)
   at Npgsql.Internal.NpgsqlConnector.<Open>g__OpenCore|213_1(NpgsqlConnector conn, SslMode sslMode, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken, Boolean isFirstAttempt)
   at Npgsql.Internal.NpgsqlConnector.Open(NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.PoolingDataSource.OpenNewConnector(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.PoolingDataSource.<Get>g__RentAsync|34_0(NpgsqlConnection conn, NpgsqlTimeout timeout, Boolean async, CancellationToken cancellationToken)
   at Npgsql.NpgsqlConnection.<Open>g__OpenAsync|42_0(Boolean async, CancellationToken cancellationToken)
   at Hangfire.PostgreSql.PostgreSqlStorage.CreateAndOpenConnection()
   at Hangfire.PostgreSql.PostgreSqlConnection.AcquireLock(String resource, TimeSpan timeout)
   at Hangfire.PostgreSql.PostgreSqlConnection.AcquireDistributedLock(String resource, TimeSpan timeout)
   at Hangfire.Server.RecurringJobScheduler.UseConnectionDistributedLock[T](JobStorage storage, Func`2 action) in C:\projects\hangfire-525\src\Hangfire.Core\Server\RecurringJobScheduler.cs:line 423
   at Hangfire.Server.RecurringJobScheduler.EnqueueNextRecurringJobs(BackgroundProcessContext context) in C:\projects\hangfire-525\src\Hangfire.Core\Server\RecurringJobScheduler.cs:line 203
   at Hangfire.Server.RecurringJobScheduler.Execute(BackgroundProcessContext context) in C:\projects\hangfire-525\src\Hangfire.Core\Server\RecurringJobScheduler.cs:line 176
   at Hangfire.Server.BackgroundProcessDispatcherBuilder.ExecuteProcess(Guid executionId, Object state) in C:\projects\hangfire-525\src\Hangfire.Core\Server\BackgroundProcessDispatcherBuilder.cs:line 82
   at Hangfire.Processing.BackgroundExecution.Run(Action`2 callback, Object state) in C:\projects\hangfire-525\src\Hangfire.Core\Processing\BackgroundExecution.cs:line 118

This seems to happen after a couple of minutes.

dmitry-vychikov commented 2 weeks ago

Is there something wrong with your hangfire instance? Does this exception affect how the jobs are processed?

Does it happen if you use just a connection string without connection factory?

Please attach all of your Hangfire configuration code. At the moment it is not clear how you pass the connection factory to Hangfire.

An example using a connection factory can be found here: https://github.com/hangfire-postgres/Hangfire.PostgreSql/issues/322#issuecomment-1710694316

bweber commented 2 weeks ago

We are using a managed identity with Google Cloud, so we are registering our NpgsqlDataSource like this:

    services.AddSingleton(BuildGoogleDataSource(configuration));

    private static NpgsqlDataSource BuildGoogleDataSource(IConfiguration configuration)
    {
        var credentials = GoogleCredential.GetApplicationDefault();
        var scopedCredentials = credentials.CreateScoped("https://www.googleapis.com/auth/sqlservice.login");

        var dataSourceBuilder = new NpgsqlDataSourceBuilder();
        dataSourceBuilder.UsePeriodicPasswordProvider((_, cancellationToken) =>
                new ValueTask<string>(scopedCredentials.UnderlyingCredential
                    .GetAccessTokenForRequestAsync(cancellationToken: cancellationToken)),
            TimeSpan.FromMinutes(1), TimeSpan.FromSeconds(0));

        dataSourceBuilder.ConnectionStringBuilder.ConnectionString = configuration.GetConnectionString("MyDatabase");

        return dataSourceBuilder.Build();
    }

Our Hangfire configuration is like this:

    services.AddSingleton<IConnectionFactory, ConnectionFactory>();

    services
        .AddHangfire((sp, options) =>
        {
            options.UsePostgreSqlStorage(
                o => o.UseConnectionFactory(sp.GetRequiredService<IConnectionFactory>()),
                new PostgreSqlStorageOptions { PrepareSchemaIfNecessary = false });
        })
        .AddHangfireServer(o => o.Queues = ["default"]);

The ConnectionFactory is in my original post.

The NpgsqlDataSource works perfectly in our health check and Entity Framework configuration. We only see connection exceptions when using it with Hangfire.

dmitry-vychikov commented 2 weeks ago

> We are only seeing connection exceptions using this with Hangfire.

So what is the real problem you are trying to solve? Is Hangfire not working for you? Also, what severity level is the exception logged with: error or warning?

This exception is coming from the distributed lock inside the recurring job scheduler. This might mean that you have two or more servers trying to perform scheduled operations, while only one manages to acquire the lock. That would not be a bug, just normal operation.

Alternatively, there has been some work in Hangfire to enable parallel execution of recurring jobs even within one server. Potentially, this could lead to similar exceptions, but I don't have experience with that new feature yet.

bweber commented 2 weeks ago

The main problem we are running into is that these surface as unhandled exceptions and trigger alerts in our application monitoring services. We are seeing the same exception from other processes as well, such as ServerHeartbeatProcess.

It doesn't seem to be affecting Hangfire itself: I can see it handling background jobs, and the dashboard works. It looks like some sort of connection timeout that isn't handled gracefully: the subsequent use of the connection throws an exception, and then it tries to get a new connection. Since this is internal to the Hangfire/Postgres library, my options for resolving it are limited.

I could put some sort of filter in our logging config in our appsettings to downgrade this to a warning, but that may mask other issues in the future.
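To keep such a filter from masking unrelated problems, it could match only this specific timeout rather than everything Hangfire logs. Below is a minimal sketch of that idea, not a tested solution. `HangfireTimeoutFilter` and `IsConnectionOpenTimeout` are hypothetical names, not part of Hangfire or Npgsql, and the stack frame string is taken from the trace in the original post. How the predicate is wired into a particular logging or monitoring pipeline is left open.

```csharp
using System;

// Hypothetical helper (not part of Hangfire or Npgsql): recognizes the
// connection-open timeout shown in the stack trace from the original post.
public static class HangfireTimeoutFilter
{
    // Frame copied from the posted stack trace; an assumption that it stays stable.
    private const string OpenFrame =
        "Hangfire.PostgreSql.PostgreSqlStorage.CreateAndOpenConnection";

    public static bool IsConnectionOpenTimeout(Exception? ex)
    {
        if (ex is null)
        {
            return false;
        }

        // Compared by type name so the sketch compiles without an Npgsql reference.
        bool isNpgsqlException = ex.GetType().FullName == "Npgsql.NpgsqlException";

        // The posted exception wraps a System.TimeoutException.
        bool wrapsTimeout = ex.InnerException is TimeoutException;

        // Only match timeouts raised while Hangfire's storage opens a connection.
        bool fromHangfireStorage =
            ex.StackTrace?.Contains(OpenFrame, StringComparison.Ordinal) ?? false;

        return isNpgsqlException && wrapsTimeout && fromHangfireStorage;
    }
}
```

A log filter or alerting rule could then lower the severity only when `IsConnectionOpenTimeout` returns true, leaving every other exception at its original level.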

Thoughts?