npgsql / efcore.pg

Entity Framework Core provider for PostgreSQL
PostgreSQL License
1.57k stars 225 forks source link

Question about EnableRetryOnFailure #3248

Closed simon-k closed 3 weeks ago

simon-k commented 2 months ago

Hi I'm using npgsql in an Azure Function App, and I have some issues with EnableRetryOnFailure().

This is my code that is creating my EFCore DbContext and where I set the EnableRetryOnFailure option.

services.AddDbContext<ApplicationDbContext>((serviceProvider, options) =>
        {
            string connectionString = GetConnectionString(serviceProvider);
            options.UseNpgsql(connectionString, providerOptions =>
            {
                providerOptions.EnableRetryOnFailure();
            });               
        });

The retry logic simply does not seem to work for me. On rare occasions the connection to the DB fails with a System.Net.Socket.SocketException 11001 "No such host is known".

It should be a transient exception and therefore the retry logic should kick in, but it doesn't.

This blog How to do retries in EF Core describes the list of error codes that is consideres transient by EfCore, and I'm not sure my socket exception is included and the error codes does not seem to match this document with Npgsql error codes https://www.postgresql.org/docs/current/errcodes-appendix.html

What am I doing wrong here? Do I need to specify some explicit error codes in EnableRetryOnFailure(...) or is there something else I m missing?

roji commented 2 months ago

"No such host is known" isn't generally considered a transient error: this error indicates that the DNS hostname (e.g. foo.bar.com) could not be resolved to an IP address; retrying that DNS lookup right away generally will return the exact same results. In what scenario do you believe this is a transient error?

simon-k commented 2 months ago

In general I would agree with you, but in our case the DNS resolution is a transient error - at lease according to the Azure Support team. Our client is running as an App Service on Azure and it is using the Azure provided DNS service to resolve the PGSQL database URL to an IP address - and in rare ocations the DNS Server cannot resolve the IP.

Azure suport

Now, the issue is indeed transient and doesn’t last. The retry policy you described previously is expected to take care of the issue.

So if I just accept that the DNS we are using will in 0.001% og the time fails to resolve, and that it is a transient error, then I would really like the EnableRetryOnFailure to handle the issue for me. So which Error code should I pass on to the method as the errorCodesToAddargument for it to let it retry?

simon-k commented 2 months ago

So anyone knows what error code I should sepcify in this call to retry when I get the System.Net.Socket.SocketException 11001 exception?

optionsBuilder.UseNpgsql(connectionString, providerOptions =>
{
    providerOptions.EnableRetryOnFailure(errorCodesToAdd: ???? );
});
simon-k commented 2 months ago

Or could it be that the errorCodesToAdd in the NpgsqlRetryingExecutionStrategy only works on Postgres Exceptions and I therefore cannot rely on this to retry?

simon-k commented 2 months ago

"No such host is known" isn't generally considered a transient error: this error indicates that the DNS hostname (e.g. foo.bar.com) could not be resolved to an IP address; retrying that DNS lookup right away generally will return the exact same results. In what scenario do you believe this is a transient error?

When I look at the implemention of NpgsqlException it seems like socket exceptions are treated as transient exceptions. Or am I readign this code wrong?

    public virtual bool IsTransient
        => InnerException is IOException or SocketException or TimeoutException or NpgsqlException { IsTransient: true };

https://github.com/npgsql/npgsql/blob/3fce77d7ce92b4ac984ce3afb4364b23b791fb05/src/Npgsql/NpgsqlException.cs#L46

roji commented 1 month ago

@simon-k sorry for not answering for so long.

You're right that errorCodesToAdd is only for additional PostgreSQL error codes - but here you're trying to add a non-PG error code (DNS resolution failure).

However, you should be able to easily extend NpgsqlRetryingExecutionStrategy, and then pass that to ExecutionStrategy() inside UseNpgsql() (instead of calling EnableRetryOnFailure(), which just sets up NpgsqlRetryingExecutionStrategy). At that point you fully control what EF considers a transient error, and can decide to treat DNS issues as such.

Regardless, thanks for providing the communication with Azure support. I'll bring this up with the EF team to discuss whether it makes sense to treat DNS failures as transient by default.

roji commented 3 weeks ago

We've discussed this in the EF team and we think we need more feedback - opened https://github.com/dotnet/efcore/issues/34950 on the EF repo to do that.