madelson / DistributedLock

A .NET library for distributed synchronization
MIT License

Increase in lost lock after upgrading to .NET 7, EF Core 7 and Microsoft.Data.SqlClient 5.0.1 #160

Closed. OskarKlintrot closed this issue 1 year ago.

OskarKlintrot commented 1 year ago

We've seen a dramatic increase in the number of lost locks since we upgraded to .NET 7 and EF Core 7. EF Core 7 uses Microsoft.Data.SqlClient (>= 5.0.1), whereas EF Core 6 used Microsoft.Data.SqlClient (>= 2.1.4) (DistributedLock.SqlServer has a dependency on Microsoft.Data.SqlClient (>= 2.1.1)).

We started upgrading our services before Christmas, and around the same time the number of lost locks started to increase from a few a week to 10-50 a day.

(image: graph of lost lock events over time)

The only changes we have made that we think might have an impact are upgrading from .NET 6 to .NET 7 and Microsoft.Data.SqlClient from 4.1.0 to 5.0.1.

hoerup commented 1 year ago

Have you tried with Microsoft.Data.SqlClient 5.1.0-preview2?

https://github.com/dotnet/SqlClient/releases/tag/v5.1.0-preview2

OskarKlintrot commented 1 year ago

No, I haven't. For some reason we don't lose locks in our test env, only in our acceptance test env (and production, which the graph is from). I'll deploy the two worst offenders with the preview version to the acceptance test env, and we'll see in a few days if it makes any difference.

madelson commented 1 year ago

Can you share more detail?

OskarKlintrot commented 1 year ago

Confirm that you are using the latest DistributedLock.SqlServer

Yes, version 1.0.2.

How are you creating the lock? How are you acquiring the lock?

private async Task<IDistributedSynchronizationHandle?> GetHandleAsync(CancellationToken cancellationToken)
{
    using var _ = _logger.BeginScope(_state);

    while (!cancellationToken.IsCancellationRequested)
    {
        var failure = true;

        try
        {
            _logger.LogInformation("Trying to acquire lock for {Job Name}", _jobName);

            var handle = await _distributedLockProvider.AcquireLockAsync(
                _lockName,
                timeout: null,
                cancellationToken: cancellationToken);

            failure = false;

            return handle;
        }
        // Multiple catch clauses
        finally
        {
            if (failure)
            {
                await Delay.TryWait(TimeSpan.FromSeconds(5), cancellationToken);
            }
        }
    }

    return null;
}

How are you detecting lost locks? HandleLostToken I assume?

That's correct:

cancellationTokenRegistration = handle.HandleLostToken.Register(() =>
{
    // HandleLostToken.Register is too slow to use for anything other than logging.
    _logger.LogError("Lost lock for job {Job Name}.", _jobName);

    _telemetryClient.TrackEvent(new($"Lost lock for {_jobName}")
    {
        Properties =
        {
            { "Job Name", _jobname },
            { "Lock State", "Lost" }
        },
    });
});

What exceptions are you seeing?

I can't find any logged exceptions. I don't think we are suppressing any. Actually, I think we might be, doh... Once you've got the handle, are there any exceptions being thrown?

I made this repo for a previous bug but that's pretty much still how we are using the library.

OskarKlintrot commented 1 year ago

I noticed something really interesting/weird. Two-thirds of the losses from the two services that lose the most locks occur around midnight. In one env it's 23:45-00:00 and in another it's 00:00-00:05. I have no clue if it's related to this issue, though; probably not. I wonder if Azure is doing something around midnight...

madelson commented 1 year ago

Once you've got the handle, are there any exceptions being thrown?

If the connection is broken, I would expect disposing the handle to throw an exception since it will try to execute the release sproc and that will fail. Are you seeing any exceptions there?

In one env it's 23:45-00:00 and in another it's 00:00-00:05

Are you able to repro this at all locally? I assume not since you said just in acceptance test...

How are you detecting lost locks? HandleLostToken I assume? That's correct:

Is it possible for you to try and repro this without subscribing to HandleLostToken? I want to try and isolate whether this is an issue that is being caused by handle monitoring or whether it is being detected by handle monitoring.

OskarKlintrot commented 1 year ago

If the connection is broken, I would expect disposing the handle to throw an exception since it will try to execute the release sproc and that will fail. Are you seeing any exceptions there?

I made a mistake and didn't realize that an exception could be thrown from there. I accidentally suppress any exceptions thrown during dispose. I'll have to fix that and deploy the services.
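
A minimal sketch of what that fix might look like, reusing the _logger/_jobName names from the snippets above (the point is just to log release failures instead of swallowing them):

try
{
    // If the connection died while the lock was held, the release sproc fails here.
    await handle.DisposeAsync();
}
catch (Exception ex)
{
    _logger.LogError(ex, "Failed to release lock for job {Job Name}.", _jobName);
    throw;
}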

Are you able to repro this at all locally? I assume not since you said just in acceptance test...

Unfortunately, no. Maybe if I leave a job running for a few days on a computer, but I hope I can get some exceptions that make sense from production instead.

Is it possible for you to try and repro this without subscribing to HandleLostToken? I want to try and isolate whether this is an issue that is being caused by handle monitoring or whether it is being detected by handle monitoring.

Sure, I think I could just use .WasLost() instead so I can still log when the lock has been lost. I'll start with logging the exceptions to get a picture of what is happening before I make the change.
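
For context, WasLost() is our own helper rather than a DistributedLock API; a hypothetical sketch of it, assuming it simply polls the handle's HandleLostToken on demand:

using Medallion.Threading;

internal static class DistributedLockHandleExtensions
{
    // Hypothetical in-house helper (not a DistributedLock API): reports whether the
    // handle's lost-token has already fired, checked on demand rather than via Register.
    public static bool WasLost(this IDistributedSynchronizationHandle handle)
        => handle.HandleLostToken.IsCancellationRequested;
}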

madelson commented 1 year ago

Sounds good @OskarKlintrot. FWIW I ran the unit test suite with the 5.0.1 upgrade and everything passes, so no luck there.

Another thing that would be interesting to test on your end is to try different versions of Microsoft.Data.SqlClient and see which one introduced the behavior.

There are a lot of different suggestions flying around here, so I thought it might make sense to track them with a list:

hoerup commented 1 year ago

@OskarKlintrot We did see some weird SQL connectivity issues in one of our microservices after upgrading to .NET 7. We have just upgraded that service to the newly released Microsoft.Data.SqlClient 5.1.0, and initial tests indicate that the issue is gone.

OskarKlintrot commented 1 year ago

I let it run for two days, and on day 2 the behaviour came back. So no luck there, I would say.

Since I can't go any lower than the version required by EF Core 7, I don't think this is feasible, unfortunately.

BeginExecuteNonQuery requires an open and available Connection. The connection's current state is closed.
(screenshot of the logged exception)

Stack trace:

System.InvalidOperationException:
   at Microsoft.Data.SqlClient.SqlCommand.ValidateCommand (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryInternal (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Threading.Tasks.TaskFactory`1.FromAsyncImpl (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQueryAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Medallion.Threading.Internal.Data.DatabaseCommand+<ExecuteAsync>d__12`1.MoveNext (DistributedLock.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.Core/Internal/Data/DatabaseCommand.cs:151)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable`1+ConfiguredValueTaskAwaiter.GetResult (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Medallion.Threading.SqlServer.SqlApplicationLock+<ExecuteReleaseCommandAsync>d__16.MoveNext (DistributedLock.SqlServer, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.SqlServer/SqlApplicationLock.cs:71)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Medallion.Threading.Internal.Data.MultiplexedConnectionLock+<ReleaseAsync>d__12`1.MoveNext (DistributedLock.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.Core/Internal/Data/MultiplexedConnectionLock.cs:159)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Medallion.Threading.Internal.Data.MultiplexedConnectionLock+<ReleaseAsync>d__12`1.MoveNext (DistributedLock.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.Core/Internal/Data/MultiplexedConnectionLock.cs:173)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at My code+<ExecuteAsync>d__9.MoveNext (My Assembly, Version=7.1.2.0, Culture=neutral, PublicKeyToken=null: /_/Hosting/My Assembly/Lock/NonConcurrent/NonConcurrentLock.cs:115)
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at My Code+<ExecuteContinuously>d__15.MoveNext (My Assembly, Version=7.1.2.0, Culture=neutral, PublicKeyToken=null: /_/Hosting/My Assembly/CronJob.cs:324)

Do you still think this is worth doing? (see below)

On hold (see below)


Given that the exceptions (there are some others as well from another occasion, all related to DB connection issues; Azure admits the DB became unavailable at that time) point to some hiccup with the DB connection, I doubt that this issue has anything to do with DistributedLock; it looks like Microsoft.Data.SqlClient and Azure SQL Server instead. Do you (@hoerup and @madelson) agree with me? If so, I think we can close this issue. I can always re-open it if I find something new.

OskarKlintrot commented 1 year ago

@OskarKlintrot We did see some weird SQL connectivity issues in one of our microservices after upgrading to .NET 7. We have just upgraded that service to the newly released Microsoft.Data.SqlClient 5.1.0, and initial tests indicate that the issue is gone.

@hoerup Oh, that's very interesting! Are you also using Azure SQL, perhaps? When I upgraded to the latest preview, everything seemed fine for a day until the issues came back. Please let me know if it works for you! If it does, we will probably also upgrade to 5.1.0.

OskarKlintrot commented 1 year ago

This issue seems somewhat similar to what I am seeing; I'm not sure if it's related, though. Is there any retry logic in DistributedLock.SqlServer, btw?

hoerup commented 1 year ago

@OskarKlintrot We did see some weird SQL connectivity issues in one of our microservices after upgrading to .NET 7. We have just upgraded that service to the newly released Microsoft.Data.SqlClient 5.1.0, and initial tests indicate that the issue is gone.

@hoerup Oh, that's very interesting! Are you also using Azure SQL, perhaps? When I upgraded to the latest preview, everything seemed fine for a day until the issues came back. Please let me know if it works for you! If it does, we will probably also upgrade to 5.1.0.

I might have been too fast there; just saw some connection errors again :(

We are using SQL Server on-prem.

OskarKlintrot commented 1 year ago

I might have been too fast there; just saw some connection errors again :(

Too bad!

We are using SQL Server on-prem.

Sounds like the issue is with Microsoft.Data.SqlClient then. Are you seeing the issues from places other than DistributedLock.SqlServer as well? We do; we see them from EF as well.

madelson commented 1 year ago

Is there any retry-logic in DistributedLock.SqlServer btw?

There isn't any retry logic. In particular, once we grab the lock, if the connection dies we can't really retry at that point: even if we re-acquire the lock, someone else could still have held it during the blip, so we want to report it as a lost handle.
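
For what it's worth, a rough sketch of how a caller can react instead (hypothetical DoWorkAsync and stoppingToken; other names borrowed from your snippets above): cancel the protected work when the handle-lost token fires and loop back to acquiring the lock.

// Sketch only: if the handle is lost, cancel the protected work via a linked token
// and go back to acquiring the lock from scratch; the library itself won't retry.
while (!stoppingToken.IsCancellationRequested)
{
    await using var handle = await _distributedLockProvider.AcquireLockAsync(
        _lockName, cancellationToken: stoppingToken);

    using var linked = CancellationTokenSource.CreateLinkedTokenSource(
        stoppingToken, handle.HandleLostToken);

    try
    {
        await DoWorkAsync(linked.Token); // hypothetical job body
    }
    catch (OperationCanceledException) when (handle.HandleLostToken.IsCancellationRequested)
    {
        _logger.LogError("Lost lock for job {Job Name}; re-acquiring.", _jobName);
    }
}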

I doubt that this issue has anything to do with DistributedLock but rather Microsoft.Data.SqlClient and Azure SQL Server

we see them from EF as well.

@OskarKlintrot that does seem likely if you're encountering a spike across uses of Microsoft.Data.SqlClient and not just in DistributedLock.

As a next step, I think it makes sense to report an issue on the Microsoft.Data.SqlClient GitHub and link it here. Do you agree? Then we can close this for now while that plays out. The issue you already linked does seem related, but that was closed without action. If we were able to pin down the appearance of the problem to a specific release of the client, that might help it gain traction, as would creating a standalone repro that can run against Azure.
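
For reference, a minimal standalone repro might look roughly like this (it assumes a connection string in a SQL_CONNECTION_STRING environment variable and an arbitrary lock name; not the exact setup from this thread):

// Minimal repro sketch: acquire a SQL Server distributed lock, hold it, and log
// both handle-lost notifications and any exceptions thrown on acquire/hold/dispose.
using Medallion.Threading;
using Medallion.Threading.SqlServer;

var connectionString = Environment.GetEnvironmentVariable("SQL_CONNECTION_STRING")
    ?? throw new InvalidOperationException("SQL_CONNECTION_STRING is not set.");

var provider = new SqlDistributedSynchronizationProvider(connectionString);

while (true)
{
    try
    {
        await using var handle = await provider.AcquireLockAsync("repro-lock");
        Console.WriteLine($"{DateTime.UtcNow:O} acquired");

        using var registration = handle.HandleLostToken.Register(
            () => Console.WriteLine($"{DateTime.UtcNow:O} handle lost"));

        // Hold the lock long enough for connection blips to show up.
        await Task.Delay(TimeSpan.FromMinutes(10));
    }
    catch (Exception ex)
    {
        Console.WriteLine($"{DateTime.UtcNow:O} exception: {ex}");
        await Task.Delay(TimeSpan.FromSeconds(5));
    }
}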

madelson commented 1 year ago

@OskarKlintrot any updates here? Did you end up filing an issue with Microsoft?

OskarKlintrot commented 1 year ago

Still working on it, got sidetracked with a man cold :)

madelson commented 1 year ago

@OskarKlintrot sorry to hear that; hope you feel better soon :-)

OskarKlintrot commented 1 year ago

Sorry, got buried in other work. Closing this for now and tracking the issue here. I might re-open this issue if the Microsoft.Data.SqlClient team doesn't think it has anything to do with them.

madelson commented 1 year ago

Thanks!

JRahnama commented 1 year ago

@hoerup will you be able to provide a minimal sample that could repro the issue?