Closed OskarKlintrot closed 1 year ago
Have you tried with Microsoft.Data.SqlClient 5.1.0-preview2 ?
https://github.com/dotnet/SqlClient/releases/tag/v5.1.0-preview2
No, I haven't. For some reason we don't lose locks in our test env, but we do in our acceptance test env (and production, which the graph is from). I'll deploy the two worst offenders with the preview version to the acceptance test env and we'll see in a few days if it makes any difference.
Can you share more detail?
Confirm that you are using the latest DistributedLock.SqlServer
Yes, version 1.0.2.
How are you creating the lock? How are you acquiring the lock?
private async Task<IDistributedSynchronizationHandle?> GetHandleAsync(CancellationToken cancellationToken)
{
    using var _ = _logger.BeginScope(_state);
    while (!cancellationToken.IsCancellationRequested)
    {
        var failure = true;
        try
        {
            _logger.LogInformation("Trying to acquire lock for {Job Name}", _jobName);
            var handle = await _distributedLockProvider.AcquireLockAsync(
                _lockName,
                timeout: null,
                cancellationToken: cancellationToken);
            failure = false;
            return handle;
        }
        // Multiple catch clauses
        finally
        {
            if (failure)
            {
                await Delay.TryWait(TimeSpan.FromSeconds(5), cancellationToken);
            }
        }
    }
    return null;
}
How are you detecting lost locks? HandleLostToken I assume? That's correct:
cancellationTokenRegistration = handle.HandleLostToken.Register(() =>
{
    // HandleLostToken.Register is too slow to use for anything other than logging.
    _logger.LogError("Lost lock for job {Job Name}.", _jobName);
    _telemetryClient.TrackEvent(new($"Lost lock for {_jobName}")
    {
        Properties =
        {
            { "Job Name", _jobName },
            { "Lock State", "Lost" }
        },
    });
});
What exceptions are you seeing?
I can't find any logged exceptions. I don't think we are suppressing any. Actually, I think we might be, doh...
I made this repo for a previous bug but that's pretty much still how we are using the library.
I noticed something really interesting/weird. Two thirds of the losses for the two services that lose the most locks occur around midnight. In one env it's 23:45-00:00 and in another it's 00:00-00:05. I have no clue if it's related to this issue though, probably not. I wonder if Azure is doing something around midnight...
Once you've got the handle, are there any exceptions being thrown?
If the connection is broken, I would expect disposing the handle to throw an exception since it will try to execute the release sproc and that will fail. Are you seeing any exceptions there?
In one env it's 23:45-00:00 and in another it's 00:00-00:05
Are you able to repro this at all locally? I assume not since you said just in acceptance test...
How are you detecting lost locks? HandleLostToken I assume? That's correct:
Is it possible for you to try and repro this without subscribing to HandleLostToken? I want to try and isolate whether this is an issue that is being caused by handle monitoring or whether it is being detected by handle monitoring.
If the connection is broken, I would expect disposing the handle to throw an exception since it will try to execute the release sproc and that will fail. Are you seeing any exceptions there?
I made a mistake and didn't realize that there could be an exception thrown from there. I accidentally suppress any exceptions thrown during dispose. I'll have to fix that and redeploy the services.
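For reference, a minimal sketch of logging release errors instead of swallowing them. `handle` here is the `IDistributedSynchronizationHandle` from the acquire call above; the logger and field names are illustrative:

```csharp
// Sketch: surface exceptions thrown while releasing the lock rather than
// suppressing them. If the underlying connection died, executing the release
// sproc will fail here, so this is often the first place a broken connection
// becomes visible.
try
{
    await handle.DisposeAsync();
}
catch (Exception ex)
{
    // Don't rethrow from a cleanup path, but do record what happened.
    _logger.LogError(ex, "Failed to release lock for job {Job Name}.", _jobName);
}
```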
Are you able to repro this at all locally? I assume not since you said just in acceptance test...
Unfortunately, no. Maybe if I leave a job running for a few days on a computer, but I hope I can get some exceptions that make sense from production instead.
Is it possible for you to try and repro this without subscribing to HandleLostToken? I want to try and isolate whether this is an issue that is being caused by handle monitoring or whether it is being detected by handle monitoring.
Sure, I think I could just use .WasLost() instead so I can still log when the lock has been lost. I'll start with logging the exceptions to get a picture of what is happening before I change anything.
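`WasLost()` isn't part of the library's public API; assuming it is a small local extension over `HandleLostToken`, it could look something like this:

```csharp
// Hypothetical extension method: check on demand whether the handle has been
// lost, instead of registering a callback up front. Note that accessing
// HandleLostToken still starts DistributedLock's handle monitoring lazily.
public static class DistributedLockHandleExtensions
{
    public static bool WasLost(this IDistributedSynchronizationHandle handle)
        => handle.HandleLostToken.IsCancellationRequested;
}
```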
Sounds good @OskarKlintrot . FWIW I ran the unit test suite with the 5.0.1 upgrade and everything passes, so no luck there.
Another thing that would be interesting to test on your end is to try different versions of Microsoft.Data.SqlClient and see which one introduced the behavior.
There are a lot of different suggestions flying around here so I thought it might make sense to track with a list:
[ ] Log exceptions from Dispose() and report back
[ ] Try running without the HandleLostToken subscription (but still logging Dispose exceptions to get a delayed sense of lock losses)
@OskarKlintrot We did see some weird SQL connectivity issues in one of our microservices after upgrading to net7. We have just upgraded that service to the newly released Microsoft.Data.SqlClient 5.1.0 - and initial tests indicate that the issue is gone.
I let it run for two days and on day 2 the behaviour came back. So no luck there, I would say.
Since I can't go any lower than the version required by EF Core 7, I don't think this is feasible, unfortunately.
[x] Log exceptions from Dispose() and report back
There don't seem to be any exceptions from Dispose(). On the same instance that lost its locks, and at the same time, I now see this exception:
BeginExecuteNonQuery requires an open and available Connection. The connection's current state is closed.
Stack trace:
System.InvalidOperationException:
at Microsoft.Data.SqlClient.SqlCommand.ValidateCommand (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQuery (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryInternal (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at Microsoft.Data.SqlClient.SqlCommand.BeginExecuteNonQueryAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at System.Threading.Tasks.TaskFactory`1.FromAsyncImpl (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Microsoft.Data.SqlClient.SqlCommand.InternalExecuteNonQueryAsync (Microsoft.Data.SqlClient, Version=5.0.0.0, Culture=neutral, PublicKeyToken=23ec7fc2d6eaa4a5)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Medallion.Threading.Internal.Data.DatabaseCommand+<ExecuteAsync>d__12`1.MoveNext (DistributedLock.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.Core/Internal/Data/DatabaseCommand.cs:151)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.ConfiguredValueTaskAwaitable`1+ConfiguredValueTaskAwaiter.GetResult (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Medallion.Threading.SqlServer.SqlApplicationLock+<ExecuteReleaseCommandAsync>d__16.MoveNext (DistributedLock.SqlServer, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.SqlServer/SqlApplicationLock.cs:71)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Medallion.Threading.Internal.Data.MultiplexedConnectionLock+<ReleaseAsync>d__12`1.MoveNext (DistributedLock.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.Core/Internal/Data/MultiplexedConnectionLock.cs:159)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at Medallion.Threading.Internal.Data.MultiplexedConnectionLock+<ReleaseAsync>d__12`1.MoveNext (DistributedLock.Core, Version=1.0.0.0, Culture=neutral, PublicKeyToken=12bc08512096ade0: /_/DistributedLock.Core/Internal/Data/MultiplexedConnectionLock.cs:173)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at My code+<ExecuteAsync>d__9.MoveNext (My Assembly, Version=7.1.2.0, Culture=neutral, PublicKeyToken=null: /_/Hosting/My Assembly/Lock/NonConcurrent/NonConcurrentLock.cs:115)
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification (System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
at My Code+<ExecuteContinuously>d__15.MoveNext (My Assembly, Version=7.1.2.0, Culture=neutral, PublicKeyToken=null: /_/Hosting/My Assembly/CronJob.cs:324)
Do you still think this is worth doing? (see below)
On hold (see below)
Given that the exceptions (there are some others as well, from another occasion, all related to db connection issues; Azure admits the DB became unavailable at that time) point to some hiccup with the db connection, I doubt that this issue has anything to do with DistributedLock but rather Microsoft.Data.SqlClient and Azure SQL Server. Do you (@hoerup and @madelson) agree with me? If so I think we can close this issue. I can always re-open it if I find something new.
@OskarKlintrot We did see some weird SQL connectivity issues in one of our microservices after upgrading to net7. We have just upgraded that service to the newly released Microsoft.Data.SqlClient 5.1.0 - and initial tests indicate that the issue is gone.
@hoerup Oh, that's very interesting! Are you also using Azure SQL perhaps? When I upgraded to the latest preview everything seemed fine for a day until the issues came back. Please let me know if it works for you! If it does we will probably also upgrade to 5.1.0.
This issue seems somewhat similar to what I am seeing, not sure if this issue is related. Is there any retry logic in DistributedLock.SqlServer btw?
@OskarKlintrot We did see some weird SQL connectivity issues in one of our microservices after upgrading to net7. We have just upgraded that service to the newly released Microsoft.Data.SqlClient 5.1.0 - and initial tests indicate that the issue is gone.
@hoerup Oh, that's very interesting! Are you also using Azure SQL perhaps? When I upgraded to the latest preview everything seemed fine for a day until the issues came back. Please let me know if it works for you! If it does we will probably also upgrade to 5.1.0.
I might have been too fast there - just saw some connection errors again :(
we are using SQL server on-prem
I might have been too fast there - just saw some connection errors again :(
Too bad!
we are using SQL server on-prem
Sounds like the issue is with Microsoft.Data.SqlClient then. Are you seeing the issues from other places than DistributedLock.SqlServer as well? We do, we see them from EF as well.
Is there any retry-logic in DistributedLock.SqlServer btw?
There isn't any retry logic. Especially once we grab the lock: if the connection dies, we can't really retry at that point. Even if we reacquire the lock, someone else could still have held it during the blip, so we want to report that as a lost handle.
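In other words, retry has to live at the caller: when the handle is lost, stop the protected work and go back through the acquire loop rather than assuming the lock is still held. A rough sketch of that shape, with illustrative names (`_lockName`, `DoProtectedWorkAsync`):

```csharp
// Sketch: run protected work only while the handle is healthy; on loss,
// abandon the current iteration and loop back to a fresh acquire.
while (!cancellationToken.IsCancellationRequested)
{
    await using var handle = await _distributedLockProvider.AcquireLockAsync(
        _lockName, cancellationToken: cancellationToken);

    // Link the work's cancellation to both shutdown and lock loss, so the
    // protected section stops promptly if the lock cannot be guaranteed.
    using var linked = CancellationTokenSource.CreateLinkedTokenSource(
        cancellationToken, handle.HandleLostToken);

    try
    {
        await DoProtectedWorkAsync(linked.Token);
    }
    catch (OperationCanceledException) when (handle.HandleLostToken.IsCancellationRequested)
    {
        // Lock was lost mid-work: someone else may have held it during the
        // blip, so treat this iteration as failed and reacquire from scratch.
    }
}
```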
I doubt that this issue has anything to do with DistributedLock but rather Microsoft.Data.SqlClient and Azure SQL Server
we see them from EF as well.
@OskarKlintrot that does seem likely if you're encountering a spike across uses of Microsoft.Data.SqlClient and not just in DistributedLock.
As a next step, I think it makes sense to report an issue on the Microsoft.Data.SqlClient github and link it here. Do you agree? Then we can close this for now while that plays out. The issue you already linked does seem related but that was closed without action. If we were able to pin down the appearance of the problem to a specific release of the client that might help it gain traction, as would creating a standalone repro that can run against Azure.
@OskarKlintrot any updates here? Did you end up filing an issue with Microsoft?
Still working on it, got sidetracked with a man cold :)
@OskarKlintrot sorry to hear that; hope you feel better soon :-)
Sorry, got buried in other work. Closing this for now and tracking the issue here. Might re-open this issue if the Microsoft.Data.SqlClient team doesn't think it has anything to do with them.
Thanks!
@hoerup will you be able to provide a minimal sample that could repro the issue?
We've seen a dramatic increase in the number of lost locks since we upgraded to .NET 7 and EF Core 7. EF Core 7 uses Microsoft.Data.SqlClient (>= 5.0.1) but EF Core 6 used Microsoft.Data.SqlClient (>= 2.1.4) (DistributedLock.SqlServer has a dependency on Microsoft.Data.SqlClient (>= 2.1.1)).
We started upgrading our services before Christmas and around the same time the number of lost locks started to increase from a few a week to 10-50 a day. The only change we have made that we figure might have an impact is upgrading from .NET 6 to .NET 7 and Microsoft.Data.SqlClient from 4.1.0 to 5.0.1.