madelson / DistributedLock

A .NET library for distributed synchronization
MIT License

ZooKeeperNetEx connection loss issue: an acquired lock seems not to be released #156

Open MattGhafouri opened 1 year ago

MattGhafouri commented 1 year ago

I've implemented a lock with ZooKeeper with this configuration:

  1. DistributedLock.ZooKeeper - Version="1.0.0"
  2. dotnet version 6.0
  3. Hosted on K8s (one pod, so there are no concurrent requests)
  4. ZooKeeper server configuration on K8s:

version: "3.9" services: zk1: container_name: zk1 hostname: zk1 image: bitnami/zookeeper:3.8.0-debian-11-r57 ports:

  • 2181:2181 environment:
  • ALLOW_ANONYMOUS_LOGIN=yes
  • ZOO_SERVER_ID=1
  • ZOO_SERVERS=0.0.0.0:2888:3888
  • ZOO_MAX_CLIENT_CNXNS=500

There are several worker services inside the application, each working with a different lock key. Periodically each one tries to acquire its lock and do some processing. They seem to work without problems, but after a while I get this exception:

Locking failed. Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown. org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.

It seems the lock cannot be acquired because it has not been released, although there is no concurrent request for the lock key.

The LockService code in .NET:

    private TimeSpan _connectionTimeoutInSecond = TimeSpan.FromSeconds(30);
    private TimeSpan _waitingForLockInSecond = TimeSpan.FromSeconds(30);

    public async Task<LockProcessResult> DoActionWithLockAsync(string lockKey, Func<Task> func)
    {
        var processResult = new LockProcessResult();
        try
        {
            var @lock = new ZooKeeperDistributedLock(lockKey, _configuration.ConnectionString, opt =>
            {
                opt.ConnectTimeout(_connectionTimeoutInSecond);
            });

            await using (var handle = await @lock.TryAcquireAsync(timeout: _waitingForLockInSecond))
            {
                if (handle != null)
                {
                    // I have the lock
                    await func();
                }
                else
                {
                    processResult.SetException(new LockAcquisitionFailedException(lockKey));
                }
            }
        }
        catch (Exception ex)
        {
            // I got the exceptions here
            processResult.SetException(ex);
        }

        return processResult;
    }
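
Each worker calls DoActionWithLockAsync on a timer, roughly like the sketch below (the worker and method names here are illustrative, not the real code):

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Extensions.Hosting;

    // Illustrative sketch of the calling pattern: one BackgroundService per lock key,
    // periodically running its work under the distributed lock via the service above.
    public class SampleWorker : BackgroundService
    {
        private readonly LockService _lockService; // the class containing DoActionWithLockAsync above

        public SampleWorker(LockService lockService) => _lockService = lockService;

        protected override async Task ExecuteAsync(CancellationToken stoppingToken)
        {
            while (!stoppingToken.IsCancellationRequested)
            {
                // each worker uses its own lock key, so there should be no contention between workers
                await _lockService.DoActionWithLockAsync("sample-worker-lock-key", DoWorkAsync);
                await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
            }
        }

        private Task DoWorkAsync() => Task.CompletedTask; // placeholder for the real periodic work
    }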

I'd appreciate any suggestions.

hoerup commented 1 year ago

Have you checked your zookeeper ? Made sure that there aren't any pod/container restarts - and checked its log for indications on why it would drop a connection ?

MattGhafouri commented 1 year ago

Have you checked your zookeeper ? Made sure that there aren't any pod/container restarts - and checked its log for indications on why it would drop a connection ?

Thanks, I'll check it.

madelson commented 1 year ago

@MajeQafouri

Have you checked your zookeeper ? Made sure that there aren't any pod/container restarts - and checked its log for indications on why it would drop a connection ?

+1

It seems the lock cannot be acquired because it has not been released

What makes you think this is the issue? The exception you reported mentions connection loss.

after a while, I get this exception

Can you please report the full exception.ToString() including the stack trace? This will allow us to see where inside DistributedLock it is hitting the exception.

MattGhafouri commented 1 year ago

This is the stack trace:

    Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
    org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
       at org.apache.zookeeper.ZooKeeper.createAsync(String path, Byte[] data, List`1 acl, CreateMode createMode) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 535
       at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperNodeCreator.cs:line 38
       at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperNodeCreator.cs:line 54
       at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 70
       at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 132
       at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperDistributedLock.cs:line 97

madelson commented 1 year ago

@MajeQafouri looks like a zookeeper issue. Did you try checking the logs that @hoerup suggested?

With something like this I would expect that the application might hit some errors but would ultimately recover, so long as you were catching those errors (e.g. in a web app you might have some failed requests due to the lost connection, but later on a new connection would be established and it would work again). What are you finding?

MattGhafouri commented 1 year ago

Actually, after hosting on different environments, I can say it is related to the client hosting environment. Let me make it clear:

  • Client host env: local Docker Desktop (Win 10 x64); ZooKeeper: hosted on k8s; Result: facing the same exception once in a while
  • Client host env: running in debug mode in Visual Studio 2022 (Win 10 x64); ZooKeeper: hosted on k8s; Result: working without problems
  • Client host env: hosted on k8s; ZooKeeper: hosted on k8s; Result: facing the same exception once in a while

But I still don't know what the issue is or how to fix it.

MattGhafouri commented 1 year ago

@MajeQafouri looks like a zookeeper issue. Did you try checking the logs that @hoerup suggested?

With something like this I would expect that the application might hit some errors but would ultimately recover so long as you were catching those errors (e. g. in a web app you might have some failed requests due to the lost connection but later on a new connection would be established and it would work again). What are you finding?

It seems that in case of an exception the acquired lock is not released, and even on subsequent tries it cannot be acquired.

MattGhafouri commented 1 year ago

Have you checked your zookeeper ? Made sure that there aren't any pod/container restarts - and checked its log for indications on why it would drop a connection ?

Given the different client hosting environments I described, I can say it's not related to pod/container restarts.

MattGhafouri commented 1 year ago

@MajeQafouri looks like a zookeeper issue. Did you try checking the logs that @hoerup suggested?

With something like this I would expect that the application might hit some errors but would ultimately recover so long as you were catching those errors (e. g. in a web app you might have some failed requests due to the lost connection but later on a new connection would be established and it would work again). What are you finding?

I forgot to mention that I don't receive requests from outside; my application is a web application with more than 50 workers, each trying to do some work every five seconds (with a different lock key specific to each worker).

Is it possible the issue relates to the number of TCP sockets we are opening for each request? Maybe that's why it works without issue on Windows (Visual Studio) but starts facing issues once I containerize the application.
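
For reference, right now a new ZooKeeperDistributedLock is constructed on every call. An alternative I could try (just a sketch, assuming the ZooKeeperDistributedSynchronizationProvider type that ships with DistributedLock.ZooKeeper; whether it actually changes socket usage depends on the library's internal connection pooling) would be to create the locks from one shared provider:

    using System;
    using Medallion.Threading;
    using Medallion.Threading.ZooKeeper;

    // Sketch: register one provider for the whole app and create locks from it,
    // instead of constructing a new ZooKeeperDistributedLock on every call.
    public static class SharedLockProvider
    {
        public static IDistributedLockProvider Create(string connectionString) =>
            new ZooKeeperDistributedSynchronizationProvider(
                connectionString,
                opt => opt.ConnectTimeout(TimeSpan.FromSeconds(30)));
    }

    // usage inside DoActionWithLockAsync:
    //   var @lock = _provider.CreateLock(lockKey);
    //   await using var handle = await @lock.TryAcquireAsync(timeout: _waitingForLockInSecond);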

madelson commented 1 year ago

Is it possible the issue relates to the number of TCP sockets we are opening for each request?

This is possible, but I really would recommend pulling the logs from the zookeeper side and seeing what zookeeper thinks is the reason for connection loss (e.g. is the client timing out, too many connections, etc).

hoerup commented 1 year ago

@MajeQafouri Have you seen this one https://github.com/shayhatsor/zookeeper/issues/45

MattGhafouri commented 1 year ago

@MajeQafouri Have you seen this one shayhatsor/zookeeper#45

Thanks for mentioning the link; I believe the root of the issue is the same. But do you have any solution for it?

MattGhafouri commented 1 year ago

I tried using the ZooKeeper library directly (https://www.nuget.org/profiles/shayhatsor2) and faced the same issue in the dockerized environment. I then traced the NuGet package code and got an exception in the WriteLock method. Maybe the author can help us @shayhatsor

    WARNING WriteLock Caught: org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
       at org.apache.zookeeper.ZooKeeper.deleteAsync(String path, Int32 version) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 587
       at org.apache.zookeeper.recipes.lock.WriteLock.DeleteNode.execute() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 120
       at org.apache.zookeeper.recipes.lock.WriteLock.unlock() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 85
    Exc level 0: org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
       at org.apache.zookeeper.ZooKeeper.deleteAsync(String path, Int32 version) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 587
       at org.apache.zookeeper.recipes.lock.WriteLock.DeleteNode.execute() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 120
       at org.apache.zookeeper.recipes.lock.WriteLock.unlock() in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx.Recipes\Lock\WriteLock.cs:line 85

madelson commented 1 year ago

@MajeQafouri I'm looking into whether we can work around this problem within DistributedLock itself.

madelson commented 1 year ago

@MajeQafouri I've published a prerelease version of the Medallion.ZooKeeper package here: https://www.nuget.org/packages/DistributedLock.ZooKeeper/1.0.1-alpha001

This replaces the ZooKeeperNetEx dependency with my initial attempt at resolving the issue (I haven't reproduced the issue locally so we'll see how it goes).

Can you pull down the prerelease and let me know whether that resolves your issue? Thanks!

madelson commented 1 year ago

@MajeQafouri did you get a chance to test out the prerelease package?

devlnull commented 1 year ago

Same issue here! Actually I just faced this issue with 3 pods trying to use ZooKeeper in k8s. I'll try the alpha package and give feedback here.

MattGhafouri commented 1 year ago

@MajeQafouri did you get a chance to test out the prerelease package?

Sorry for the late answer. Unfortunately I'm packed these days and couldn't test the new package; BTW, thanks for the effort. As soon as I manage to test it, I'll keep you posted.

devlnull commented 1 year ago

Hi, we could finally test the alpha package. I don't know if it helps, but we've tested it and compared it with the old package, and the result was acceptable: no connection loss anymore. Good luck.

madelson commented 1 year ago

@devlnull excellent. Thanks for testing! Feel free to use the prerelease version for now and keep me posted on any issues you encounter. The only change is the ZooKeeperNetEx alternative.

If @MajeQafouri also comes back with a good test result I'll publish it as a stable version.

MattGhafouri commented 1 year ago

Finally, I tested the new alpha version. Overall it's much better, with just a few connection loss exceptions. The hosting environment:

Client host env: hosted on k8s (3 pods); ZooKeeper: hosted on k8s (1 pod); Result: facing some connection loss exceptions rarely.

The exception stack trace:

    .ZookeeperLockProviderService[0] ---------Locking failed. Key:MyKeyName--. Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
    org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
       at org.apache.zookeeper.ZooKeeper.createAsync(String path, Byte[] data, List`1 acl, CreateMode createMode) in C:\dev\zookeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 542
       at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 72
    --- End of stack trace from previous location ---
       at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 64
    --- End of stack trace from previous location ---
       at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 27
       at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 65
       at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 127
       at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in /_/DistributedLock.ZooKeeper/ZooKeeperDistributedLock.cs:line 91
       at .ZookeeperLockProviderService.DoActionWithLockAsync[TParameter](String lockKey, TParameter parameter, Func`2 func) in /src/src/Infrastructure//ZookeeperLockProviderService.cs:line 113
    .ZookeeperLockProviderService[0]

I hope it would be helpful.

madelson commented 1 year ago

Thanks @MajeQafouri !

For these few remaining connection losses, do they also appear in the Zookeeper logs or are they still “phantom connection losses”? Does the application recover after this happens or no?

MattGhafouri commented 1 year ago

In the ZooKeeper log we don't get exception logs, but on the app side we have them, and the application seems to recover itself after the connection loss exception (actually after a SessionExpiredException).

Again, I ran the app on k8s with 3 pods and the ZooKeeper service on one pod. I'm sending you the app logs as well as the ZooKeeper log; you can also check the acquire and release logs, which include the exceptions. I hope it's helpful.

MyApp_Pod1_Logs.txt MyApp_Pod2_Logs.txt MyApp_Pod3_Logs.txt logs-from-zookeeper.txt

madelson commented 1 year ago

@MajeQafouri I do see a disconnect/session expired log in the zookeeper log, though. Could that be related?

MattGhafouri commented 1 year ago

I hardly believe so, but I couldn't figure out why we get this error once in a while in the ZooKeeper console: "Connection reset by peer"

devlnull commented 1 year ago

Unfortunately we are getting Connection Loss sometimes, but it will be gone in a minute.

Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.

madelson commented 1 year ago

@devlnull are you seeing "connection reset by peer"/"connection expired" in the zookeeper logs like @MajeQafouri is?

devlnull commented 1 year ago

@madelson Actually I don't have access to the ZooKeeper logs yet, but I will soon. For now I have this stack trace info; it might help.

   at org.apache.zookeeper.ZooKeeper.createAsync(String path, Byte[] data, List`1 acl, CreateMode createMode) in C:\dev\zookeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 542
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 72
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.<>c__DisplayClass1_0.<<CreateEphemeralSequentialNode>g__EnsureDirectoryAndGetCreatedPathsAsync|0>d.MoveNext() in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 64
--- End of stack trace from previous location ---
   at Medallion.Threading.ZooKeeper.ZooKeeperNodeCreator.CreateEphemeralSequentialNode(ZooKeeperConnection connection, ZooKeeperPath directory, String namePrefix, IEnumerable`1 aclEnumerable, Boolean ensureDirectoryExists) in /_/DistributedLock.ZooKeeper/ZooKeeperNodeCreator.cs:line 27
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 65
   at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in /_/DistributedLock.ZooKeeper/ZooKeeperSynchronizationHelper.cs:line 127
   at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in /_/DistributedLock.ZooKeeper/ZooKeeperDistributedLock.cs:line 91
   at Medallion.Threading.Internal.DistributedLockHelpers.ThrowTimeoutIfNull[T](ValueTask`1 task, String object) in /_/DistributedLock.Core/Internal/DistributedLockHelpers.cs:line 142
   at Medallion.Threading.Internal.Helpers.Convert[TDerived,TBase](ValueTask`1 task, ValueTaskConversion _) in /_/DistributedLock.Core/Internal/Helpers.cs:line 24
madelson commented 1 year ago

@devlnull @MajeQafouri I thought it might make sense for me to add some additional verbose logging to the underlying ZooKeeper .NET package; then you could test your apps with the additional logging and we can see if that helps point out the root cause of the issue. Would either/both of you be willing to test in this way?

MattGhafouri commented 1 year ago

@devlnull @MajeQafouri I thought it might make sense for me to add some additional verbose logging to the underlying ZooKeeper .NET package; then you could test your apps with the additional logging and we can see if that helps point out the root cause of the issue. Would either/both of you be willing to test in this way?

Good idea, I'll test it

madelson commented 1 year ago

@MajeQafouri ok I just published the version with logging. Here's how to use it:

In your csproj, add the debug version like so:

<PackageReference Include="Medallion.ZooKeeperNetEx" Version="1.0.0-debug001" />

Then you can link the library's logging to your logging system via the DebugLog class:

ZooKeeperNetEx.DebugLog.LogLevel = TraceLevel.Verbose; // to enable full verbose logging
ZooKeeperNetEx.DebugLog.LogMessageEmitted += message => yourLogger.Log(message);

Let me know if you have any trouble getting it working!

MattGhafouri commented 1 year ago

@madelson I tried the debug package and set the LogLevel as you said, but I'm not sure whether I've set it up properly or not; I'm still receiving the ConnectionLossException.

    org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
       at org.apache.zookeeper.ZooKeeper.existsAsync(String path, Watcher watcher) in D:\ZooKeeper\src\csharp\src\ZooKeeperNetEx\zookeeper\ZooKeeper.cs:line 745
       at Medallion.Threading.ZooKeeper.ZooKeeperSequentialPathHelper.GetNodeCreationTimeAsync(ZooKeeperConnection connection, String path) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSequentialPathHelper.cs:line 145
       at Medallion.Threading.ZooKeeper.ZooKeeperSequentialPathHelper.FilterAndSortAsync(String parentNode, IEnumerable`1 childrenNames, Func`2 getNodeCreationTimeAsync, String prefix, String alternatePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSequentialPathHelper.cs:line 59
       at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 76
       at Medallion.Threading.ZooKeeper.ZooKeeperSynchronizationHelper.TryAcquireAsync(Func`2 hasAcquired, Func`4 waitAsync, TimeoutValue timeout, CancellationToken cancellationToken, String nodePrefix, String alternateNodePrefix) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperSynchronizationHelper.cs:line 132
       at Medallion.Threading.ZooKeeper.ZooKeeperDistributedLock.Medallion.Threading.Internal.IInternalDistributedLock<Medallion.Threading.ZooKeeper.ZooKeeperDistributedLockHandle>.InternalTryAcquireAsync(TimeoutValue timeout, CancellationToken cancellationToken) in C:\Users\mikea_000\Documents\Interests\CS\DistributedLock\DistributedLock.ZooKeeper\ZooKeeperDistributedLock.cs:line 97
       at RGS.Jackpot.Infrastrucure.Common.ZookeeperLockProviderService.DoActionWithLockAsync(String lockKey, Func`1 func) in /src/src/Infrastructure/RGS.Jackpot.Infrastrucure.Common/ZookeeperLockProviderService.cs:line 71

Even with one instance, meaning there is no concurrent request for the resource key. Let me know if I'm doing something wrong.

madelson commented 1 year ago

@MajeQafouri sorry if I wasn't clear. The new build isn't attempting to fix the issue, it's trying to give us more useful logs to potentially diagnose what is going on.

ZooKeeperNetEx.DebugLog.LogMessageEmitted += message => yourLogger.Log(message);

This line pipes the zookeeper logs into whatever logging system you have. You can then grab those logs and investigate and/or post them here.

MattGhafouri commented 1 year ago

@madelson I believe I did; please check the log settings I've configured:

    builder.Logging.ClearProviders();
    builder.Logging.SetMinimumLevel(LogLevel.Trace);

    ZooKeeperNetEx.DebugLog.LogLevel = TraceLevel.Verbose; // to enable full verbose logging
    builder.Host.UseNLog();

    ZooKeeperNetEx.DebugLog.LogMessageEmitted += message => logger.Log(LogLevel.Trace, message);

Is it right?

madelson commented 1 year ago

@MajeQafouri that code looks reasonable to me. Are you getting any logs from this? If so can you share?

MattGhafouri commented 1 year ago

No, that's the only type of exception and log detail which I receive.

madelson commented 1 year ago

@MajeQafouri that's strange. I see that you're currently logging to your own logger using LogLevel.Trace. Is it possible you're not enabling trace logs?

Could you set a break point inside the LogMessageEmitted event handler and see if it is getting called?

madelson commented 1 year ago

@MajeQafouri any luck getting the logs? I just did a quick test like this:

        ZooKeeperNetEx.DebugLog.LogLevel = System.Diagnostics.TraceLevel.Verbose;
        ZooKeeperNetEx.DebugLog.LogMessageEmitted += m => Console.WriteLine(m);
        var @lock = new ZooKeeperDistributedLock(Guid.NewGuid().ToString(), connectionString);
        await using (var handle = await @lock.AcquireAsync())
        {
        }

And I get logs like this:

Info ZooKeeper Initiating client connection, connectString=127.0.0.1:2181 sessionTimeout=20000 watcher=Medallion.Threading.ZooKeeper.ZooKeeperConnection+ConnectionWatcher sessionId=0 sessionPasswd=<hidden>
Verbose DynamicHostProvider Trying to resolve at least one IP from hosts:{127.0.0.1:2181}
Verbose DynamicHostProvider Resolving Hosts={127.0.0.1:2181}
Verbose DynamicHostProvider Resolved Hosts={127.0.0.1:2181} to {{127.0.0.1:2181}}
Info ClientCnxn Opening socket connection to server {127.0.0.1:2181}
...
MattGhafouri commented 1 year ago

Finally, I managed to hook up the logger, but at the Verbose log level there wasn't any detail about the exception. I got just two types of logs:

    response::org.apache.zookeeper.proto.CreateResponse
    Verbose ClientCnxn Reading reply sessionid:0x1087668dd3f000e, packet:: clientPath: serverPath: finished:False header::org.apache.zookeeper.proto.RequestHeader replyHeader::org.apache.zookeeper.proto.ReplyHeader request::org.apache.zookeeper.proto.CreateRequest response::org.apache.zookeeper.proto.CreateResponse
    Verbose ClientCnxn Reading reply sessionid:0x1087668dd3f000e, packet:: clientPath: serverPath: finished:False header::org.apache.zookeeper.proto.RequestHeader replyHeader::org.apache.zookeeper.proto.ReplyHeader request::org.apache.zookeeper.proto.CreateRequest

or

    Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
    Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
    Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
    Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
    Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
    Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
    Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
    Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True
    Verbose SocketContext State=0, LastOp=Receive, Error=Success, Available=16, Connected=True

madelson commented 1 year ago

@MajeQafouri actually this is what I'm looking for. Any chance you can upload a log file?

MattGhafouri commented 1 year ago

Got it, but I'm receiving just these two types of logs, not errors (I mean inside the LogMessageEmitted event). Inside the application, however, the locks sometimes cannot be acquired, and the failure rate is quite high compared to the other distributed locks I'm using, like Redis.

madelson commented 1 year ago

but I'm receiving just these two types of logs, not errors

So do the logs you posted account for 100% of the unique lines? That would surprise me since my test generated some different log types. Note that I'm not just interested in error logs, I want to see if we can understand what is happening on your system.

inside the application sometimes the locks cannot be acquired

As an experiment, can you try something like this to see if it improves reliability? Again, I'm just trying to understand where the problem lies:

    // use this routine to acquire your ZooKeeper locks instead of calling TryAcquireAsync() directly
    public static async ValueTask<ZooKeeperDistributedLockHandle?> TryAcquireAsyncWithRetry(
        ZooKeeperDistributedLock @lock,
        TimeSpan? timeout = null,
        CancellationToken cancellationToken = default)
    {
        var retriesRemaining = 3;
        while (true)
        {
            try
            {
                return await @lock.TryAcquireAsync(timeout, cancellationToken);
            }
            catch (org.apache.zookeeper.KeeperException.ConnectionLossException)
                when (retriesRemaining > 0)
            {
                // maybe issue a log here that we're retrying?
                --retriesRemaining;
            }
        }
    }
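
For example, this is roughly how the helper would slot into the DoActionWithLockAsync method posted earlier (a sketch only, reusing the names from that snippet):

    var @lock = new ZooKeeperDistributedLock(lockKey, _configuration.ConnectionString, opt =>
    {
        opt.ConnectTimeout(_connectionTimeoutInSecond);
    });

    // retry acquisition on connection loss instead of calling TryAcquireAsync() directly
    await using (var handle = await TryAcquireAsyncWithRetry(@lock, timeout: _waitingForLockInSecond))
    {
        if (handle != null)
        {
            await func();
        }
        else
        {
            processResult.SetException(new LockAcquisitionFailedException(lockKey));
        }
    }
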
MattGhafouri commented 1 year ago

No problem, I'll follow that approach, but unfortunately I don't have access to the server for now. I'll keep you posted as soon as I apply the changes.

madelson commented 1 year ago

@MajeQafouri / @devlnull any updates here? Any success with collecting logs or using retry?

MattGhafouri commented 1 year ago

Hey, I'm afraid not yet.

dansuse commented 1 year ago

Hi @madelson, I am facing the same issue, even after using https://www.nuget.org/packages/DistributedLock.ZooKeeper/1.0.1-alpha001. Is it possible to open-source the code for the above NuGet package? This is the stack trace from my application:

[stack-trace screenshot attached]
madelson commented 1 year ago

Is it possible to open source the code for above nuget package?

@dansuse yes this code is all available; I'd love to have more eyes on it/contributors since I don't have an environment where I can reproduce this.

The only change within DistributedLock itself to make the alpha package is swapping out the reference to ZooKeeperNetEx for my fork of that package:

-    <PackageReference Include="ZooKeeperNetEx" Version="3.4.12.4" />
+    <PackageReference Include="Medallion.ZooKeeperNetEx" Version="1.0.0-debug001" />

The changes I've made on my fork are on this branch. You can see the diff here. The only real change is in SocketContext.cs.

If you have a chance to play around and have any thoughts on how to further improve the fork, I'd love to collaborate on this.

One thing I saw in the ZooKeeper docs is:

You must test ZooKeeper server failures. The ZooKeeper service can survive failures as long as a majority of servers are active. The question to ask is: can your application handle it? In the real world a client's connection to ZooKeeper can break. (ZooKeeper server failures and network partitions are common reasons for connection loss.) The ZooKeeper client library takes care of recovering your connection and letting you know what happened, but you must make sure that you recover your state and any outstanding requests that failed. Find out if you got it right in the test lab, not in production - test with a ZooKeeper service made up of a several of servers and subject them to reboots.

I wonder if this implies that we should be catching this connection loss exception and just waiting for the client to recover (but for how long?). Without the ability to reproduce this easily in my environment it's been hard for me to validate. I also suspect there is more to it, since people didn't start facing issues with this until more recent versions of .NET, and it seems like people who are using the ZooKeeperNetEx package directly (vs through DistributedLock) are also struggling with this...
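
As a rough illustration of that catch-and-wait idea (not something the library does today; the one-second delay and the overall recovery window below are arbitrary), a caller could treat ConnectionLossException as transient and keep retrying until a deadline, giving the client time to re-establish its session:

    using System;
    using System.Threading;
    using System.Threading.Tasks;
    using Medallion.Threading.ZooKeeper;

    public static class ZooKeeperLockRecovery
    {
        // Illustration only: retry on connection loss until an overall deadline,
        // so a transient drop doesn't immediately surface as a failed acquisition.
        public static async ValueTask<ZooKeeperDistributedLockHandle?> TryAcquireWaitingForRecovery(
            ZooKeeperDistributedLock @lock,
            TimeSpan recoveryWindow,
            TimeSpan? timeout = null,
            CancellationToken cancellationToken = default)
        {
            var deadline = DateTime.UtcNow + recoveryWindow;
            while (true)
            {
                try
                {
                    return await @lock.TryAcquireAsync(timeout, cancellationToken);
                }
                catch (org.apache.zookeeper.KeeperException.ConnectionLossException)
                    when (DateTime.UtcNow < deadline)
                {
                    await Task.Delay(TimeSpan.FromSeconds(1), cancellationToken);
                }
            }
        }
    }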

Jetski5822 commented 1 year ago

@madelson we have just tested your change locally and in a K8s cluster, and that code change fixed the issue. Could you open a PR for this change against the main repo?

madelson commented 12 months ago

@Jetski5822 I filed https://github.com/shayhatsor/zookeeper/pull/54 , but the maintainer has not been very active so I don't anticipate it being merged. If I were to publish a stable version of my Medallion.ZooKeeperNetEx fork and released a stable version of DistributedLock.ZooKeeper which used that would that be valuable to you?

madelson commented 12 months ago

Has anyone having these issues tried the https://www.nuget.org/packages/ZooKeeper.Net package? Seems like another variant on ZooKeeperNetEx that was published a bit more recently.