madelson / DistributedLock

A .NET library for distributed synchronization
MIT License

ZooKeeperNetEx connection loss issue: an acquired lock does not seem to be released #156

Open MattGhafouri opened 1 year ago

MattGhafouri commented 1 year ago

I've implemented a lock with ZooKeeper using this configuration:

  1. DistributedLock.ZooKeeper - Version="1.0.0"
  2. dotnet version 6.0
  3. Hosted on K8s (one pod, there is no concurrent request)
  4. ZooKeeper server configuration on K8s:

    version: "3.9"
    services:
      zk1:
        container_name: zk1
        hostname: zk1
        image: bitnami/zookeeper:3.8.0-debian-11-r57
        ports:
          - 2181:2181
        environment:
          - ALLOW_ANONYMOUS_LOGIN=yes
          - ZOO_SERVER_ID=1
          - ZOO_SERVERS=0.0.0.0:2888:3888
          - ZOO_MAX_CLIENT_CNXNS=500

There are several worker services inside the application, each working with a different lock key. Periodically, each one tries to acquire its lock and do some processing. They seem to work without problems, but after a while I get this exception:

    Locking failed.Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.
    org.apache.zookeeper.KeeperException+ConnectionLossException: Exception of type 'org.apache.zookeeper.KeeperException+ConnectionLossException' was thrown.

It seems the lock cannot be acquired because it has not been released, although there is no concurrent request for the lock key.

The LockService code in .NET:

    private TimeSpan _connectionTimeoutInSecond = TimeSpan.FromSeconds(30);
    private TimeSpan _waitingForLockInSecond = TimeSpan.FromSeconds(30);

    public async Task<LockProcessResult> DoActionWithLockAsync(string lockKey, Func<Task> func)
    {
        var processResult = new LockProcessResult();
        try
        {
            var @lock = new ZooKeeperDistributedLock(lockKey, _configuration.ConnectionString, opt =>
            {
                opt.ConnectTimeout(_connectionTimeoutInSecond);
            });

            await using (var handle = await @lock.TryAcquireAsync(timeout: _waitingForLockInSecond))
            {
                if (handle != null)
                {
                    // I have the lock
                    await func();
                }
                else
                {
                    processResult.SetException(new LockAcquisitionFailedException(lockKey));
                }
            }
        }
        catch (Exception ex)
        {
            // I got the exceptions here
            processResult.SetException(ex);
        }

        return processResult;
    }

I'd appreciate any suggestions.

BoutemineOualid commented 7 months ago

Same issue here, running ZooKeeper in a Docker container. The alpha release seems to have fixed it.

madelson commented 6 months ago

The Vostok.ZooKeeper.Client package was last published in 2022 and has a decent number of downloads. Has anyone tried it?

For context, while I can move forward with my patched fork of ZooKeeperNetEx, I'd love to find a higher-quality alternative to rely on going forward, since I'm not in a position to maintain a ZooKeeper client.

Jetski5822 commented 6 months ago

That uses ZooKeeperNetEx under the hood too :(

madelson commented 6 months ago

I think that to really solve this we need to handle ConnectionLoss explicitly, like they do here and as described here.

Perhaps with that fully in place we won't even need the ZooKeeperNetEx patch.
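
To illustrate what handling connection loss explicitly could look like at the caller level, here is a minimal sketch of a generic retry helper. It is not the library's or ZooKeeperNetEx's implementation: `TransientRetry`, `RunAsync`, and all of its parameters are hypothetical names introduced for this example, and the fixed-delay policy is an assumption. A blind retry like this is only safe for idempotent operations; retrying the creation of a sequential lock node after a connection loss needs extra care, because the node may already have been created on the server.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not part of DistributedLock or ZooKeeperNetEx).
public static class TransientRetry
{
    // Runs `operation`, retrying while `isTransient(ex)` returns true,
    // for at most `maxAttempts` total attempts with a fixed delay between them.
    public static async Task<T> RunAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts,
        TimeSpan delay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation().ConfigureAwait(false);
            }
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                // Transient failure with attempts remaining: wait, then try again.
                // On the last attempt the filter is false, so the exception propagates.
                await Task.Delay(delay).ConfigureAwait(false);
            }
        }
    }
}
```

With the service code from the original report, one could presumably pass `() => @lock.TryAcquireAsync(timeout: _waitingForLockInSecond).AsTask()` as the operation and `ex => ex is KeeperException.ConnectionLossException` as the predicate. Whether that is enough depends on how the underlying client recovers its session, which this sketch does not address.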

Jetski5822 commented 4 months ago

@madelson Where I work, we have a custom ZK build which fixes this and also takes in some of the code you use to manage the ZooKeeper instance. Given our potential reliance on ZK, it might be worth us publishing it. I'll have a chat internally and see if that's possible; then you could rely on that.