Acquiring a new lock randomly waits 0 or 1 second for nothing

n-a-sz commented 4 years ago

I'm using the read-write lock API of kazoo 2.6.1 and noticed it randomly takes one sec longer to acquire a free lock, while it sends and receives the very same messages to Zookeeper

I have found, that it always hits this line where it decides to sleep 0 or 1 seconds, and I cannot modify this behavior.

Sleeping 0 seconds:

2019-12-17 12:09:38.189 kazoo.client: DEBUG - Sending request(xid=30): Exists(path='/lock-test-xzy/test-resource', watcher=None)
2019-12-17 12:09:38.191 kazoo.client: DEBUG - Received response(xid=30): ZnodeStat(czxid=21475208772, mzxid=21475208772, ctime=1574778158448, mtime=1574778158448, version=0, cversion=30, aversion=0, ephemeralOwner=0, dataLength=0, numChildren=0, pzxid=25772738990)
2019-12-17 12:09:38.192 kazoo.client: DEBUG - Sending request(xid=31): Create(path='/lock-test-xzy/test-resource/4c63df41cb5d47dc8a4429c268e7ef36__rlock__', data=b'xzy', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=3)
2019-12-17 12:09:38.195 kazoo.client: DEBUG - Received response(xid=31): '/lock-test-xzy/test-resource/4c63df41cb5d47dc8a4429c268e7ef36__rlock__0000000015'
2019-12-17 12:09:38.196 kazoo.client: DEBUG - Sending request(xid=32): GetChildren(path='/lock-test-xzy/test-resource', watcher=None)
2019-12-17 12:09:38.198 kazoo.client: DEBUG - Received response(xid=32): ['4c63df41cb5d47dc8a4429c268e7ef36__rlock__0000000015']
exception happened: <class 'kazoo.retry.ForceRetryError'> [log after catching the exception here](https://github.com/python-zk/kazoo/blob/master/kazoo/retry.py#L132)
going to sleep 0 [log before sleeping here](https://github.com/python-zk/kazoo/blob/master/kazoo/retry.py#L156)
2019-12-17 12:09:38.199 kazoo.client: DEBUG - Sending request(xid=33): GetChildren(path='/lock-test-xzy/test-resource', watcher=None)
2019-12-17 12:09:38.200 kazoo.client: DEBUG - Received response(xid=33): ['4c63df41cb5d47dc8a4429c268e7ef36__rlock__0000000015']
2019-12-17 12:09:38.201 kazoo.client: DEBUG - Sending request(xid=34): GetChildren(path='/lock-test-xzy/test-resource', watcher=None)
2019-12-17 12:09:38.202 kazoo.client: DEBUG - Received response(xid=34): ['4c63df41cb5d47dc8a4429c268e7ef36__rlock__0000000015']

Sleeping 1 seconds:

2019-12-17 12:19:10.944 kazoo.client: DEBUG - Sending request(xid=37): Exists(path='/lock-test-xzy/test-resource', watcher=None)
2019-12-17 12:19:10.945 kazoo.client: DEBUG - Received response(xid=37): ZnodeStat(czxid=21475208772, mzxid=21475208772, ctime=1574778158448, mtime=1574778158448, version=0, cversion=32, aversion=0, ephemeralOwner=0, dataLength=0, numChildren=0, pzxid=25772738998)
2019-12-17 12:19:10.946 kazoo.client: DEBUG - Sending request(xid=38): Create(path='/lock-test-xzy/test-resource/9eb4cf7e323b4d2582b5e45e5654ea0e__rlock__', data=b'xzy', acl=[ACL(perms=31, acl_list=['ALL'], id=Id(scheme='world', id='anyone'))], flags=3)
2019-12-17 12:19:10.950 kazoo.client: DEBUG - Received response(xid=38): '/lock-test-xzy/test-resource/9eb4cf7e323b4d2582b5e45e5654ea0e__rlock__0000000016'
2019-12-17 12:19:10.951 kazoo.client: DEBUG - Sending request(xid=39): GetChildren(path='/lock-test-xzy/test-resource', watcher=None)
2019-12-17 12:19:10.952 kazoo.client: DEBUG - Received response(xid=39): ['9eb4cf7e323b4d2582b5e45e5654ea0e__rlock__0000000016']
exception happened: <class 'kazoo.retry.ForceRetryError'> [log after catching the exception here](https://github.com/python-zk/kazoo/blob/master/kazoo/retry.py#L132)
going to sleep 1 [log before sleeping here](https://github.com/python-zk/kazoo/blob/master/kazoo/retry.py#L156)
2019-12-17 12:19:11.954 kazoo.client: DEBUG - Sending request(xid=40): GetChildren(path='/lock-test-xzy/test-resource', watcher=None)
2019-12-17 12:19:11.956 kazoo.client: DEBUG - Received response(xid=40): ['9eb4cf7e323b4d2582b5e45e5654ea0e__rlock__0000000016']
2019-12-17 12:19:11.956 kazoo.client: DEBUG - Sending request(xid=41): GetChildren(path='/lock-test-xzy/test-resource', watcher=None)
2019-12-17 12:19:11.958 kazoo.client: DEBUG - Received response(xid=41): ['9eb4cf7e323b4d2582b5e45e5654ea0e__rlock__0000000016']

As you can see, there is no difference in the communication, only in sleep time. It looks like a bug to me. Also, I cannot understand why it sends the very same GetChildren request 3 times? Is it because I have a three node cluster?

StephenSorriaux commented 4 years ago

Hello,

This is due to the retry. It seems to me this should be solved by https://github.com/python-zk/kazoo/pull/578 where the only whole seconds while retrying was identified. Can you share a snippet?

Thanks.

n-a-sz commented 4 years ago

Hi StephenSorriaux, thanks for the fast reply.

I have further debugged it and could reproduce it by not adding a starting slash to the path. In this case, it adds a starting slash to the path, but it cannot find the node first, but will retry again and will find it on the second time after waiting a bit.

Here is the code to reproduce:

import time
from kazoo.client import KazooClient
zk = KazooClient(hosts=ZOO_KEEPER_HOSTS)
zk.start()
times = []
for i in range(0,5):
    # read_lock = zk.ReadLock('/lock-test-xyz/test-resource', 'xzy')  # works without problem
    read_lock = zk.ReadLock('lock-test-xyz/test-resource', 'xzy')  # this line throws an exception inside, but works after kazoo retries
    start = time.time()
    read_lock.acquire()
    times.append(time.time() - start)
    read_lock.release()
zk.stop()
zk.close()
print(times)

Annotating this line with prints, that throws the exception, I got this:

With starting slash:

getting index for 8c148f894f394fb3abe2f69f79a3a1e3__rlock__0000000076
children: ['8c148f894f394fb3abe2f69f79a3a1e3__rlock__0000000076']

Without starting slash:

getting index for /d6895f8e89214abfb98c180e60473af8__rlock__0000000081
children: ['d6895f8e89214abfb98c180e60473af8__rlock__0000000081']
... waiting 0 or 1 sec
getting index for d6895f8e89214abfb98c180e60473af8__rlock__0000000081
children: ['d6895f8e89214abfb98c180e60473af8__rlock__0000000081']

So it's probably my bad not giving a starting slash. But without it does it make any sense? If not, kazoo could add it for me.

python-zk / kazoo

Acquiring a new lock randomly waits 0 or 1 second for nothing #581