python-zk / kazoo

Kazoo is a high-level Python library that makes it easier to use Apache Zookeeper.
https://kazoo.readthedocs.io
Apache License 2.0

Non-blocking lock acquisition failure can "leak" the ephemeral lock node. #732

Open bmahler opened 9 months ago

bmahler commented 9 months ago

A bit about our setup for context: We use znodes as representation of work items (typically there are hundreds of work items / znodes present), and we have many workers (e.g. 800) constantly trying to lock one of the work znodes via the Lock class. If the worker obtains the lock, it holds it, performs the work (which takes quite some time), then releases the lock. The work loop in each worker looks something like this:

while True:
  children = zk_client.get_children(path)
  shuffle(children)
  for child in children:
    # get_children() returns bare node names, so build the full path
    lock = zk_client.Lock(path + "/" + child)
    if not lock.acquire(blocking=False):
      continue
    # do work to process child
    lock.release()
    if work_finished:
      zk_client.delete(path + "/" + child)
  sleep(5)

As you can see, we stress Lock quite heavily: the typical load is something like O((800-300) idle workers * 300 znodes / 5 seconds) == 30,000 lock acquisition attempts per second. The vast majority of the time, when the lock is already held, the failed acquisition creates a temporary znode that is then deleted in the not-gotten path by calling _best_effort_cleanup(): https://github.com/python-zk/kazoo/blob/2.8.0/kazoo/recipe/lock.py#L219-L220
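For concreteness, the load estimate above works out as follows (plain Python; the variable names are just labels for the numbers already given in the description):

```python
# Back-of-the-envelope load estimate from the numbers above.
total_workers = 800
busy_workers = 300          # roughly one worker per znode being processed
work_znodes = 300
loop_period_s = 5           # the sleep(5) between passes over the znodes

idle_workers = total_workers - busy_workers
attempts_per_second = idle_workers * work_znodes / loop_period_s
print(attempts_per_second)  # 30000.0
```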

However, we have observed that on occasion, the following occurs:

  1. A worker obtains the ephemeral lock node successfully.
  2. Some time later, e.g. seconds later, a different worker creates another ephemeral lock node that never gets cleaned up! This worker received False from lock.acquire(); we know this because a message is logged whenever lock.acquire() succeeds, and in no instance of the issue do we see that message. This second worker moves on and processes subsequent runs of the loop.
  3. This second lock node never gets deleted! Presumably this is because we never experience a session expiration.

The current theory is that _best_effort_cleanup() hits a KazooException without the session expiring. Since it doesn't handle the exception, hitting one without a session expiration would leak the ephemeral lock znode in exactly the way we've observed.
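To make the theory concrete, here is a minimal stand-alone simulation of the suspected failure mode (no ZooKeeper involved; FakeClient, try_acquire, and the paths are invented stand-ins, not kazoo APIs): a contender node is created, the lock is not obtained, and the one-shot cleanup delete fails with a transient error while the session stays alive, so the ephemeral node survives.

```python
class FakeClient:
    """Toy stand-in for a ZooKeeper client; tracks nodes in a set."""

    def __init__(self):
        self.nodes = set()
        self.fail_next_delete = False

    def create(self, path):
        self.nodes.add(path)
        return path

    def delete(self, path):
        if self.fail_next_delete:
            self.fail_next_delete = False
            raise ConnectionError("transient failure, session still alive")
        self.nodes.discard(path)


def try_acquire(client, contender_path, lock_held_by_other):
    node = client.create(contender_path)
    if lock_held_by_other:
        try:
            client.delete(node)  # one-shot "best effort" cleanup
        except Exception:
            pass  # swallowed: the ephemeral node is now leaked
        return False
    return True


client = FakeClient()
client.fail_next_delete = True
got = try_acquire(client, "/locks/item-0001/contender-0000000042", True)
assert got is False
# The contender node is still there, and no session expiration will
# remove it as long as the session stays healthy.
assert "/locks/item-0001/contender-0000000042" in client.nodes
```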

The fix here seems to be to handle / retry the exception within _best_effort_cleanup(). Should this really be best effort? Alternatively, is there something we should do on our end? E.g. perhaps the deletion within _best_effort_cleanup() runs during a SUSPENDED KazooState? Let me know if additional info would be helpful.
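One possible shape for the fix, sketched as a generic retry loop (cleanup_with_retry and flaky_delete are illustrative only; inside kazoo itself this would more likely reuse the client's existing retry helper, i.e. self.client.retry):

```python
def cleanup_with_retry(delete, path, attempts=3):
    """Try to delete `path` up to `attempts` times before giving up."""
    for _ in range(attempts):
        try:
            delete(path)
            return True
        except Exception:
            continue
    return False


# Demo: a delete that fails once with a transient error, then succeeds.
state = {"nodes": {"/locks/item/contender-0000000042"}, "failures": 1}

def flaky_delete(path):
    if state["failures"] > 0:
        state["failures"] -= 1
        raise ConnectionError("transient failure")
    state["nodes"].discard(path)

ok = cleanup_with_retry(flaky_delete, "/locks/item/contender-0000000042")
assert ok and not state["nodes"]  # node cleaned up despite one failure
```

A real implementation would need to treat NoNodeError as success (the node may already be gone) rather than retrying it alongside genuinely transient errors.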

We use kazoo 2.8.0.