samuel / go-zookeeper

Native ZooKeeper client for Go. This project is no longer maintained. Please use https://github.com/go-zookeeper/zk instead.
BSD 3-Clause "New" or "Revised" License

Zookeeper clients contending for distributed lock get into deadlock state during network partition failure #229

Open continuum-Nikhil-Bhide opened 4 years ago

continuum-Nikhil-Bhide commented 4 years ago

This is an edge case that causes zk clients contending for a lock to stall. It occurs under a fairly complex sequence of events in which a network partition plays the key role. Imagine two zk clients contending for the same lock:

1. One client requests the lock. The library successfully creates a sequential znode for this request under the parent path, but the response is lost because a network partition occurs at that moment.
2. The client library sees a connection-closed error (the socket is closed) while it is waiting for the list of current children under the path. That child list is what tells the library which sequence numbers are queued, since the request with the lowest sequence number acquires the lock.
3. Because of the connection error, the lock call returns without populating the lock path in the lock object, so the client keeps no record of the znode it created.
4. On the ZooKeeper server, however, the znode was created successfully and sits in the queue. A lock effectively exists, but no client knows about it.

From this point on, both clients keep trying to acquire the lock and deadlock: as far as the ZooKeeper cluster is concerned a lock holder exists, but neither client is aware of it. There is no explicit unlock and no session timeout, so the orphaned znode stays active. In the Go library the acquisition path blocks on a channel, which is where both clients end up stuck (see the sketch below).
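To make the failure window concrete, here is a minimal sketch of the client-side usage involved. The server address, lock path, and timeout are placeholders, and the comments describe the scenario reported above (znode created, then the connection drops before the children listing completes) rather than guaranteed library behaviour:

```go
package main

import (
	"log"
	"time"

	"github.com/go-zookeeper/zk"
)

func main() {
	// Connect to a ZooKeeper ensemble (address and timeout are placeholders).
	conn, _, err := zk.Connect([]string{"127.0.0.1:2181"}, 5*time.Second)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}
	defer conn.Close()

	lock := zk.NewLock(conn, "/my-lock", zk.WorldACL(zk.PermAll))

	// Lock() roughly does two things:
	//   1. creates a sequential znode under /my-lock, then
	//   2. lists the children and blocks (on a channel) until its own znode
	//      has the lowest sequence number.
	// In the reported scenario the connection drops between step 1 and
	// step 2: Lock() returns an error without recording the znode it
	// created, yet that znode can still exist on the server while the
	// session remains alive.
	if err := lock.Lock(); err != nil {
		// The orphaned znode is unknown to this client, so retrying Lock()
		// only queues another znode behind it and blocks again.
		log.Printf("lock failed: %v", err)
		return
	}
	defer lock.Unlock()

	log.Println("acquired lock, entering critical section")
}
```

With two clients running this code, both retries end up queued behind the orphaned znode, which matches the stall described above.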