hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Need a better way to handle lock acquisition failure #2767

Open jyoon17 opened 7 years ago

jyoon17 commented 7 years ago

Please see #1008 (https://github.com/hashicorp/consul/pull/1008#issuecomment-110226523)

The new behavior is:

  1. 3 agents would contend for the lock.
  2. 1 would get it.
  3. 2, 3 get a false return from kv.Acquire
  4. 2, 3 immediately do a non-blocking read on the key.
  5. 2, 3 see a session ID on the key, jump back to a blocking read (WAIT).
  6. After 1 second, 1 unlocks the key.
  7. 2, 3 contend for the key immediately since the blocking read returned.
  8. 2 acquires it, 3 gets a false return from kv.Acquire.
  9. 3 immediately does a non-blocking read on the key.
  10. 3 sees a session ID on the key and jumps back to a blocking read (WAIT).
  11. 2 does its business, releases the lock after 1 second.
  12. 3 contends for and acquires the lock due to the blocking read returning immediately. Total time passed: 2 seconds.

Suppose the first agent releases the lock somewhere between steps 3 and 4. Agents 2 and 3 then see a blank session on the key, which makes them sleep for 5 seconds to avoid a hot loop.

Would it be better to try to acquire the lock again right away instead of sleeping for a long time for nothing? Better still, the server could report that a key is in a lock-delay state. A blank session is too vague to tell the two cases apart.

slackpad commented 7 years ago

Hi @jyoon17, interesting. It does seem like we could add some lock-delay info as feedback to make the wait less of an open-loop thing.

slackpad commented 7 years ago

Adding the feedback could be tricky though, because the leader maintains the lock delay timers, so you'd have to read that in a consistent way.