Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPUs, ROCm GPUs, TPUs, IPUs, and other NPUs.
Recently we have reworked our distributed locks and global timers to use etcd's concurrency API to guarantee active-active HA.
However, there are still edge cases that require global coordination of all manager processes, such as rate-limited container registry access (e.g., Docker Hub with anonymous users). Since there are many manager processes that receive the API requests in a load-balanced fashion, it is difficult to share the rate-limit states between different manager processes. This is why lablup/backend.ai-manager#501 is on hold.
Let's localize such globally coordinated states to a single manager process, i.e., a leader.
To keep high availability, we should perform periodic checks on the liveness of the leader and re-elect one when it fails; fortunately, etcd provides the facilities to implement this.
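The election scheme boils down to a compare-and-create on a leased key: every manager periodically tries to create a well-known key with a TTL, the winner keeps refreshing its lease, and when the leader dies the lease expires so the survivors re-elect. Below is a minimal sketch of that loop; the `FakeEtcd` class is an in-memory stand-in for the etcd keyspace (the real implementation would use etcd transactions and leases), and all names are illustrative.

```python
import asyncio
import time


class FakeEtcd:
    # In-memory stand-in for etcd: maps key -> (value, lease expiry).
    def __init__(self):
        self._kv = {}

    def put_if_absent(self, key, value, ttl):
        # Emulates an etcd transaction: succeed only when the key
        # does not exist or its lease has already expired.
        now = time.monotonic()
        cur = self._kv.get(key)
        if cur is None or cur[1] <= now:
            self._kv[key] = (value, now + ttl)
            return True
        return False

    def refresh(self, key, value, ttl):
        # Emulates a lease keepalive; only the current leader succeeds.
        cur = self._kv.get(key)
        if cur is not None and cur[0] == value:
            self._kv[key] = (value, time.monotonic() + ttl)
            return True
        return False


async def elect_loop(etcd, node_id, on_elected, ttl=0.2, interval=0.05):
    # Periodically try to stay (or become) the leader.
    while True:
        if etcd.refresh("/leader", node_id, ttl):
            pass  # still the leader; lease renewed
        elif etcd.put_if_absent("/leader", node_id, ttl):
            on_elected(node_id)  # just won the election
        await asyncio.sleep(interval)


async def demo():
    etcd = FakeEtcd()
    leaders = []
    tasks = {
        f"mgr{i}": asyncio.create_task(elect_loop(etcd, f"mgr{i}", leaders.append))
        for i in range(3)
    }
    await asyncio.sleep(0.1)       # exactly one manager wins the first election
    tasks[leaders[-1]].cancel()    # kill the current leader
    await asyncio.sleep(0.5)       # its lease expires; a survivor takes over
    for t in tasks.values():
        t.cancel()
    return leaders


leaders = asyncio.run(demo())
print(leaders)  # two distinct leaders: the original and its replacement
```

The important property is that a follower never becomes leader while the leased key is alive, so a network-partitioned old leader loses leadership automatically once it can no longer refresh.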
[x] manager: Implement leader election of manager processes with periodic leader status checks.
[x] manager: Rewrite global timer to run on the leader manager process. (When a new leader is elected, the new one should start the global timers while the old one, if still alive, should stop them.)
[ ] manager: Add a generic "leader task" message queue based on Redis streams to reroute API requests that are accepted by arbitrary manager processes but must be processed exclusively by the leader.
[ ] manager: Rewrite lablup/backend.ai-manager#501 to use a local aiolimiter state to implement its own rate-limiting to the container registries. Use the leader task queue to trigger the rescan task.
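For the second checklist item, the handover logic reduces to "start the timer on promotion, cancel it on demotion." A minimal asyncio sketch of that idea follows; the class and method names are hypothetical, not the actual Backend.AI timer API.

```python
import asyncio


class GlobalTimer:
    # Fires a callback at a fixed interval, but only while this
    # manager process holds leadership (hypothetical sketch).
    def __init__(self, interval, callback):
        self.interval = interval
        self.callback = callback
        self._task = None

    def on_leadership_change(self, is_leader):
        # Called by the election loop on every leadership transition:
        # start the timer on promotion, cancel it on demotion.
        if is_leader and self._task is None:
            self._task = asyncio.create_task(self._run())
        elif not is_leader and self._task is not None:
            self._task.cancel()
            self._task = None

    async def _run(self):
        while True:
            await asyncio.sleep(self.interval)
            self.callback()


async def demo():
    ticks = []
    timer = GlobalTimer(0.02, lambda: ticks.append(1))
    timer.on_leadership_change(True)    # promoted: timer starts
    await asyncio.sleep(0.11)
    timer.on_leadership_change(False)   # demoted: timer must stop
    await asyncio.sleep(0.05)           # let the cancellation take effect
    before = len(ticks)
    await asyncio.sleep(0.1)
    return before, len(ticks)


before, after = asyncio.run(demo())
```

The key invariant is that at most one process runs the timer task at any time, because promotion and demotion are driven by the single leased leader key.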
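For the last two checklist items, the flow can be sketched end to end: any manager accepts a registry-rescan API request and enqueues it as a "leader task," and only the leader consumes the queue under a rate limit. In the sketch below, `asyncio.Queue` stands in for the Redis stream and a lock-based limiter stands in for aiolimiter's `AsyncLimiter`; the function names and registry hostnames other than Docker Hub are made up for illustration.

```python
import asyncio


class SimpleLimiter:
    # Minimal stand-in for aiolimiter.AsyncLimiter: enforces a
    # minimum delay between consecutive operations.
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = asyncio.Lock()

    async def __aenter__(self):
        await self._lock.acquire()

    async def __aexit__(self, *exc):
        await asyncio.sleep(self.min_interval)
        self._lock.release()


async def handle_api_request(queue, registry):
    # Runs on *any* manager: just enqueue the work for the leader.
    await queue.put(registry)


async def leader_consumer(queue, limiter, results):
    # Runs only on the leader: drain the stream under the rate limit,
    # so the shared per-registry quota lives in a single process.
    while True:
        registry = await queue.get()
        async with limiter:
            results.append(f"rescanned {registry}")


async def demo():
    queue = asyncio.Queue()       # stand-in for the Redis stream
    limiter = SimpleLimiter(0.02)
    results = []
    consumer = asyncio.create_task(leader_consumer(queue, limiter, results))
    # Three load-balanced managers each accept one rescan request.
    for reg in ("docker.io", "cr.example.com", "ghcr.io"):
        await handle_api_request(queue, reg)
    await asyncio.sleep(0.2)
    consumer.cancel()
    return results


results = asyncio.run(demo())
```

Because only the leader touches the registry, the limiter state stays local to one process, which is exactly what makes the on-hold rate-limiting work in lablup/backend.ai-manager#501 feasible.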
It seems that the etcd Election API is only available within etcd's own cluster. Therefore, I wrote aioraft-ng, which can run in standalone mode and can be attached to our Backend.AI manager.