Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPUs, ROCm GPUs, TPUs, IPUs, and other NPUs.
Recently we have reworked our distributed locks and global timers to use etcd's concurrency API to guarantee active-active HA.
However, there are still edge cases that require global coordination of all manager processes, such as rate-limited container registry access (e.g., Docker Hub with anonymous users). Since there are many manager processes that receive the API requests in a load-balanced fashion, it is difficult to share the rate-limit states between different manager processes. This is why lablup/backend.ai-manager#501 is on hold.
Let's localize such globally coordinated states to a single manager process, i.e., a leader.
To keep high availability, we should perform periodic checks on the liveness of the leader and re-elect one when it fails; fortunately, etcd provides the facilities to implement this.
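The election scheme boils down to a compare-and-create on a leased key: every manager periodically tries to create a well-known key with a TTL, the winner keeps refreshing its lease, and when the leader dies the lease expires so the survivors re-elect. Below is a minimal sketch of that loop; the `FakeEtcd` class is an in-memory stand-in for the etcd keyspace (the real implementation would use etcd transactions and leases), and all names are illustrative.

```python
import asyncio
import time


class FakeEtcd:
    # In-memory stand-in for etcd: maps key -> (value, lease expiry).
    def __init__(self):
        self._kv = {}

    def put_if_absent(self, key, value, ttl):
        # Emulates an etcd transaction: succeed only when the key
        # does not exist or its lease has already expired.
        now = time.monotonic()
        cur = self._kv.get(key)
        if cur is None or cur[1] <= now:
            self._kv[key] = (value, now + ttl)
            return True
        return False

    def refresh(self, key, value, ttl):
        # Emulates a lease keepalive; only the current leader succeeds.
        cur = self._kv.get(key)
        if cur is not None and cur[0] == value:
            self._kv[key] = (value, time.monotonic() + ttl)
            return True
        return False


async def elect_loop(etcd, node_id, on_elected, ttl=0.2, interval=0.05):
    # Periodically try to stay (or become) the leader.
    while True:
        if etcd.refresh("/leader", node_id, ttl):
            pass  # still the leader; lease renewed
        elif etcd.put_if_absent("/leader", node_id, ttl):
            on_elected(node_id)  # just won the election
        await asyncio.sleep(interval)


async def demo():
    etcd = FakeEtcd()
    leaders = []
    tasks = {
        f"mgr{i}": asyncio.create_task(elect_loop(etcd, f"mgr{i}", leaders.append))
        for i in range(3)
    }
    await asyncio.sleep(0.1)       # exactly one manager wins the first election
    tasks[leaders[-1]].cancel()    # kill the current leader
    await asyncio.sleep(0.5)       # its lease expires; a survivor takes over
    for t in tasks.values():
        t.cancel()
    return leaders


leaders = asyncio.run(demo())
print(leaders)  # two distinct leaders: the original and its replacement
```

The important property is that a follower never becomes leader while the leased key is alive, so a network-partitioned old leader loses leadership automatically once it can no longer refresh.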
[x] manager: Implement leader election of manager processes with periodic leader status checks.
[x] manager: Rewrite global timer to run on the leader manager process. (When a new leader is elected, the new one should start the global timers while the old one, if still alive, should stop them.)
[ ] manager: Add a generic "leader task" message queue based on Redis streams to reroute API requests that are accepted by arbitrary manager processes but must be processed exclusively by the leader.
[ ] manager: Rewrite lablup/backend.ai-manager#501 to use a local aiolimiter state to implement its own rate-limiting to the container registries. Use the leader task queue to trigger the rescan task.
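For the second checklist item, the handover logic reduces to "start the timer on promotion, cancel it on demotion." A minimal asyncio sketch of that idea follows; the class and method names are hypothetical, not the actual Backend.AI timer API.

```python
import asyncio


class GlobalTimer:
    # Fires a callback at a fixed interval, but only while this
    # manager process holds leadership (hypothetical sketch).
    def __init__(self, interval, callback):
        self.interval = interval
        self.callback = callback
        self._task = None

    def on_leadership_change(self, is_leader):
        # Called by the election loop on every leadership transition:
        # start the timer on promotion, cancel it on demotion.
        if is_leader and self._task is None:
            self._task = asyncio.create_task(self._run())
        elif not is_leader and self._task is not None:
            self._task.cancel()
            self._task = None

    async def _run(self):
        while True:
            await asyncio.sleep(self.interval)
            self.callback()


async def demo():
    ticks = []
    timer = GlobalTimer(0.02, lambda: ticks.append(1))
    timer.on_leadership_change(True)    # promoted: timer starts
    await asyncio.sleep(0.11)
    timer.on_leadership_change(False)   # demoted: timer must stop
    await asyncio.sleep(0.05)           # let the cancellation take effect
    before = len(ticks)
    await asyncio.sleep(0.1)
    return before, len(ticks)


before, after = asyncio.run(demo())
```

The key invariant is that at most one process runs the timer task at any time, because promotion and demotion are driven by the single leased leader key.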
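For the last two checklist items, the flow can be sketched end to end: any manager accepts a registry-rescan API request and enqueues it as a "leader task," and only the leader consumes the queue under a rate limit. In the sketch below, `asyncio.Queue` stands in for the Redis stream and a lock-based limiter stands in for aiolimiter's `AsyncLimiter`; the function names and registry hostnames other than Docker Hub are made up for illustration.

```python
import asyncio


class SimpleLimiter:
    # Minimal stand-in for aiolimiter.AsyncLimiter: enforces a
    # minimum delay between consecutive operations.
    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = asyncio.Lock()

    async def __aenter__(self):
        await self._lock.acquire()

    async def __aexit__(self, *exc):
        await asyncio.sleep(self.min_interval)
        self._lock.release()


async def handle_api_request(queue, registry):
    # Runs on *any* manager: just enqueue the work for the leader.
    await queue.put(registry)


async def leader_consumer(queue, limiter, results):
    # Runs only on the leader: drain the stream under the rate limit,
    # so the shared per-registry quota lives in a single process.
    while True:
        registry = await queue.get()
        async with limiter:
            results.append(f"rescanned {registry}")


async def demo():
    queue = asyncio.Queue()       # stand-in for the Redis stream
    limiter = SimpleLimiter(0.02)
    results = []
    consumer = asyncio.create_task(leader_consumer(queue, limiter, results))
    # Three load-balanced managers each accept one rescan request.
    for reg in ("docker.io", "cr.example.com", "ghcr.io"):
        await handle_api_request(queue, reg)
    await asyncio.sleep(0.2)
    consumer.cancel()
    return results


results = asyncio.run(demo())
```

Because only the leader touches the registry, the limiter state stays local to one process, which is exactly what makes the on-hold rate-limiting work in lablup/backend.ai-manager#501 feasible.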
It seems that the etcd Election API is only available within etcd's own cluster. Therefore, I wrote aioraft-ng, which can run in standalone mode and can be attached to our Backend.AI manager.