lablup / backend.ai

Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
https://www.backend.ai
GNU Lesser General Public License v3.0

Leader election-based distributed timer and image rescan rate limiting #415

Open achimnol opened 2 years ago

achimnol commented 2 years ago

Recently, we reworked our distributed locks and global timers to use etcd's concurrency API, which guarantees active-active HA.

However, there are still edge cases that require global coordination across all manager processes, such as access to a rate-limited container registry (e.g., Docker Hub as an anonymous user). Since many manager processes receive API requests in a load-balanced fashion, it is difficult to share the rate-limit state between them. This is why lablup/backend.ai-manager#501 is on hold.
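To illustrate the problem, here is a minimal sketch in which every manager process keeps its own rate-limit counter. The `FixedWindowLimiter` class, the `simulate` helper, and the limit of 100 requests per window are all hypothetical and are not taken from the Backend.AI codebase or from any registry's actual policy.

```python
from dataclasses import dataclass


@dataclass
class FixedWindowLimiter:
    """Hypothetical per-process limiter: at most `limit` requests per window."""

    limit: int
    used: int = 0

    def try_acquire(self) -> bool:
        # Refuse once this process has spent its own budget for the window.
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


def simulate(num_managers: int, limit: int, requests: int) -> int:
    """Spread `requests` round-robin over managers; return how many were admitted."""
    limiters = [FixedWindowLimiter(limit) for _ in range(num_managers)]
    admitted = 0
    for i in range(requests):
        # A load balancer hands each request to the next manager in turn.
        if limiters[i % num_managers].try_acquire():
            admitted += 1
    return admitted


# Three managers that each believe they own the whole quota (assumed: 100).
independent = simulate(num_managers=3, limit=100, requests=400)
# A single process that owns the counter on behalf of the cluster.
single_owner = simulate(num_managers=1, limit=100, requests=400)
```

With independent counters the cluster as a whole admits three times the intended quota, while a single owner of the counter admits exactly the quota. Sharing one counter across processes would also work, but it needs a coordinated store, which is the difficulty described above.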

Let's localize such globally coordinated state to a single manager process, the leader. To keep high availability, we should periodically check the liveness of the leader and re-elect one when it fails. Fortunately, etcd provides the facilities to implement this.
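The liveness check and re-election can be sketched with a lease that the leader must keep renewing. The `LeaseStore` class below is an in-memory stand-in written only for illustration. It is not the etcd API, and the manager names, the 10-second TTL, and the explicit `now` timestamps are assumptions chosen to make the example deterministic.

```python
from typing import Optional


class LeaseStore:
    """In-memory stand-in for a leader key guarded by a TTL lease (NOT the etcd API)."""

    def __init__(self) -> None:
        self.holder: Optional[str] = None
        self.expires_at: float = 0.0

    def campaign(self, candidate: str, now: float, ttl: float) -> bool:
        """Take or renew leadership if the lease is free, expired, or already ours."""
        lease_free = self.holder is None or now >= self.expires_at
        if lease_free or self.holder == candidate:
            self.holder = candidate
            self.expires_at = now + ttl
            return True
        return False

    def leader(self, now: float) -> Optional[str]:
        """Return the current leader, or None if the lease has lapsed."""
        return self.holder if now < self.expires_at else None


store = LeaseStore()
TTL = 10.0  # assumed lease duration in seconds

assert store.campaign("manager-1", now=0.0, ttl=TTL)      # first candidate wins
assert not store.campaign("manager-2", now=5.0, ttl=TTL)  # lease is still live
assert store.campaign("manager-1", now=8.0, ttl=TTL)      # heartbeat renews until t=18
# manager-1 crashes and stops renewing, so the lease lapses at t=18.
assert store.leader(now=18.0) is None
assert store.campaign("manager-2", now=19.0, ttl=TTL)     # re-election succeeds
```

In a real deployment the lease and the compare-and-set in `campaign` would have to be provided by a consistent store, and each manager would call `campaign` from a periodic timer. Only the process that currently holds the lease would run the rate-limited registry scans.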

rapsealk commented 2 years ago

It seems that etcd's Election API is intended only for etcd's own cluster. Therefore, I wrote aioraft-ng, which can run in standalone mode and be attached to the Backend.AI manager.