lablup / backend.ai

Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
https://www.backend.ai
GNU Lesser General Public License v3.0
502 stars 150 forks source link

Health metric API to expose Redis connection pool metrics #2522

Open achimnol opened 1 month ago

achimnol commented 1 month ago

Let's add a Prometheus-compatible API endpoint to expose the health metrics of Redis connection pools used by each manager process.

The API handler itself should be very simple to implement. Since we have a multi-node multi-process architecture for Manager, we should use an external storage (Redis) to aggregate the metrics from different manager processes, and adopt a separate Redis connection mechanism like #2041 to avoid interference with the monitored connection pool.

The metric may be composed of:

achimnol commented 1 month ago
👀 #2567 feat: Add manager stat API that compatible with Prometheus +221/-0
👀 #2566 feat: Add manager status report bgtask +48/-3
👀 #2565 feat: Introduce models.manager module that treats manager status +142/-0