Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU, and other NPUs.
Let's add a Prometheus-compatible API endpoint to expose the health metrics of Redis connection pools used by each manager process.
The API handler itself should be simple to implement. Since the Manager has a multi-node, multi-process architecture, we should use external storage (Redis) to aggregate the metrics from the different manager processes, and adopt a separate Redis connection mechanism like #2041 to avoid interference with the monitored connection pool.
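As a rough illustration, the handler could render the aggregated pool statistics in the Prometheus text exposition format. The metric name, label keys, and the shape of the stats mapping below are purely illustrative assumptions, not the actual Backend.AI metric schema:

```python
# Hypothetical sketch: format Redis connection-pool health metrics as
# Prometheus text exposition output. The metric name
# "bai_redis_pool_connections" and the labels "pool"/"state" are
# illustrative assumptions, not an existing Backend.AI schema.

def render_pool_metrics(pools: dict[str, dict[str, int]]) -> str:
    """Render per-pool connection counts as a Prometheus gauge.

    `pools` maps a pool name to a mapping of connection states to counts,
    e.g. {"live": {"in_use": 3, "available": 5}}.
    """
    lines = [
        "# HELP bai_redis_pool_connections "
        "Redis connection pool usage per manager process.",
        "# TYPE bai_redis_pool_connections gauge",
    ]
    for name, stats in pools.items():
        for state, value in stats.items():
            # One sample per (pool, state) pair, labeled accordingly.
            lines.append(
                f'bai_redis_pool_connections{{pool="{name}",state="{state}"}} {value}'
            )
    return "\n".join(lines) + "\n"
```

An aiohttp handler would then return this string with the content type `text/plain; version=0.0.4`, which Prometheus scrapers accept for the text format.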
The metric may be composed of: