Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
When the database (PostgreSQL) or etcd is terminated, the manager keeps retrying the connection. However, the failure is shown as a raw exception message instead of a log entry.
When the Redis connection is terminated and the manager retries it, no log indicates when the connection is successfully restored, so it is difficult to tell exactly when Redis reconnected.
Tasks
Ensure that an exception is logged when PostgreSQL or etcd connection retries fail.
When a Redis request is retried and then completes successfully, log the successful retry as an info-level event.
I think it would also be better to send a ping request periodically to track the status of the Redis server.
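As a rough illustration of the periodic ping idea, here is a minimal sketch that logs only when the connection state changes. It is not taken from the Backend.AI code base: `monitor_connection`, the logger name, and the `errors` tuple are hypothetical, and `ping` stands for any coroutine function that raises when the server is unreachable. The exception types to catch depend on the client library in use.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple, Type

log = logging.getLogger("ping_monitor_sketch")  # hypothetical logger name


async def monitor_connection(
    ping: Callable[[], Awaitable[object]],
    *,
    service: str,
    interval: float,
    stop: asyncio.Event,
    # Placeholder exception types; substitute what the client actually raises.
    errors: Tuple[Type[BaseException], ...] = (OSError, asyncio.TimeoutError),
) -> None:
    """Ping periodically and log transitions between healthy and unhealthy."""
    healthy = True
    while not stop.is_set():
        try:
            await ping()
        except errors:
            # Log only on the healthy -> unhealthy transition to avoid spam.
            if healthy:
                log.warning("%s is unreachable", service, exc_info=True)
            healthy = False
        else:
            # Log the recovery so the reconnection time is visible in the logs.
            if not healthy:
                log.info("%s connection restored", service)
            healthy = True
        # Sleep for `interval` seconds, but wake up early if `stop` is set.
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
```

Logging transitions rather than every ping keeps the log quiet during a long outage while still recording when the outage started and when it ended.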
Expected Results
Consistent logging of both failure and recovery events, making it easier to track the issues and their resolutions.
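The failure and recovery logging described above could look roughly like the following sketch. It is an assumption-laden illustration, not the manager's actual retry code: `call_with_retry`, its parameters, and the logger name are hypothetical, and the default `errors` tuple is a placeholder for whatever the PostgreSQL, etcd, or Redis client raises.

```python
import asyncio
import logging
from typing import Awaitable, Callable, Tuple, Type, TypeVar

log = logging.getLogger("retry_sketch")  # hypothetical logger name
T = TypeVar("T")


async def call_with_retry(
    func: Callable[[], Awaitable[T]],
    *,
    service: str,
    max_attempts: int = 5,
    delay: float = 0.5,
    # Placeholder exception types; substitute what the client actually raises.
    errors: Tuple[Type[BaseException], ...] = (OSError, asyncio.TimeoutError),
) -> T:
    """Retry `func`, logging every failure and the eventual recovery."""
    attempt = 1
    while True:
        try:
            result = await func()
        except errors:
            # log.exception sends the traceback through the logging system
            # at ERROR level instead of leaving it as a bare exception message.
            log.exception(
                "%s request failed (attempt %d/%d)", service, attempt, max_attempts
            )
            if attempt >= max_attempts:
                raise
            attempt += 1
            await asyncio.sleep(delay)
        else:
            # Only a success that follows at least one failure is a recovery.
            if attempt > 1:
                log.info("%s request succeeded after %d attempts", service, attempt)
            return result
```

With this shape, a request that fails twice and then succeeds produces two error records followed by one info record, so both the outage and its resolution appear in the same log stream.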