lablup / backend.ai

Backend.AI is a streamlined, container-based computing cluster platform that hosts popular computing/ML frameworks and diverse programming languages, with pluggable heterogeneous accelerator support including CUDA GPU, ROCm GPU, TPU, IPU and other NPUs.
https://www.backend.ai
GNU Lesser General Public License v3.0

Improve container log collection #3007

Open HyeockJinKim opened 4 hours ago

HyeockJinKim commented 4 hours ago

References

Motivation

Log collection has a known issue: when session logs grow too large, querying them in the web UI fails with connection errors. This needs urgent attention because log volume can grow rapidly when using model services. Proactive measures such as log rotation (removing old log files) or pagination (limiting the amount of data requested at one time) can help, but we aim to fundamentally improve the system by collecting logs in a location separate from the agent. This separation improves scalability by reducing the load on the agent and avoiding potential bottlenecks. It also enables better log management and analysis, since external logging servers can provide more robust storage, retrieval, and querying features, giving quicker access to critical information during troubleshooting.
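To make the pagination idea concrete, here is a minimal sketch of reading a session log in bounded pages so that a single request never has to return the whole file. It assumes logs are available as a plain file on disk; `read_log_page`, `LogPage`, and the 64 KiB default are hypothetical names and values for illustration, not existing Backend.AI APIs.

```python
from pathlib import Path
from typing import NamedTuple


class LogPage(NamedTuple):
    data: bytes       # raw log bytes for this page
    next_offset: int  # byte offset the client should request next
    eof: bool         # True when no more data is currently available


def read_log_page(path: Path, offset: int = 0, limit: int = 64 * 1024) -> LogPage:
    """Return at most `limit` bytes of the log file, starting at `offset`.

    Hypothetical helper: the file layout and the default page size are
    assumptions, not part of Backend.AI.
    """
    if offset < 0 or limit <= 0:
        raise ValueError("offset must be >= 0 and limit must be > 0")
    size = path.stat().st_size
    start = min(offset, size)  # clamp offsets that point past the end
    with path.open("rb") as f:
        f.seek(start)
        data = f.read(limit)
    next_offset = start + len(data)
    return LogPage(data, next_offset, next_offset >= size)
```

A client would call this repeatedly, passing the previous `next_offset`, until `eof` is true. Pages are cut at byte boundaries, so a real implementation would probably want to align them to line breaks.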

Main Tasks

I want to adopt a method where collected logs are not stored in the agent but instead forwarded to an external logging server. Ideally, I would like to leverage existing open-source projects rather than implementing this from scratch.

I suggest that we focus on the following tasks:

  1. Implement log rotation to prevent excessive accumulation of logs.
  2. Analyze and compare open-source log collection tools (such as Fluent Bit, Logstash, and Vector) to determine which is most suitable for Backend.AI.
    • Ensure that the log collection tool allows for easy configuration changes to the storage location of the logs.
  3. Identify the changes needed in the installation guidelines and, if possible, make it possible to enable the new features through configuration.
  4. (Optional) Consider implementing visualization tools for statistical metrics or problem analysis in the future.
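As a rough illustration of task 1, size-based rotation could look like the sketch below, which uses Python's standard `logging.handlers.RotatingFileHandler`. How the agent actually writes container logs is not described in this issue, so the `make_rotating_log` helper, the file naming, and the size limits are all assumptions.

```python
import logging
from logging.handlers import RotatingFileHandler
from pathlib import Path


def make_rotating_log(
    path: Path,
    max_bytes: int = 10 * 1024 * 1024,  # assumed cap per file (10 MiB)
    backup_count: int = 5,              # assumed number of old files to keep
) -> logging.Logger:
    """Return a logger that writes to `path` and rotates it by size.

    Hypothetical helper for illustration; not an existing Backend.AI API.
    """
    logger = logging.getLogger(f"container-log.{path}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep container output out of the root logger
    if not logger.handlers:   # avoid attaching duplicate handlers on reuse
        handler = RotatingFileHandler(
            path, maxBytes=max_bytes, backupCount=backup_count
        )
        handler.setFormatter(logging.Formatter("%(message)s"))
        logger.addHandler(handler)
    return logger
```

With these settings, disk usage per container stays at roughly `max_bytes * (backup_count + 1)`, because the oldest backup is deleted on each rollover. If container output is captured by the container runtime rather than by Python code, the equivalent limits would need to be set there instead.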

Expected Results

achimnol commented 3 hours ago

Technical considerations:

Ideas: