nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Explore & Optimize: Ensuring Effective Log Storage Management for the Observability Cluster #505

Open schwesig opened 8 months ago

schwesig commented 8 months ago

Ensuring Effective Log Storage Management for the Observability Cluster

timeline:

Background:

We have set up an OpenShift cluster dedicated to observability (obs cluster), where logs are collected and stored using Loki and subsequently mirrored to S3 storage for extended retention. To keep the observability framework robust, the following topics need to be addressed and tracked.

Objectives:

Discussion Points:

  1. Storage Capacity Planning: Assess the storage size required to hold at least one year's worth of log data. The assessment can start from initial estimates and be adjusted based on actual usage patterns observed over the first year. Assign an issue to monitor storage trends and plan capacity adjustments as needed. (Alerts?)
  2. Storage Limit Enforcement: Ensure the S3 archive does not exceed budget allocations or planned capacity. This may involve setting quotas or using AWS/S3 native features to monitor and cap storage usage.
  3. Data Lifecycle Management:
    • xx% Capacity Threshold: Define a protocol for handling scenarios when storage utilization reaches xx%, including potential data lifecycle actions such as archiving older logs to another storage or deleting them.
    • Lifecycle Tools/Strategies: Investigate tools and mechanisms available within AWS, NooBaa, Loki, or other technologies that support data lifecycle management, focusing on capabilities to automatically expire or migrate data based on age, size, or other criteria.
    • GitOps for Configuration Management: Assess the feasibility of managing lifecycle policies and storage configurations through GitOps practices, or through other automated changes to the storage and data management configurations.
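
To make points 1 and 3 a bit more concrete, here is a minimal Python sketch of the arithmetic involved: a one-year storage estimate from a daily ingest rate, and a check of current utilization against thresholds. All names (`projected_storage_bytes`, `utilization_state`, `headroom`) are hypothetical, and the thresholds are deliberately required parameters because the xx% value has not been decided yet.

```python
GIB = 1024 ** 3  # bytes per GiB


def projected_storage_bytes(daily_ingest_bytes: float,
                            retention_days: int,
                            headroom: float) -> float:
    """Estimate the storage needed to keep `retention_days` of logs.

    `headroom` is an extra safety fraction (0.2 means +20%); it is an
    assumption of this sketch, not something agreed on in this issue.
    """
    if daily_ingest_bytes < 0 or retention_days <= 0 or headroom < 0:
        raise ValueError("inputs must be non-negative, retention_days > 0")
    return daily_ingest_bytes * retention_days * (1.0 + headroom)


def utilization_state(used_bytes: float,
                      capacity_bytes: float,
                      warn_at: float,
                      critical_at: float) -> str:
    """Classify bucket utilization as 'ok', 'warn' or 'critical'.

    `warn_at` and `critical_at` are fractions of capacity (for example 0.8)
    and must be chosen by the team; no default is provided on purpose.
    """
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    if not 0 < warn_at <= critical_at:
        raise ValueError("expected 0 < warn_at <= critical_at")
    ratio = used_bytes / capacity_bytes
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warn"
    return "ok"
```

For example, `projected_storage_bytes(10 * GIB, 365, 0.0) / GIB` gives 3650 GiB for a placeholder ingest rate of 10 GiB per day. The real ingest rate would have to come from observed Loki/S3 usage, and the state returned by `utilization_state` could feed whatever alerting ends up being used.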

This issue tracks that topic, with the goal of developing a sustainable and cost-effective log storage strategy for the Observability Cluster: one that complies with retention requirements while preserving energy, operational efficiency, and cost control, and that can serve as a best-practice role model.

schwesig commented 8 months ago

started in a meeting /cc @computate @bnshr @harshil-codes @jbasu01

We made a rough calculation of how much storage we need for a year (requirements). This estimate may turn out to be too low or too high.

Therefore:

schwesig commented 8 months ago

ideas and links from the meeting

https://access.redhat.com/documentation/en-us/red_hat_openshift_container_storage/4.7/html/managing_hybrid_and_multicloud_resources/bucket-policies-in-the-multicloud-object-gateway

https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutBucketLifecycleConfiguration.html

https://nasa.github.io/cumulus/docs/configuration/lifecycle-policies/

https://github.com/noobaa/noobaa-core/blob/d03bfb2c5a3bef6e71610b80ee34dfe785bd81f4/src/api/bucket_api.js#L592-L628

https://github.com/noobaa/noobaa-core/blob/d03bfb2c5a3bef6e71610b80ee34dfe785bd81f4/src/api/common_api.js#L145-L208

Here is an issue where we listed several noobaa and aws commands that we have used on buckets. https://github.com/nerc-project/operations/issues/75

{
    "Rules": [
        {
            "ID": "DeleteAfter365Days",
            "Expiration": {
                "Days": 365
            },
            "Status": "Enabled"
        }
    ]
}
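
If we want to keep such a policy in Git and apply it from a script, one possible shape is sketched below in Python. The helper names (`expiration_rule`, `lifecycle_configuration`, `apply_lifecycle`) are hypothetical; the only real API used is boto3's `put_bucket_lifecycle_configuration`. Whether NooBaa honors the resulting policy the same way AWS S3 does is an assumption that still needs to be verified, and the optional `prefix` filter is an addition that is not part of the JSON above.

```python
from typing import Optional


def expiration_rule(rule_id: str, days: int,
                    prefix: Optional[str] = None) -> dict:
    """Build one lifecycle rule that expires objects after `days` days.

    With prefix=None the result has the same keys as the JSON policy above.
    Passing a prefix (use "" for all objects) adds a Filter element, which
    some S3 implementations may expect; that part is an assumption.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    rule = {
        "ID": rule_id,
        "Expiration": {"Days": days},
        "Status": "Enabled",
    }
    if prefix is not None:
        rule["Filter"] = {"Prefix": prefix}
    return rule


def lifecycle_configuration(*rules: dict) -> dict:
    """Wrap rules in the top-level structure the S3 API expects."""
    return {"Rules": list(rules)}


def apply_lifecycle(bucket: str, config: dict,
                    endpoint_url: Optional[str] = None) -> None:
    """Apply the configuration to a bucket.

    Requires the third-party boto3 package and valid credentials.
    `endpoint_url` would point at a non-AWS S3 endpoint such as NooBaa.
    """
    import boto3  # imported here so the builders work without boto3

    s3 = boto3.client("s3", endpoint_url=endpoint_url)
    s3.put_bucket_lifecycle_configuration(
        Bucket=bucket,
        LifecycleConfiguration=config,
    )
```

`lifecycle_configuration(expiration_rule("DeleteAfter365Days", 365))` reproduces the policy above as a Python dict. Keeping the builder separate from the apply step means the policy can be reviewed and tested without touching a bucket.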