quickwit-oss / quickwit

Cloud-native search engine for observability. An open-source alternative to Datadog, Elasticsearch, Loki, and Tempo.
https://quickwit.io

Add retention policy #1265

Closed by fulmicoton 2 years ago

fulmicoton commented 2 years ago

We want to let users define a retention policy associated with an index.

The retention policy should periodically delete splits that fall out of retention (as part of the GC actor or a new dedicated one).

Related #403

guilload commented 2 years ago

A retention policy defines a period of time after which splits are deleted.

Configuration

Examples of retention in various systems

Setting a retention policy can be fairly simple or complex depending on the system. For Amazon CloudWatch or InfluxDB, it just boils down to defining a retention period with one line of configuration. More complex backends fold retention policies into a more generic rule system that configures actions or transitions that are conditionally applied depending on the properties of the managed objects. Those actions can be moving data between storage tiers, downsampling (rollup), or deleting data. OpenSearch even features a rule engine that is able to send notifications when a rule is applied!

Amazon CloudWatch (terraform)

resource "aws_cloudwatch_log_group" "yada" {
  name = "Yada"
  retention_in_days = 14
}

Reference

Amazon S3 (terraform)

resource "aws_s3control_bucket_lifecycle_configuration" "example" {
  bucket = aws_s3control_bucket.example.arn

  rule {
    expiration {
      days = 365
    }

    filter {
      prefix = "logs/"
    }

    id = "logs"
  }

  rule {
    expiration {
      days = 7
    }

    filter {
      prefix = "temp/"
    }

    id = "temp"
  }
}

Reference

InfluxDB

CREATE RETENTION POLICY "a_year" ON "food_data" DURATION 52w REPLICATION 1;

Reference

OpenSearch

{
  "policy": {
    "description": "hot warm delete workflow",
    "default_state": "hot",
    "schema_version": 1,
    "states": [
      {
        "name": "hot",
        "actions": [
          {
            "rollover": {
              "min_index_age": "1d"
            }
          }
        ],
        "transitions": [
          {
            "state_name": "warm"
          }
        ]
      },
      {
        "name": "warm",
        "actions": [
          {
            "replica_count": {
              "number_of_replicas": 5
            }
          }
        ],
        "transitions": [
          {
            "state_name": "delete",
            "conditions": {
              "min_index_age": "30d"
            }
          }
        ]
      },
      {
        "name": "delete",
        "actions": [
          {
            "notification": {
              "destination": {
                "chime": {
                  "url": "<URL>"
                }
              },
              "message_template": {
                "source": "The index {{ctx.index}} is being deleted"
              }
            }
          },
          {
            "delete": {}
          }
        ]
      }
    ]
  }
}

Reference

Proposed configuration for Quickwit

With that in mind, and unsurprisingly, for this iteration I suggest adding a simple property named retention_period to the index config and supporting a user-friendly, string-based format for expressing its duration.

index_id: my-index
retention_period: 14 days

Implementation pointers

Retention period format

We want to support hour(s), day(s), week(s), month(s), and year(s). Do we want to support minute(s)?

^([0-9]+)\s*(hour|day|week|month|year)s?$
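As a rough illustration of how a string such as "14 days" could be turned into a duration with the pattern above, here is a minimal Python sketch. The function name parse_retention_period and the fixed lengths used for months (30 days) and years (365 days) are assumptions made for this example; the proposal does not say how calendar units should be resolved.

```python
import re
from datetime import timedelta

# Pattern copied from the proposal above.
RETENTION_PERIOD_RE = re.compile(r"^([0-9]+)\s*(hour|day|week|month|year)s?$")

# Assumption: months and years are approximated with fixed lengths
# (30 and 365 days). The proposal leaves this open.
UNIT_TO_HOURS = {
    "hour": 1,
    "day": 24,
    "week": 7 * 24,
    "month": 30 * 24,
    "year": 365 * 24,
}


def parse_retention_period(value: str) -> timedelta:
    """Parse strings such as '14 days' or '1hour' into a timedelta."""
    # fullmatch rejects a trailing newline, which a bare `$` would tolerate.
    match = RETENTION_PERIOD_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"invalid retention period: {value!r}")
    count, unit = match.groups()
    return timedelta(hours=int(count) * UNIT_TO_HOURS[unit])
```

Note that the optional trailing "s" means the pattern accepts both "1 day" and "1 days". Minutes are rejected until the question above is settled.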

GC actor

For each index, the GC actor can be modified to check, at an appropriate interval, whether some splits should be deleted. The actor should also periodically check the index config to detect configuration changes. Ideally, the retention period should be applied from the publication time, but since we do not currently store this information, the creation time will do. However, I do see pathological cases where this can behave unexpectedly (a very short retention period combined with a long delay between stage and publish). The GC actor currently runs every minute, so retention policies should be applied rapidly, even though I personally find that interval a bit aggressive. It would be nice to eventually move away from this polling-based strategy and switch to a notification-based one.
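The check described here can be sketched in a few lines of Python. This is only an illustration: Split, created_at, and expired_splits are hypothetical names, not Quickwit types, and the example uses the split creation time as the reference point, as suggested above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Split:
    # Hypothetical stand-in for split metadata.
    split_id: str
    created_at: datetime  # creation time, used in place of publication time


def expired_splits(splits, retention_period: timedelta, now: datetime):
    """Return the splits whose creation time is at or before the cutoff."""
    cutoff = now - retention_period
    return [split for split in splits if split.created_at <= cutoff]


# One pass of a hypothetical periodic check for a single index.
now = datetime(2022, 4, 15, tzinfo=timezone.utc)
splits = [
    Split("old", datetime(2022, 3, 31, tzinfo=timezone.utc)),
    Split("recent", datetime(2022, 4, 10, tzinfo=timezone.utc)),
]
to_delete = expired_splits(splits, timedelta(days=14), now)
```

With a 14-day retention period, only the split created on March 31 is selected. The pathological case mentioned above shows up here too: a split staged long before it is published could already be past the cutoff on its first pass.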

fulmicoton commented 2 years ago

Tier configuration à la Splunk or Elastic might be useful in the future, but I agree it is overkill at this point. Even after we add it, keeping a single retention period parameter as an alternative will be very useful for users.

I wonder if we should define some cut-off limit too, and have it play nicely with split emission and merging.

Let me clarify: for compliance purposes, some users want a rather sharp cut: "I want all of my docs older than 30 days at midnight to be deleted." In that case, we could make sure that splits get cut at midnight, and that a merge that would cross midnight does not happen. This assumes we implement this with regard to an ingestion timestamp...
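To make the midnight constraint concrete, here is a small Python sketch. It assumes UTC midnights and that each split carries an inclusive (start, end) range of ingestion timestamps. The helper names day_bucket, can_merge, and retention_cutoff are made up for this example.

```python
from datetime import datetime, timedelta, timezone


def day_bucket(ts: datetime) -> datetime:
    """Return the UTC midnight that starts the day containing ts."""
    ts = ts.astimezone(timezone.utc)
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def can_merge(range_a, range_b) -> bool:
    """Allow a merge only if the merged split would stay within one UTC day.

    Each range is an inclusive (start, end) pair of ingestion timestamps.
    """
    start = min(range_a[0], range_b[0])
    end = max(range_a[1], range_b[1])
    return day_bucket(start) == day_bucket(end)


def retention_cutoff(now: datetime, retention_days: int) -> datetime:
    """Midnight boundary: docs ingested before it are out of retention."""
    return day_bucket(now) - timedelta(days=retention_days)
```

If splits never span a midnight, every split falls entirely on one side of the cutoff, so a sharp cut only requires deleting whole splits.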

In the Elastic/OpenSearch world, this is addressed with index templates, which we can live without... Still, it might be nice to let users define clean partitions between splits.

PSeitz commented 2 years ago

If the reason to delete is cost, we could consider skipping the merge part and only deleting whole splits that fall outside the period.
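A sketch of what "whole splits outside the period" could mean in practice, assuming each split exposes the (min, max) timestamp range of its documents. The function name fully_expired and the dictionary layout are hypothetical.

```python
from datetime import datetime, timezone


def fully_expired(split_ranges, cutoff: datetime):
    """Return ids of splits whose newest document is older than the cutoff.

    split_ranges maps a split id to its (min_ts, max_ts) document range.
    """
    return sorted(
        split_id
        for split_id, (_min_ts, max_ts) in split_ranges.items()
        if max_ts < cutoff
    )


utc = timezone.utc
cutoff = datetime(2022, 3, 16, tzinfo=utc)
split_ranges = {
    "a": (datetime(2022, 3, 1, tzinfo=utc), datetime(2022, 3, 10, tzinfo=utc)),
    "b": (datetime(2022, 3, 12, tzinfo=utc), datetime(2022, 3, 20, tzinfo=utc)),
    "c": (datetime(2022, 4, 1, tzinfo=utc), datetime(2022, 4, 2, tzinfo=utc)),
}
deletable = fully_expired(split_ranges, cutoff)
```

Here split "b" straddles the cutoff and is kept until its newest document ages out, so some documents outlive the retention period. That is acceptable when the goal is cost, but not when a strict cut is required.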

fulmicoton commented 2 years ago

@PSeitz Both exist. Some people want to delete data for cost reasons; others are trying to enforce a policy (e.g. GDPR).