grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Table Manager is not deleting Loki logs in S3, even after retention period. #11314

Open dewstyh opened 9 months ago

dewstyh commented 9 months ago

Describe the bug: The Table Manager in Loki is not deleting the indexes that Loki stores. The bucket policy and IAM role are attached to the Loki service account, and the Table Manager logs show no errors around deletion. I'm using the loki-stack Helm chart. Grafana shows logs for the configured retention period (24 hours), but nothing is being deleted from S3.

To Reproduce Steps to reproduce the behavior:

  1. Started Loki, version 2.6.1
  2. Started Promtail, version 2.8.3, to tail '...'

Expected behavior

Loki should delete the logs from S3 once the retention period has passed.
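For reference, this is roughly how I expect the retention settings to be wired into the loki-stack values. The table_manager field names come from the Loki configuration reference; the surrounding structure is a simplified, illustrative sketch of my local.loki_settings, not the exact values I deploy:

# Illustrative sketch only; wrapper keys mirror how my loki_settings local is laid out.
config : {
  table_manager : {
    retention_deletes_enabled : true,   # deletions have to be switched on explicitly
    retention_period          : "24h",  # should be a multiple of the index period
  },
},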

Environment:

resource "helm_release" "prometheus" { count = var.deploy_demo_apps ? 1 : 0 depends_on = [ kubernetes_namespace.monitoring, ] name = "kube-prometheus-stack" namespace = kubernetes_namespace.monitoring.metadata[0].name repository = "https://prometheus-community.github.io/helm-charts" chart = "kube-prometheus-stack"

values = [yamlencode(local.prometheus_settings)] }

resource "helm_release" "loki" { count = var.deploy_demo_apps ? 1 : 0 depends_on = [ kubernetes_namespace.monitoring, aws_s3_bucket.loki, ] name = "loki" namespace = kubernetes_namespace.monitoring.metadata[0].name repository = "https://grafana.github.io/helm-charts" chart = "loki-stack"

values = [yamlencode(local.loki_settings)] }

resource "aws_s3_bucket" "loki" { count = local.is_test_environment ? 0 : 1 bucket = local.loki_bucket_name } resource "aws_iam_role" "loki_role" { count = local.is_test_environment ? 0 : 1 name = "loki-eks-role"

assume_role_policy = <<EOF { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Principal": { "Service": "eks.amazonaws.com" }, "Action": "sts:AssumeRole" }, { "Effect": "Allow", "Principal": { "Federated": "arn:aws:iam::${local.account_id}:oidc-provider/${local.oidc_provider}" }, "Action": "sts:AssumeRoleWithWebIdentity", "Condition": { "StringEquals": { "${local.oidc_provider}:sub": "system:serviceaccount:monitoring:loki", "${local.oidc_provider}:aud": "sts.amazonaws.com" } } } ] } EOF }

resource "aws_iam_policy" "loki_policy" { count = local.is_test_environment ? 0 : 1 name = "loki-s3-policy" description = "IAM policy for Loki to access S3"

policy = <<EOF { "Version": "2012-10-17", "Statement": [ { "Sid": "LokiStorage", "Effect": "Allow", "Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ], "Resource": [ "${aws_s3_bucket.loki[0].arn}", "${aws_s3_bucket.loki[0].arn}/*" ] } ] } EOF }

resource "aws_iam_role_policy_attachment" "loki_policy_attachment" { count = local.is_test_environment ? 0 : 1 policy_arn = aws_iam_policy.loki_policy[0].arn role = aws_iam_role.loki_role[0].name }

Screenshots, Promtail config, or terminal output. Recent Loki logs:

level=info ts=2023-11-24T15:33:36.629874322Z caller=table_manager.go:213 msg="syncing tables"
ts=2023-11-24T15:33:36.630003231Z caller=spanlogger.go:80 level=info msg="building index list cache"
level=info ts=2023-11-24T15:33:36.63945877Z caller=checkpoint.go:615 msg="starting checkpoint"
level=info ts=2023-11-24T15:33:36.639979121Z caller=checkpoint.go:340 msg="attempting checkpoint for" dir=/data/loki/wal/checkpoint.000577
ts=2023-11-24T15:33:36.791883519Z caller=spanlogger.go:80 level=info msg="index list cache built" duration=161.841962ms
level=info ts=2023-11-24T15:33:36.792091431Z caller=table_manager.go:252 msg="query readiness setup completed" duration=3.11µs distinct_users_len=0
ts=2023-11-24T15:33:42.63711288Z caller=spanlogger.go:80 level=info msg="building index list cache"
ts=2023-11-24T15:33:42.637256136Z caller=spanlogger.go:80 level=info msg="index list cache built" duration=46.26µs
level=info ts=2023-11-24T15:34:36.628775785Z caller=table_manager.go:134 msg="uploading tables"
level=info ts=2023-11-24T15:34:36.628856105Z caller=index_set.go:86 msg="uploading table loki_index_19683"
level=info ts=2023-11-24T15:34:36.628767143Z caller=table_manager.go:167 msg="handing over indexes to shipper"
level=info ts=2023-11-24T15:34:36.629088987Z caller=table.go:319 msg="handing over indexes to shipper loki_index_19683"
level=info ts=2023-11-24T15:34:36.629131034Z caller=table.go:335 msg="finished handing over table loki_index_19683"
level=info ts=2023-11-24T15:34:36.629199185Z caller=table.go:319 msg="handing over indexes to shipper loki_index_19684"
level=info ts=2023-11-24T15:34:36.62921143Z caller=table.go:335 msg="finished handing over table loki_index_19684"
level=info ts=2023-11-24T15:34:36.629247497Z caller=table.go:319 msg="handing over indexes to shipper loki_index_19685"
level=info ts=2023-11-24T15:34:36.629281432Z caller=table.go:335 msg="finished handing over table loki_index_19685"
level=info ts=2023-11-24T15:34:36.629788032Z caller=index_set.go:107 msg="finished uploading table loki_index_19683"
level=info ts=2023-11-24T15:34:36.629876857Z caller=index_set.go:185 msg="cleaning up unwanted indexes from table loki_index_19683"
level=info ts=2023-11-24T15:34:36.629949518Z caller=index_set.go:86 msg="uploading table loki_index_19684"
level=info ts=2023-11-24T15:34:36.62998962Z caller=index_set.go:107 msg="finished uploading table loki_index_19684"
level=info ts=2023-11-24T15:34:36.630016023Z caller=index_set.go:185 msg="cleaning up unwanted indexes from table loki_index_19684"
level=info ts=2023-11-24T15:34:36.630040273Z caller=index_set.go:86 msg="uploading table loki_index_19685"
level=info ts=2023-11-24T15:34:36.630102034Z caller=index_set.go:107 msg="finished uploading table loki_index_19685"
level=info ts=2023-11-24T15:34:36.630141364Z caller=index_set.go:185 msg="cleaning up unwanted indexes from table loki_index_19685"
level=info ts=2023-11-24T15:35:36.62887278Z caller=table_manager.go:134 msg="uploading tables"
level=info ts=2023-11-24T15:35:36.628940672Z caller=index_set.go:86 msg="uploading table loki_index_19683"
level=info ts=2023-11-24T15:35:36.628862279Z caller=table_manager.go:167 msg="handing over indexes to shipper"
level=info ts=2023-11-24T15:35:36.629259634Z caller=index_set.go:107 msg="finished uploading table loki_index_19683"
level=info ts=2023-11-24T15:35:36.629288588Z caller=index_set.go:185 msg="cleaning up unwanted indexes from table loki_index_19683"
level=info ts=2023-11-24T15:35:36.629302923Z caller=index_set.go:86 msg="uploading table loki_index_19684"
level=info ts=2023-11-24T15:35:36.629314575Z caller=index_set.go:107 msg="finished uploading table loki_index_19684"
level=info ts=2023-11-24T15:35:36.629329969Z caller=index_set.go:185 msg="cleaning up unwanted indexes from table loki_index_19684"
level=info ts=2023-11-24T15:35:36.629345327Z caller=index_set.go:86 msg="uploading table loki_index_19685"
level=info ts=2023-11-24T15:35:36.629371703Z caller=index_set.go:107 msg="finished uploading table loki_index_19685"
level=info ts=2023-11-24T15:35:36.629404757Z caller=index_set.go:185 msg="cleaning up unwanted indexes from table loki_index_19685"
level=info ts=2023-11-24T15:35:36.629688227Z caller=table.go:319 msg="handing over indexes to shipper loki_index_19683"
level=info ts=2023-11-24T15:35:36.6297169Z caller=table.go:335 msg="finished handing over table loki_index_19683"
level=info ts=2023-11-24T15:35:36.629767418Z caller=table.go:319 msg="handing over indexes to shipper loki_index_19684"
level=info ts=2023-11-24T15:35:36.629782568Z caller=table.go:335 msg="finished handing over table loki_index_19684"
level=info ts=2023-11-24T15:35:36.62982712Z caller=table.go:319 msg="handing over indexes to shipper loki_index_19685"
level=info ts=2023-11-24T15:35:36.629845728Z caller=table.go:335 msg="finished handing over table loki_index_19685"

Note: I deployed Loki on Nov 21st at 10:25 am with a retention period of two weeks, and then changed the retention period to 24 hours on Nov 22nd at 10:30 am. Today, Nov 24th at 10:55 am, logs from at least Nov 22nd 10:30 am to Nov 23rd 10:30 am should have been deleted, but nothing has been deleted from S3 so far.

diegocejasprieto commented 3 months ago

having the same issue

dewstyh commented 3 months ago

The Table Manager is not working as expected, so I went with the Compactor to do that job for me. Example code to configure the Compactor with boltdb-shipper:

config : {
  schema_config : {
    configs : [
      {
        from         : "2024-02-05",
        store        : "boltdb-shipper",
        object_store : "s3",
        schema       : "v11",
        index : {
          prefix : "lokiindex",
          period : "24h",
        },
      },
    ],
  },
  storage_config : {
    aws : {
      s3          : "s3://${var.AWS_REGION}/${local.loki_bucket_name}",
      bucketnames : local.loki_bucket_name,
      region      : var.AWS_REGION,
    },
    boltdb_shipper : {
      shared_store : "s3",
      cache_ttl    : "24h",
    },
  },
  compactor : {
    working_directory             : "/data/loki/boltdb-shipper-compactor",
    shared_store                  : "s3",
    compaction_interval           : "10m",
    retention_enabled             : true,
    retention_delete_delay        : "1h",
    retention_delete_worker_count : 150,
  },
  query_range : {
    parallelise_shardable_queries : false,
  },
  limits_config : {
    retention_period          : "168h",
    split_queries_by_interval : 0,
  },
},
serviceAccount : {
  annotations : {
    "eks.amazonaws.com/role-arn" : "arn:aws:iam::${local.account_id}:role/${aws_iam_role.loki_role.name}",
  },
},