grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Old chunks not getting deleted after retention period #6300

Open wzjjack opened 2 years ago

wzjjack commented 2 years ago

Describe the bug
I've configured 168h retention for my logs, but I can see chunks 5 years old filling my disk.

To Reproduce

This is my config:

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096
  grpc_server_max_recv_msg_size: 8388608
  grpc_server_max_send_msg_size: 8388608
querier:
  engine:
    max_look_back_period: 168h      

ingester:
  wal:
    enabled: true
    dir: /tmp/wal
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  chunk_idle_period: 24h       # Any chunk not receiving new logs in this time will be flushed
  max_chunk_age: 24h           # All chunks will be flushed when they hit this age, default is 1h
  chunk_target_size: 1048576  # Loki will attempt to build chunks up to this size (1MB here), flushing first if chunk_idle_period or max_chunk_age is reached
  chunk_retain_period: 5m    # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  max_transfer_retries: 0     # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /tmp/loki/boltdb-shipper-active
    cache_location: /tmp/loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /tmp/loki/chunks

compactor:
  working_directory: /tmp/loki/boltdb-shipper-compactor
  shared_store: filesystem

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  max_streams_per_user: 1000000
  max_entries_limit_per_query: 5000000
  ingestion_rate_mb: 100
  ingestion_burst_size_mb: 20 
chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: false
  retention_period: 168h

Expected behavior
Chunks older than 168h should be deleted.

Environment:

Screenshots, Promtail config, or terminal output
We can see 49 days of logs although I've configured 168h (screenshot attached).

DeBuXer commented 2 years ago

I have the same issue. Logs older than 7 days are deleted and are no longer visible in Grafana, but the chunk files are not deleted from the filesystem.

loki, version 2.5.0 (branch: HEAD, revision: 2d9d0ee23)
  build user:       root@4779f4b48f3a
  build date:       2022-04-07T21:50:00Z
  go version:       go1.17.6
  platform:         linux/amd64
server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h
splitice commented 2 years ago

Use the compactor, not the table_manager, if you aren't using AWS S3.
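
In practice, "use the compactor" means enabling compactor-based retention instead of table-manager retention. A minimal sketch for a single-binary Loki 2.x with filesystem storage (paths are placeholders; adjust to your own layout):

compactor:
  working_directory: /loki/compactor    # placeholder path
  shared_store: filesystem              # match the store your index is shipped to
  compaction_interval: 10m
  retention_enabled: true               # required for the compactor to actually delete data

limits_config:
  retention_period: 168h                # keep 7 days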

DeBuXer commented 2 years ago

Use the compactor, not the table_manager, if you aren't using AWS S3.

Thanks, that did the trick :)

Mastedont commented 2 years ago

Hey @DeBuXer, could you post what you added to your config file to get deletion on S3 working? I am running into the same issue and have not found a solution.

DeBuXer commented 2 years ago

@Mastedont, I don't use S3, I store my chunks directly on disk. My current configuration:

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

chunk_store_config:
  max_look_back_period: 168h

compactor:
  working_directory: /var/lib/loki/retention
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

limits_config:
  retention_period: 168h

ruler:
  alertmanager_url: http://127.0.0.1:9093
Mastedont commented 2 years ago

Thank you, @DeBuXer

Mastedont commented 2 years ago

One last question @DeBuXer.

Do you know how I can tell whether log retention is working? What log output indicates that it is working?

DeBuXer commented 2 years ago

@Mastedont, Not 100% sure, but I guess:

Jun 14 15:15:44 loki loki[277929]: level=info ts=2022-06-14T13:15:44.57537489Z caller=index_set.go:280 table-name=index_19150 msg="removing source db files from storage" count=1
Jun 14 15:15:44 loki loki[277929]: level=info ts=2022-06-14T13:15:44.576099223Z caller=compactor.go:495 msg="finished compacting table" table-name=index_19150
Mastedont commented 2 years ago

Is that log output from the ingester?

I can only see output like this, despite having the compactor enabled:

level=info ts=2022-06-14T14:05:17.574831148Z caller=table.go:358 msg="uploading table loki_pre_19157"
level=info ts=2022-06-14T14:05:17.574847901Z caller=table.go:385 msg="finished uploading table loki_pre_19157"
level=info ts=2022-06-14T14:05:17.57485537Z caller=table.go:443 msg="cleaning up unwanted dbs from table loki_pre_19157"
DeBuXer commented 2 years ago

Is that log output from the ingester?

It's from /var/log/syslog, but it should contain the same information. When the compactor is enabled, you should see something like:

level=info ts=2022-06-14T14:24:56.072803949Z caller=compactor.go:324 msg="this instance has been chosen to run the compactor, starting compactor"
rickydjohn commented 2 years ago

@DeBuXer, thanks a lot for your support here. I don't see the chunk files getting rotated, and I also see some fairly old index directories. I want my logs to be rotated every 7 days. I am not sure what I am doing wrong here; could you please help me with it?

auth_enabled: false
chunk_store_config:
  max_look_back_period: 168h

compactor:
  shared_store: filesystem
  working_directory: /data/loki/boltdb-shipper-compactor

ingester:
  chunk_block_size: 262144
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  wal:
    dir: /data/loki/wal 
    flush_on_shutdown: true
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
  max_transfer_retries: 0

limits_config:
  enforce_metric_name: false
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  ingestion_rate_mb: 32
  ingestion_burst_size_mb: 36
  unordered_writes: true
  retention_period: 168h

schema_config:
  configs:
  - from: 2020-10-24
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v11
    store: boltdb-shipper

server:
  http_listen_port: 3100

storage_config:
  boltdb_shipper:
    active_index_directory: /data/loki/boltdb-shipper-active
    cache_location: /data/loki/boltdb-shipper-cache
    cache_ttl: 24h
    shared_store: filesystem
  filesystem:
    directory: /data/loki/chunks

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h
DeBuXer commented 2 years ago

@rickydjohn, I think you need to set retention_enabled: true in the compactor config. See also https://grafana.com/docs/loki/latest/operations/storage/retention/#retention-configuration
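
Applied to the config above, that would mean extending the compactor block roughly as follows (a sketch reusing rickydjohn's paths; option names as in Loki 2.x):

compactor:
  working_directory: /data/loki/boltdb-shipper-compactor
  shared_store: filesystem
  retention_enabled: true       # without this, the compactor only compacts and never deletes
  compaction_interval: 10m

With compactor retention enabled, limits_config.retention_period (168h above) is what drives deletion, and the table_manager retention settings should become redundant.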

stale[bot] commented 2 years ago

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project. A stalebot can be very useful in closing issues in a number of cases; the most common is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly sort for closed issues which have a stale label sorted by thumbs up.

We are doing our best to respond, organize, and prioritize all issues but it can be a challenging task, our sincere apologies if you find yourself at the mercy of the stalebot.

atze234 commented 2 years ago

Hi @Mastedont, did you manage to get the chunks deleted from S3? I'm having the same problem: I can't see any logs from the compactor, and the S3 store contains files older than my configured retention (>7 days). It seems only the index is cleared, because Grafana won't show older log entries.

ghost commented 2 years ago

Hi @Mastedont, did you manage to get the chunks deleted from S3? I'm having the same problem: I can't see any logs from the compactor, and the S3 store contains files older than my configured retention (>7 days). It seems only the index is cleared, because Grafana won't show older log entries.

I have the same problem

webfrank commented 1 year ago

Hi, I have this relevant config:

compactor:
      compaction_interval: 10m
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      retention_enabled: true
      shared_store: s3
      working_directory: /var/loki/retention
limits_config:
      enforce_metric_name: false
      max_cache_freshness_per_query: 10m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      retention_period: 720h
      split_queries_by_interval: 30m

but log files are not deleted from S3, only the index is compacted.

jarrettprosser commented 1 year ago

I'm also finding this on loki 2.4.0, using minio as storage. Even with retention_delete_delay: 5m no chunks are being deleted.

Codecaver commented 1 year ago

@Mastedont, I don't use S3, I store my chunks directly on disk. My current configuration:

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

chunk_store_config:
  max_look_back_period: 168h

compactor:
  working_directory: /var/lib/loki/retention
  shared_store: filesystem
  compaction_interval: 10m
  retention_enabled: true
  retention_delete_delay: 2h
  retention_delete_worker_count: 150

limits_config:
  retention_period: 168h

ruler:
  alertmanager_url: http://127.0.0.1:9093

Hi, will this configuration clean up expired files in the chunks directory?

patsevanton commented 1 year ago

Any update?

seany89 commented 1 year ago

Judging from the discussion in https://github.com/grafana/loki/issues/7068, I don't think the compactor will delete the chunks in an S3 object store; you need a bucket lifecycle policy for that.

It would be nice to have a clear answer on this though.

nvanheuverzwijn commented 1 year ago

For everyone wondering what's going on with retention: I've tested the feature a lot over the past few days, so here is what works.

Minimal Configuration Needed

First of all, you absolutely need this configuration set up:

limits_config:
        retention_period: 10d # Keep 10 days
compactor:
        delete_request_cancel_period: 10m # don't wait 24h before processing the delete_request
        retention_enabled: true # actually do the delete
        retention_delete_delay: 2h # wait 2 hours before actually deleting stuff

You can tweak these settings to delete faster or slower.

Check If It's Working

Once you have this config up and running, check that the logs actually report that retention is being applied: msg="applying retention with compaction". The "caller" for this log is compactor.go.

Next, check in the logs that the retention manager is actually doing its job: msg="mark file created" and msg="no marks file found" from the caller marker.go.

The mark file created message means that Loki found some chunks to delete and created a file to keep track of them. The no marks file found message means that while performing the chunk delete routine, no marker file matched its filters, the filter mainly being the delay.

Whenever you see the mark file created logs, you can go into the compactor's working directory and check for the marker files. The path should be something like /var/loki/compactor/retention/markers. These files are kept there for 2 hours, or whatever is set in retention_delete_delay. After retention_delete_delay has passed, Loki deletes the chunks.

If you don't see any of the logs mentioned above, the retention process has not started.

Important Notes

Loki will only delete chunks that are indexed, and the index entries are purged before the chunks are deleted. This means that if you lose files from the compactor's working directory, the chunks that were marked there will never be deleted, so it is still worth having a lifecycle policy to cover this case, or persistent storage for that particular folder.
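
If you do add a lifecycle policy as a safety net for such orphaned chunks, one way to express it is sketched below in CloudFormation style; the bucket name is a placeholder, and the expiry should comfortably exceed retention_period plus retention_delete_delay:

Resources:
  LokiChunksBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-loki-chunks          # placeholder bucket name
      LifecycleConfiguration:
        Rules:
          - Id: expire-orphaned-chunks
            Status: Enabled
            ExpirationInDays: 45          # well beyond the configured Loki retention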

amseager commented 1 year ago

@nvanheuverzwijn if I were the CTO of Grafana Labs, I would give you a job offer immediately

adthonb commented 1 year ago

@nvanheuverzwijn Thank you a lot! Your explanation makes things clear to me. The Loki documentation had me confused into thinking that the Table Manager also deletes chunks when using the filesystem chunk store.

yangmeilly commented 1 year ago

More info about 'Check If It's Working': with compaction_interval: 10m, if the Loki instance starts at 2023-07-11T12:30:25.060395045Z, then the caller=compactor.go logs appear at ts=2023-07-11T12:40:25.047110295Z:

level=info ts=2023-07-11T12:30:25.060441045Z caller=compactor.go:440 msg="waiting 10m0s for ring to stay stable and previous compactions to finish before starting compactor"
level=info ts=2023-07-11T12:40:25.045542628Z caller=compactor.go:445 msg="compactor startup delay completed"
level=info ts=2023-07-11T12:40:25.045568295Z caller=compactor.go:497 msg="compactor started"
level=info ts=2023-07-11T12:40:25.04562367Z caller=compactor.go:454 msg="applying retention with compaction"
level=info ts=2023-07-11T12:40:25.047110295Z caller=compactor.go:609 msg="compacting table" table-name=loki_index_19549
level=info ts=2023-07-11T12:40:25.047208753Z caller=table_compactor.go:325 table-name=loki_index_19549 msg="using compactor-1689078092.gz as seed file"
level=info ts=2023-07-11T12:40:25.048495753Z caller=util.go:85 table-name=loki_index_19549 file-name=compactor-1689078092.gz msg="downloaded file" total_time=1.280041ms
level=info ts=2023-07-11T12:40:25.06665592Z caller=compactor.go:614 msg="finished compacting table" table-name=loki_index_19549
level=info ts=2023-07-11T12:40:25.066668503Z caller=compactor.go:609 msg="compacting table" table-name=loki_index_19548
level=info ts=2023-07-11T12:40:25.067591628Z caller=util.go:85 table-name=loki_index_19548 file-name=compactor-1689041382.gz msg="downloaded file" total_time=863.125µs
level=info ts=2023-07-11T12:40:25.078401878Z caller=compactor.go:614 msg="finished compacting table" table-name=loki_index_19548

Nurlan199206 commented 1 year ago

@yangmeilly

Can you send your full loki.yaml config, please?

It is not working for me.

yangmeilly commented 1 year ago

@yangmeilly

Can you send your full loki.yaml config, please?

It is not working for me.

In my scenario I'm using boltdb-shipper for the index and the filesystem for chunks. My full Loki config is as follows; the retention-related settings deserve your attention.

compactor:
  compaction_interval: 10m
  delete_request_cancel_period: 2h
  retention_delete_delay: 2h
  retention_delete_worker_count: 150
  retention_enabled: true
  shared_store: filesystem
  working_directory: /var/loki/retention

limits_config:
  enforce_metric_name: false
  max_cache_freshness_per_query: 10m
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  split_queries_by_interval: 15m
  retention_period: 72h
  max_query_lookback: 72h

table_manager:   # this makes no sense for filesystem
  retention_deletes_enabled: false
  retention_period: 0

HammerNL89 commented 1 year ago

Whenever you see the mark file created logs, you can go into the working directory of the compactor and check for the mark files. The path should be something like /var/loki/compactor/retention/markers. These files are kept there for 2 hours or whatever is set in retention_delete_delay. After retention_delete_delay is passed, loki will delete the chunks.

If you don't see any of the logs mentioned above, the retention process has not started.

@nvanheuverzwijn Thanks for the info. Regarding your statement that Loki will delete the chunks: are you talking about a filesystem backend only, or also an S3/Azure backend? I can't find a definitive answer stating that Loki is able to delete chunks from external storage.

nvanheuverzwijn commented 1 year ago

It will also delete on S3/Azure. I did this with Google Cloud Storage, but it should be the same for the other backends.

stringang commented 1 year ago

@nvanheuverzwijn The compactor did not delete the chunks. Why?

compactor log:

level=info ts=2023-08-03T06:50:12.634846248Z caller=compactor.go:497 msg="compactor started"
level=info ts=2023-08-03T06:50:12.634865722Z caller=compactor.go:454 msg="applying retention with compaction"
level=info ts=2023-08-03T06:50:12.634865349Z caller=marker.go:177 msg="mark processor started" workers=150 delay=2h0m0s
level=info ts=2023-08-03T06:50:12.634955656Z caller=expiration.go:78 msg="overall smallest retention period 1690440612.634, default smallest retention period 1690440612.634"
ts=2023-08-03T06:50:12.635021334Z caller=spanlogger.go:85 level=info msg="building index list cache"
level=info ts=2023-08-03T06:50:12.635046761Z caller=marker.go:202 msg="no marks file found"

config:

storage_config:
  aws:
    access_key_id: xxxxxx
    bucketnames: loki
    endpoint: https://s3.xxxx.com
    s3forcepathstyle: true
    secret_access_key: xxxxx
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 24h
    index_gateway_client:
      server_address: dns:///loki-distributed-index-gateway:9095
    shared_store: s3

compactor:
  retention_enabled: true
  shared_store: s3
  working_directory: /var/loki/compactor
  retention_delete_delay: 2h
  delete_request_cancel_period: 10m

limits_config:
  enforce_metric_name: false
  ingestion_burst_size_mb: 1024
  ingestion_rate_mb: 1024
  max_cache_freshness_per_query: 10m
  max_global_streams_per_user: 0
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 1h
  split_queries_by_interval: 15m

Update: it was caused by my incorrect configuration.

The storage configuration needs to be placed in the common block:

common:
  compactor_address: http://loki-distributed-compactor:3100
  storage:
    s3:
      access_key_id: xxxxxx
      bucketnames: loki
      endpoint: https://s3.xxxx.com
      s3forcepathstyle: true
      secret_access_key: xxxxxx
yangfan-witcher commented 1 year ago

@nvanheuverzwijn so beautiful

ningyougang commented 1 year ago

@stringang can you share the whole loki.yaml?

I am also testing this; log files are not deleted from S3, only the index is compacted.

  1. The index is compacted (screenshot).
  2. The log files are not deleted (screenshot).

My whole configuration:

    auth_enabled: false
    chunk_store_config:
      max_look_back_period: 0s
    compactor:
      shared_store: s3
      working_directory: /data/loki/boltdb-shipper-compactor
      retention_enabled: true
      compaction_interval: 10m
      retention_delete_delay: 2h
    distributor:
      ring:
        kvstore:
          store: memberlist
    ingester:
      chunk_block_size: 262144
      chunk_idle_period: 3m
      chunk_retain_period: 1m
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
      max_transfer_retries: 0
      query_store_max_look_back_period: 0
      wal:
        enabled: true
        dir: /data/wal
    querier:
      max_concurrent: 20
    limits_config:
      ingestion_rate_mb: 8
      ingestion_burst_size_mb: 16
      per_stream_rate_limit: 5MB
      per_stream_rate_limit_burst: 15MB
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 24h
      retention_period: 24h
    memberlist:
      join_members:
        - loki-headless
    schema_config:
      configs:
      - from: "2022-11-20"
        store: boltdb-shipper
        object_store: s3
        schema: v11
        index:
          period: 24h
          prefix: index_
        chunks:
          period: 24h
          prefix: chunks_
    server:
      http_listen_port: 3100
      log_level: debug
    storage_config:
      object_prefix: test-nyg
      boltdb_shipper:
        active_index_directory: /data/loki/boltdb-shipper-active
        cache_location: /data/loki/boltdb-shipper-cache
        cache_ttl: 24h
        shared_store: s3
    common:
      storage:
        s3:
          s3: s3://test-admin:test-admin@s3Address:10000/$bucketName
          s3forcepathstyle: true

Am I missing something?

stringang commented 1 year ago

@ningyougang

auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
common:
  compactor_address: http://loki-distributed-compactor:3100
  storage:
    s3:
      access_key_id: xxxxxxxxxxx
      bucketnames: loki
      endpoint: https://s3.xxxxxx.com
      s3forcepathstyle: true
      secret_access_key: xxxxxxxxxxx
compactor:
  delete_request_cancel_period: 10m
  retention_delete_delay: 1h
  retention_enabled: true
  shared_store: s3
  working_directory: /var/loki/compactor
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 5s
  tail_proxy_url: http://loki-distributed-querier:3100
frontend_worker:
  frontend_address: loki-distributed-query-frontend-headless:9095
ingester:
  chunk_block_size: 262144
  chunk_encoding: snappy
  chunk_idle_period: 1h
  chunk_retain_period: 1m
  chunk_target_size: 8388608
  lifecycler:
    join_after: 10s
    observe_period: 5s
    ring:
      heartbeat_timeout: 10m
      kvstore:
        store: memberlist
      replication_factor: 3
  max_transfer_retries: 0
  wal:
    dir: /var/loki/wal
ingester_client:
  grpc_client_config:
    grpc_compression: gzip
limits_config:
  enforce_metric_name: false
  ingestion_burst_size_mb: 1024
  ingestion_rate_mb: 1024
  max_cache_freshness_per_query: 10m
  max_global_streams_per_user: 0
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 1d
  split_queries_by_interval: 15m
memberlist:
  join_members:
  - loki-distributed-memberlist
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        ttl: 24h
ruler:
  alertmanager_url: http://am.xxxxxxxxxxx.com/
  enable_alertmanager_v2: true
  enable_api: true
  enable_sharding: true
  ring:
    kvstore:
      store: memberlist
  rule_path: /tmp/loki/scratch
  storage:
    local:
      directory: /etc/loki/rules
    type: local
runtime_config:
  file: /var/loki-distributed-runtime/runtime.yaml
schema_config:
  configs:
  - from: "2023-08-12"
    index:
      period: 24h
      prefix: loki_index_
    object_store: s3
    schema: v11
    store: boltdb-shipper
server:
  grpc_server_max_recv_msg_size: 8388608
  http_listen_port: 3100
  log_level: debug
storage_config:
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 24h
    index_gateway_client:
      server_address: dns:///loki-distributed-index-gateway:9095
    shared_store: s3
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
ben5556 commented 1 year ago

@nvanheuverzwijn I do see mark file created, but then a few minutes later the logs show:

level=info ts=2023-09-11T05:23:41.019592474Z caller=compactor.go:364 msg="applying retention with compaction"
level=info ts=2023-09-11T05:23:41.020937376Z caller=expiration.go:60 msg="overall smallest retention period 1691817821.02, default smallest retention period 1691817821.02"
level=info ts=2023-09-11T05:23:41.065287615Z caller=marker.go:78 msg="mark file created" file=/data/loki/boltdb-shipper-compactor/retention/markers/1694409821056046054

A few seconds later:

level=info ts=2023-09-11T05:24:41.019702032Z caller=marker.go:203 msg="no marks file found"
level=info ts=2023-09-11T05:25:41.020103387Z caller=marker.go:203 msg="no marks file found"
level=info ts=2023-09-11T05:26:41.020436386Z caller=marker.go:203 msg="no marks file found"
level=info ts=2023-09-11T05:27:41.020020721Z caller=marker.go:203 msg="no marks file found"
level=info ts=2023-09-11T05:28:41.01973026Z caller=marker.go:203 msg="no marks file found"
level=info ts=2023-09-11T05:30:41.020528317Z caller=marker.go:203 msg="no marks file found"

So it looks like even after 5 minutes (retention_delete_delay) it still cannot find the mark file, although I verified that the mark file exists in that location. Below is my Loki config:

auth_enabled: false
chunk_store_config:
  max_look_back_period: 0s
compactor:
  delete_request_cancel_period: 10m
  retention_delete_delay: 5m
  retention_delete_worker_count: 150
  retention_enabled: true
  shared_store: filesystem
  working_directory: /data/loki/boltdb-shipper-compactor
ingester:
  chunk_block_size: 262144
  chunk_idle_period: 3m
  chunk_retain_period: 1m
  lifecycler:
    ring:
      replication_factor: 1
  max_transfer_retries: 0
  wal:
    dir: /data/loki/wal
limits_config:
  enforce_metric_name: false
  max_entries_limit_per_query: 5000
  max_query_lookback: 720h
  reject_old_samples: true
  reject_old_samples_max_age: 168h
  retention_period: 720h
memberlist:
  join_members:
  - 'loki-memberlist'
schema_config:
  configs:
  - from: "2020-10-24"
    index:
      period: 24h
      prefix: index_
    object_store: filesystem
    schema: v11
    store: boltdb-shipper
server:
  grpc_listen_port: 9095
  http_listen_port: 3100
  http_server_read_timeout: 120s
kirannhegde commented 12 months ago

This is not working for me even with Grafana Loki v2.8.4.

Loki version: v2.8.4
AKS version: 1.23.x
Storage backend: Azure Blob Storage

From reading the documentation for the compactor, I get the impression that the compactor is capable of deleting old chunks from blob storage. However, old Loki chunks are not getting deleted from blob storage even though the necessary configuration is in place. Could someone be kind enough to tell me what could be wrong in my configuration? When I inspect the compactor logs, I see that the marker file is being created. I also see a lot of API calls to GET /loki/api/v1/delete, but I don't see any POST calls to /loki/api/v1/delete, which gives me the impression that no deletion is happening. I confirmed that chunks from several months ago are still sitting in my blob storage.

auth_enabled: false

    server:
      http_listen_port: {{ .Values.loki.containerPorts.http }}
      log_level: debug
    common:
      compactor_address: http://{{ include "grafana-loki.compactor.fullname" . }}:{{ .Values.compactor.service.ports.http }}
      storage:
        azure:
          account_name: abc 
          account_key: abc
          container_name: abc
          use_managed_identity: false
          request_timeout: 0 

    distributor:
      ring:
        kvstore:
          store: memberlist

    memberlist:
      join_members:
        - {{ include "grafana-loki.gossip-ring.fullname" . }}

    ingester:
      lifecycler:
        ring:
          kvstore:
            store: memberlist
          replication_factor: 1
      chunk_idle_period: 2h                # Any chunk not receiving new logs in this time will be flushed
      chunk_block_size: 262144
      chunk_encoding: snappy
      chunk_retain_period: 1m
      max_chunk_age: 2h                     # All chunks will be flushed when they hit this age, default is 1h
      max_transfer_retries: 0
      autoforget_unhealthy: true
      wal:
        dir: {{ .Values.loki.dataDir }}/wal

    limits_config:
      retention_period: 48h
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      max_cache_freshness_per_query: 10m
      split_queries_by_interval: 15m
      per_stream_rate_limit: 10MB
      per_stream_rate_limit_burst: 20MB
      ingestion_rate_mb: 100
      ingestion_burst_size_mb: 30

    schema_config:
      configs:
      - from: 2020-10-24
        store: boltdb-shipper
        object_store: azure
        schema: v11
        index:
          prefix: index_
          period: 24h
        chunks:
            period: 24h

    storage_config:    
      boltdb_shipper:
        shared_store: azure
        active_index_directory: {{ .Values.loki.dataDir }}/loki/index
        cache_location: {{ .Values.loki.dataDir }}/loki/cache
        cache_ttl: 168h
        {{- if .Values.indexGateway.enabled }}
        index_gateway_client:
          server_address: {{ (printf "dns:///%s:9095" (include "grafana-loki.index-gateway.fullname" .)) }}
        {{- end }}
      filesystem:
        directory: {{ .Values.loki.dataDir }}/chunks
      index_queries_cache_config:
        {{- if .Values.memcachedindexqueries.enabled }}
        memcached:
          batch_size: 100
          parallelism: 100
        memcached_client:
          consistent_hash: true
          addresses: dns+{{ include "grafana-loki.memcached-index-queries.host" . }}
          service: http
        {{- end }}

    chunk_store_config:
      max_look_back_period: 2d
      {{- if .Values.memcachedchunks.enabled }}
      chunk_cache_config:
        memcached:
          batch_size: 100
          parallelism: 100
        memcached_client:
          consistent_hash: true
          addresses: dns+{{ include "grafana-loki.memcached-chunks.host" . }}
      {{- end }}
      {{- if .Values.memcachedindexwrites.enabled }}
      write_dedupe_cache_config:
        memcached:
          batch_size: 100
          parallelism: 100
        memcached_client:
          consistent_hash: true
          addresses: dns+{{ include "grafana-loki.memcached-index-writes.host" . }}
      {{- end }}

    table_manager:
      retention_deletes_enabled: true
      retention_period: 2d

    query_range:
      align_queries_with_step: true
      max_retries: 5
      cache_results: true
      results_cache:
        cache:
          {{- if .Values.memcachedfrontend.enabled }}
          memcached_client:
            consistent_hash: true
            addresses: dns+{{ include "grafana-loki.memcached-frontend.host" . }}
            max_idle_conns: 16
            timeout: 500ms
            update_interval: 1m
          {{- else }}
          enable_fifocache: true
          fifocache:
            max_size_items: 1024
            validity: 24h
          {{- end }}
    {{- if not .Values.queryScheduler.enabled }}
    frontend_worker:
      frontend_address: {{ include "grafana-loki.query-frontend.fullname" . }}:{{ .Values.queryFrontend.service.ports.grpc }}
    {{- end }}

    frontend:
      log_queries_longer_than: 5s
      compress_responses: true
      tail_proxy_url: http://{{ include "grafana-loki.querier.fullname" . }}:{{ .Values.querier.service.ports.http }}

    compactor:
      working_directory: {{ .Values.loki.dataDir }}/retention
      shared_store: azure
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150

    ruler:
      storage:
        type: local
        local:
          directory: {{ .Values.loki.dataDir }}/conf/rules
      ring:
        kvstore:
          store: memberlist
      rule_path: /tmp/loki/scratch
      alertmanager_url: http://abc.bdc.com/alertmanager
      external_url: https://abc.bdc.com/alertmanager
tomasz-kazmierczak commented 11 months ago

Hi, I had a similar issue with the recent 2.9.1 version of Loki. It appears there has recently been some work on the deletion_mode property (https://grafana.com/docs/loki/latest/operations/storage/logs-deletion/). This property is now configurable per tenant within the runtime config; the default value is not documented, but I had to enforce filter-and-delete mode before chunks started being deleted from my object storage.

The helm values config for that setting is:

loki:
  runtimeConfig:
    overrides:
      fake:
        deletion_mode: filter-and-delete

I hope this solves your issues.
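
For non-Helm deployments, the same override presumably goes into the file referenced by runtime_config.file in the Loki config (a sketch; fake is the tenant ID Loki uses when auth_enabled is false):

# runtime.yaml, referenced by runtime_config.file
overrides:
  fake:
    deletion_mode: filter-and-delete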

alexmeise commented 8 months ago

As far as I understood from https://github.com/grafana/loki/issues/7068#issuecomment-1347131945, chunks are never going to be deleted from S3 by the compactor. It is not possible to configure retention so that the compactor actually deletes the chunks from S3; this needs to be done via a lifecycle policy or some other mechanism. What the compactor will do is manage the chunk/index relationship so that you don't get errors about deleted chunks/indexes when running queries. I guess it will actually delete references to chunks in the indexes, but not the chunk files themselves. Am I right? It's a bit counter-intuitive, and it would help if this were clarified in the documentation.

otaku-bowei commented 7 months ago

@nvanheuverzwijn I learned a lot from this. How can I know which chunks are indexed?

myaswanth03 commented 3 months ago

I got the same issue. Logs older than 7 days are deleted and they're not visible in Grafana. Only the chunk files won't be deleted on the filesystem.

loki, version 2.5.0 (branch: HEAD, revision: 2d9d0ee23)
  build user:       root@4779f4b48f3a
  build date:       2022-04-07T21:50:00Z
  go version:       go1.17.6
  platform:         linux/amd64
server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  path_prefix: /var/lib/loki
  storage:
    filesystem:
      chunks_directory: /var/lib/loki/chunks
      rules_directory: /var/lib/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

chunk_store_config:
  max_look_back_period: 168h

table_manager:
  retention_deletes_enabled: true
  retention_period: 168h

Hi, I used the same file; Loki is up and running, but retention is not working.

Permissions are also fine.

level=info ts=2024-05-30T07:12:05.009458014Z caller=metrics.go:159 component=frontend org_id=fake traceID=6cc7b11248ffd6f2 latency=fast query="sum by (level) (count_over_time({job=\"containerlogs\"} | drop error[1s]))" query_hash=1978880347 query_type=metric range_type=range length=5m1s start_delta=5m2.009446978s end_delta=1.009448479s step=1s duration=930.872366ms status=200 limit=100 returned_lines=0 throughput=12kB total_bytes=11kB total_bytes_structured_metadata=0B lines_per_second=82 total_lines=77 post_filter_lines=77 total_entries=1 store_chunks_download_time=0s queue_time=100.027997ms splits=0 shards=16 cache_chunk_req=0 cache_chunk_hit=0 cache_chunk_bytes_stored=0 cache_chunk_bytes_fetched=0 cache_chunk_download_time=0s cache_index_req=0 cache_index_hit=0 cache_index_download_time=0s cache_stats_results_req=0 cache_stats_results_hit=0 cache_stats_results_download_time=0s cache_result_req=0 cache_result_hit=0 cache_result_download_time=0s source=logvolhist
level=info ts=2024-05-30T07:12:05.180281106Z caller=roundtrip.go:241 org_id=fake traceID=7306103002fd206d msg="executing query" type=range query="{job=\"containerlogs\"}" length=5m0s step=5m0s query_hash=3605303691
level=info ts=2024-05-30T07:12:05.180937762Z caller=engine.go:234 component=querier org_id=fake traceID=7306103002fd206d msg="executing query" type=range query="{job=\"containerlogs\"}" length=5m0s step=5m0s query_hash=3605303691

krishgu commented 1 month ago

The write-up above by @nvanheuverzwijn is spot on. Thank you! There are a lot of docs presenting lifecycle policies as the "only" way to purge, and those are outdated. This setup seems to be working fine.

In the compactor or the backend pod, the working_dir/retention/storage/markers folder will have these marker files.

/var/loki/compactor/retention/aws/markers $ ls -la
total 312
drwxr-sr-x    2 loki     loki          4096 Aug  8 19:50 .
drwxr-sr-x    3 loki     loki          4096 Aug  7 19:59 ..
-rw-r--r--    1 loki     loki         32768 Aug  8 18:09 1723140578748104859
-rw-r--r--    1 loki     loki         32768 Aug  8 18:19 1723141178696907357
...
-rw-r--r--    1 loki     loki         32768 Aug  8 19:49 1723146578638477113
-rw-r--r--    1 loki     loki         32768 Aug  8 19:59 1723147178354528360

If the compactor cycle runs every 10 minutes (the default), you will see a new file created and the oldest file processed and deleted every 10 minutes, and in S3 (or whichever storage provider you use) those objects no longer exist.

To find out which objects are about to be deleted, run strings or od -c on the marker file. Doing this on the oldest marker file tells us which objects are next in line to be purged:

/var/loki/compactor/retention/aws/markers $ strings 1723140578748104859
chunks
fake/65ff478cf0700326:190eb0ab615:190eb125142:ee83e028
fake/36bfb27dc259bf11:190eaa1db48:190eb0fef22:7101872a
fake/eecc8a1a1bba16ea:190eb0a8a31:190eb122448:156be59e
fake/dd094d6bff631bd5:190eaa581d0:190eb13688f:62c6862c
fake/f78ea3e78078c811:190eaa51fb0:190eb12fcfd:41bb1a13
fake/225757be61ef1d7a:190eaa50dad:190eb12f961:48b0c78c
fake/f4a7b9aef3c86224:190eb0c4cb2:190eb13a35e:a2919ec1
  1. To prove there is no old data, use the aws s3api command below to list the oldest object. You should not see any objects older than your retention_period if all went according to plan.
  2. You can see that the oldest object is fake/36bfb27dc259bf1..., which appeared in the output of the previous strings 1723140578748104859 command (the 2nd of the fake/... objects). Ten minutes later you will see it deleted. This tallies correctly with the 14d retention period.
$ aws s3api list-objects-v2 --bucket dframe-loki- --prefix fake --query 'sort_by(Contents, &LastModified)[0]' --output json
{
    "Key": "fake/36bfb27dc259bf11:190eaa1db48:190eb0fef22:7101872a",
    "LastModified": "2024-07-25T18:03:45.000Z",
    "ETag": "\"606ec603a5cb748a96a3ea3e4fb11b87\"",
    "Size": 7000,
    "StorageClass": "STANDARD"

}

This ticket can be closed, not sure why it is still open...

kdvermagojoko commented 1 month ago

Hi, I have the configuration below in my Helm override file to get a retention period of 24h, but I still see old index files in S3. I am using the loki-stack-2.10.2 chart. Any idea what I am missing here?

loki:
  serviceAccount:
    name: loki-service-account
    create: false
  config:
    schema_config:
      configs:
        - from: 2020-10-24
          store: boltdb-shipper
          object_store: s3
          schema: v11
          index:
            prefix: loki_index_
            period: 24h
    storage_config:
      aws:
        s3: s3://aws_region/s3-bucket
        s3forcepathstyle: true
        bucketnames: s3-bucket
        region: aws_region
        insecure: false
        sse_encryption: false
      boltdb_shipper:
        shared_store: s3
        cache_ttl: 24h
    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 24h
      max_entries_limit_per_query: 5000
      retention_period: 24h
      max_query_lookback: 24h
    compactor:
      working_directory: /data/loki/boltdb-shipper-compactor
      shared_store: filesystem
      retention_enabled: true
      delete_request_cancel_period: 24h
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      compaction_interval: 24h
    table_manager:
      retention_deletes_enabled: true
      retention_period: 24h
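
One thing that stands out in the config above: the index is shipped to S3 (boltdb_shipper.shared_store: s3) while the compactor points at the filesystem. The compactor's shared_store generally has to match the store that holds the shipped index, so a corrected sketch of that block (keeping the other values as given) would be:

    compactor:
      working_directory: /data/loki/boltdb-shipper-compactor
      shared_store: s3              # match boltdb_shipper.shared_store
      retention_enabled: true
      delete_request_cancel_period: 24h
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      compaction_interval: 24h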
younsl commented 3 weeks ago

Here's my loki-distributed chart configuration, backed by S3 storage. You need to set delete_request_store in the compactor config as well.

[!NOTE] compactor.delete_request_store should be set to configure the store for delete requests. This is required when retention is enabled. See loki's retention doc.

# charts/loki-distributed/values.yaml
loki:
  config: |
    compactor:
      shared_store: s3
      working_directory: /var/loki/compactor
      compaction_interval: 10m
      retention_enabled: true
      retention_delete_delay: 2h
      retention_delete_worker_count: 150
      delete_request_store: s3

    limits_config:
      enforce_metric_name: false
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      max_cache_freshness_per_query: 10m
      split_queries_by_interval: 15m
      retention_period: 7d
      ingestion_rate_mb: 20
      ingestion_burst_size_mb: 30

Reference

issue #9207