"Error loading cache generation numbers" came out on Loki-read

duj4 commented 3 months ago

Describe the bug

I am running Loki 3.1.0 in SSD mode with retention_enabled set to true. After the stack has been up and running for a while, the loki-read pod starts logging the error below:

Following https://grafana.com/docs/loki/latest/operations/troubleshooting/#cache-generation-errors, I checked the loki_delete_cache_gen_load_failures_total metric and found that it is greater than 1. The docs say this requires setting allow_deletes to true, but that flag is marked as deprecated in the current version, and its replacement, deletion_mode, is already set to filter-and-delete.

compactor:

limits:

If deletion_mode has been set in limits_config, do I have to set it again in runtime_config for each tenant? And since allow_deletes has been marked as deprecated, do I still need to set it to true?
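For context on the first question: as far as I understand Loki's limits handling, values in limits_config act as the defaults for every tenant, and the runtime_config file only needs an entry for tenants whose value should differ. A minimal sketch of such an override, assuming a hypothetical tenant ID `tenant-a` (not taken from this setup) and the `overrides` layout of the runtime configuration file:

```yaml
# runtime-config.yaml (sketch; "tenant-a" is a hypothetical tenant ID)
overrides:
  tenant-a:
    # Only needed if this tenant should differ from limits_config.deletion_mode
    deletion_mode: filter-and-delete
```

If every tenant should use the same mode, the single deletion_mode entry in limits_config should be enough under that assumption.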

To Reproduce

Steps to reproduce the behavior:

1. Start Loki in SSD mode with retention enabled.
2. Wait for a while and check the logs of the loki-read pod.

Expected behavior

No error should be logged if filter-and-delete is set correctly.

Environment:

Infrastructure: K8S
Deployment tool: helm-chart

Hitesh-Agrawal commented 3 months ago

I am facing this error too, using the grafana/loki Helm chart 6.8.0 with app version 3.1.0.

level=error ts=2024-08-08T05:23:38.230753658Z caller=http.go:107 msg="error getting delete requests from the store" err="unexpected status code: 404"
ts=2024-08-08T05:23:38.230776322Z caller=spanlogger.go:109 user=fake level=error msg="failed loading deletes for user" err="unexpected status code: 404"

The Loki config is below:

    auth_enabled: false
    chunk_store_config:
      chunk_cache_config:
        embedded_cache:
          enabled: false
        memcached:
          batch_size: 100
          expiration: 30m
          parallelism: 100
        memcached_client:
          consistent_hash: true
          host: memcached-chunk.loki.svc.cluster.local
          service: memcached-chunk
      write_dedupe_cache_config:
        memcached:
          batch_size: 100
          expiration: 30m
          parallelism: 100
        memcached_client:
          consistent_hash: true
          host: memcached-write.loki.svc.cluster.local
          service: memcached-write
    common:
      compactor_address: http://loki-read:3100
      path_prefix: /var/loki
      replication_factor: 1
      ring:
        kvstore:
          store: memberlist
      storage:
        s3:
          bucketnames: loki-data
          insecure: false
          region: eu-central-1
          s3forcepathstyle: false
    compactor:
      delete_request_store: s3
      retention_enabled: true
    frontend:
      compress_responses: true
      log_queries_longer_than: 20s
      max_outstanding_per_tenant: 4096
    frontend_worker:
      grpc_client_config:
        max_recv_msg_size: 50331648
        max_send_msg_size: 50331648
    ingester:
      chunk_encoding: snappy
      chunk_idle_period: 15m
      chunk_retain_period: 30s
      chunk_target_size: 1572864
      max_chunk_age: 1h
    ingester_client:
      grpc_client_config:
        grpc_compression: snappy
        max_recv_msg_size: 50331648
        max_send_msg_size: 50331648
    limits_config:
      ingestion_burst_size_mb: 1000
      ingestion_rate_mb: 1000
      max_cache_freshness_per_query: 10m
      max_query_parallelism: 2
      max_query_series: 2000
      per_stream_rate_limit: 20MB
      per_stream_rate_limit_burst: 20MB
      query_timeout: 2m
      reject_old_samples: true
      reject_old_samples_max_age: 168h
      retention_period: 8760h
      split_queries_by_interval: 15m
      deletion_mode: filter-and-delete
    memberlist:
      join_members:
      - loki-memberlist
    querier:
      max_concurrent: 4096
      query_ingesters_within: 2h
    query_range:
      align_queries_with_step: true
      cache_results: true
      max_retries: 5
      results_cache:
        cache:
          memcached_client:
            consistent_hash: true
            host: memcached-result.loki.svc.cluster.local
            max_idle_conns: 16
            service: memcached-result
            timeout: 500ms
            update_interval: 1m
    query_scheduler:
      use_scheduler_ring: false
    ruler:
      storage:
        s3:
          bucketnames: loki-data
          insecure: false
          region: eu-central-1
          s3forcepathstyle: false
        type: s3
    runtime_config:
      file: /etc/loki/runtime-config/runtime-config.yaml
    schema_config:
      configs:
      - chunks:
          period: 24h
          prefix: chunk
        from: "2023-02-10"
        index:
          period: 24h
          prefix: index
        object_store: s3
        schema: v11
        store: boltdb-shipper
      - chunks:
          period: 24h
          prefix: tsdbchunk
        from: "2023-02-15"
        index:
          period: 24h
          prefix: tsdbindex
        object_store: s3
        schema: v12
        store: tsdb
      - from: "2024-08-07"
        index:
          period: 24h
          prefix: index_
        object_store: s3
        schema: v13
        store: tsdb
    server:
      grpc_listen_port: 9095
      grpc_server_max_recv_msg_size: 6291456
      grpc_server_max_send_msg_size: 6291456
      http_listen_port: 3100
      http_server_idle_timeout: 120s
      http_server_read_timeout: 180s
      http_server_write_timeout: 180s
    storage_config:
      filesystem: null
      hedging:
        at: 250ms
        max_per_second: 20
        up_to: 3
      index_queries_cache_config:
        memcached:
          batch_size: 100
          expiration: 240m
          parallelism: 100
        memcached_client:
          consistent_hash: true
          host: memcached-index.loki.svc.cluster.local
          service: memcached-index
    table_manager:
      retention_deletes_enabled: true
      retention_period: 8760h

duj4 commented 3 months ago

@Hitesh-Agrawal it seems that your error is different from mine. Which mode are you using to deploy your Loki stack?

Hitesh-Agrawal commented 2 months ago

@duj4 I am running it in deploymentMode: SimpleScalable. The storage is in aws s3.

duj4 commented 2 months ago

ok, if that is the case, I think you may need to modify your compactor address per https://github.com/grafana/loki/blob/9315b3d03d790506cf8e69fb7407b476de9d0ed6/production/helm/loki/templates/_helpers.tpl#L1000
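For reference, a minimal sketch of what that setting could look like, assuming a three-target SimpleScalable layout in which the compactor runs in the backend target and the Service is named `loki-backend` (both are assumptions for illustration, not taken from this thread):

```yaml
common:
  # Assumption: the compactor runs in the backend target.
  # "loki-backend" is a hypothetical Service name; port 3100 matches
  # http_listen_port in the config posted above.
  compactor_address: http://loki-backend:3100
```

In a layout without backend pods, the address would instead need to point at whichever target actually runs the compactor.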

Hitesh-Agrawal commented 2 months ago

@duj4 The compactor address is already set as needed. I am not using any backend targets and only have loki-read and loki-write pods with loki-gateway:

    common:
      compactor_address: http://loki-read:3100/

duj4 commented 2 months ago

@Hitesh-Agrawal ok, so it is a mixed config of SSD and distributed LOL, which is beyond my knowledge, sorry man

JStickler commented 1 week ago

Configuration questions have a better chance of being answered if you ask them on the community forums.

grafana / loki

"Error loading cache generation numbers" came out on Loki-read #13756