We are experiencing significant issues with our current Loki distributed setup, specifically with the ingesters' Persistent Volume Claims (PVCs) filling up almost immediately. Additionally, we are encountering frequent "SlowDown" errors from our S3 backend, indicating excessive request rates. Below is a detailed description of our setup and the observed errors, along with a request for suggestions on improving the configuration.
Setup Details
Loki Version: 2.9.4
Chart Version: 0.79.1 (loki-distributed)
Deployment Type: Loki Distributed
Number of Ingesters: 5
PVC Size per Ingester: 90GB
Configuration

Observed Errors

level=error ts=2024-07-11T17:50:00.996031395Z caller=flush.go:143 org_id=fake msg="failed to flush" err="failed to flush chunks: store put chunk: SlowDown: Please reduce your request rate.\n\tstatus code: 503, request id: XXXXXX, host id: XXXXXXXXXXXXXXXX, num_chunks: 3, labels: {app=\"Abc\", cluster=\"prod\", component=\"ingester\", container=\"ingester\", filename=\"/var/log/pods/mimir-xxxxxxx-3_XXXXXX/ingester/0.log\", instance=\"mimir-production\", job=\"mimir-production/mimir\", namespace=\"xxxx\", node_name=\"ip-10-XX-XX-XX.ec2.internal\", pod=\"mimir-xxxx-xxxxxx\", stream=\"stderr\"}"
Current Issues
Immediate filling of ingesters' PVCs: This leads to storage issues and potential data loss.
Frequent "SlowDown" errors from the S3 backend: These errors indicate that our request rate is too high for the S3 service.
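For context, this is the direction we are considering for tuning. The field names are taken from the Loki configuration reference; the values below are illustrative guesses on our part, not a validated fix:

```yaml
# Illustrative sketch only -- values are guesses, not a validated fix.
ingester:
  chunk_idle_period: 30m       # flush chunks that stop receiving writes sooner
  chunk_target_size: 1572864   # ~1.5MB target; fewer, larger S3 PUTs
  max_chunk_age: 2h            # cap how long chunks accumulate on the ingester

storage_config:
  aws:
    backoff_config:            # retry with backoff on S3 503 SlowDown
      min_period: 100ms
      max_period: 10s
      max_retries: 10
```

Our understanding is that a larger chunk target size and longer idle period should reduce the number of PUT requests hitting S3, while the backoff config should smooth out the 503 retries, but we would appreciate confirmation that this is the right lever to pull.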
Is there any known issue with loki-distributed 2.9.4 that would cause this behavior?
If so, in which version was it fixed?
The following issues seem similar for loki-distributed 2.9.4:
https://github.com/grafana/loki/pull/11776
https://github.com/grafana/loki/pull/12456
https://grafana.slack.com/archives/CEPJRLQNL/p1715165059349319
Will upgrading to the latest loki-distributed chart, with Loki 2.9.8, fix this issue?