grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

Loki returns 503 when receiving logs from OtelCollector #9591

Open ghost opened 1 year ago

ghost commented 1 year ago

Hi Team, I am having trouble with Loki.

I am running Loki with docker-compose on Ubuntu 22.04. Everything works fine up to the point where Loki starts, but as soon as Loki receives logs from the OpenTelemetry Collector it returns a 503 error. I would like to know why the 503 occurs and how to solve it.

docker-compose.yaml

```yaml
version: '3'
services:
  loki:
    image: grafana/loki
    container_name: loki
    restart: always
    volumes:
      - ./loki/local-config.yaml:/etc/loki/local-config.yaml
    command:
      - "-config.file=/etc/loki/local-config.yaml"
    ports:
      - 3100:3100
    networks:
      - my-network
  otelcollector:
    image: otel/opentelemetry-collector-contrib
    container_name: otelcollector
    restart: always
    ports:
      - 4317:4317 # gRPC port
      - 8887:8887 # HTTP port
      - 9200:9200
    volumes:
      - ./otelcollector:/etc/otelcollector
    command:
      - "--config=/etc/otelcollector/otel-collector-config.yaml"
    networks:
      - my-network
networks:
  my-network:
    driver: bridge
```
loki/local-config.yaml

```yaml
auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

common:
  instance_addr: 127.0.0.1
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory

query_range:
  results_cache:
    cache:
      embedded_cache:
        enabled: true
        max_size_mb: 100

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/usagestats/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false
```
otelcollector/otel-collector-config.yaml

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: otelcollector:4317
      http:
        endpoint: otelcollector:8887

exporters:
  loki:
    endpoint: "http://loki:3100/loki/api/v1/push"

processors:
  batch:

service:
  pipelines:
    logs:
      receivers: [otlp]
      exporters: [loki]
      processors: [batch]
```

Logs

```
Attaching to loki, otelcollector
otelcollector  | 2023-06-01T06:08:33.701Z       info    service/telemetry.go:104        Setting up own telemetry...
otelcollector  | 2023-06-01T06:08:33.702Z       info    service/telemetry.go:127        Serving Prometheus metrics      {"address": ":8888", "level": "Basic"}
otelcollector  | 2023-06-01T06:08:33.702Z       info    lokiexporter@v0.78.0/exporter.go:38     using the new Loki exporter     {"kind": "exporter", "data_type": "logs", "name": "loki"}
otelcollector  | 2023-06-01T06:08:33.702Z       info    service/service.go:131  Starting otelcol-contrib...     {"Version": "0.78.0", "NumCPU": 4}
otelcollector  | 2023-06-01T06:08:33.702Z       info    extensions/extensions.go:30     Starting extensions...
otelcollector  | 2023-06-01T06:08:33.703Z       info    otlpreceiver@v0.78.2/otlp.go:83 Starting GRPC server    {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "otelcollector:4317"}
otelcollector  | 2023-06-01T06:08:33.704Z       info    otlpreceiver@v0.78.2/otlp.go:101        Starting HTTP server    {"kind": "receiver", "name": "otlp", "data_type": "metrics", "endpoint": "otelcollector:8887"}
otelcollector  | 2023-06-01T06:08:33.704Z       info    service/service.go:148  Everything is ready. Begin running and processing data.
loki           | level=warn ts=2023-06-01T06:08:33.7396674Z caller=loki.go:286 msg="per-tenant timeout not configured, using default engine timeout (\"5m0s\"). This behavior will change in the next major to always use the default per-tenant timeout (\"5m\")."
loki           | level=info ts=2023-06-01T06:08:33.7424693Z caller=main.go:108 msg="Starting Loki" version="(version=2.8.2, branch=HEAD, revision=9f809eda7)"
loki           | level=info ts=2023-06-01T06:08:33.7441867Z caller=server.go:323 http=[::]:3100 grpc=[::]:9096 msg="server listening on addresses"
loki           | level=warn ts=2023-06-01T06:08:33.7473307Z caller=cache.go:114 msg="fifocache config is deprecated. use embedded-cache instead"
loki           | level=warn ts=2023-06-01T06:08:33.7473499Z caller=experimental.go:20 msg="experimental feature in use" feature="In-memory (FIFO) cache - chunksembedded-cache"
loki           | level=info ts=2023-06-01T06:08:33.7482197Z caller=table_manager.go:404 msg="loading local table index_19509"
loki           | level=info ts=2023-06-01T06:08:33.7480518Z caller=table_manager.go:134 msg="uploading tables"
loki           | ts=2023-06-01T06:08:33.7486009Z caller=spanlogger.go:85 level=info msg="building index list cache"
loki           | ts=2023-06-01T06:08:33.7487229Z caller=spanlogger.go:85 level=info msg="index list cache built" duration=38.7µs
loki           | level=info ts=2023-06-01T06:08:33.7488627Z caller=table_manager.go:262 msg="query readiness setup completed" duration=2.3µs distinct_users_len=0
loki           | level=info ts=2023-06-01T06:08:33.7488919Z caller=shipper.go:131 msg="starting index shipper in RW mode"
loki           | level=info ts=2023-06-01T06:08:33.7498178Z caller=shipper_index_client.go:78 msg="starting boltdb shipper in RW mode"
loki           | level=info ts=2023-06-01T06:08:33.7498623Z caller=table_manager.go:166 msg="handing over indexes to shipper"
loki           | level=info ts=2023-06-01T06:08:33.750538Z caller=worker.go:112 msg="Starting querier worker using query-scheduler and scheduler ring for addresses"
loki           | level=info ts=2023-06-01T06:08:33.7519757Z caller=mapper.go:47 msg="cleaning up mapped rules directory" path=/tmp/loki/rules-temp
loki           | level=info ts=2023-06-01T06:08:33.7541099Z caller=module_service.go:82 msg=initialising module=cache-generation-loader
loki           | level=info ts=2023-06-01T06:08:33.7542565Z caller=module_service.go:82 msg=initialising module=server
loki           | level=info ts=2023-06-01T06:08:33.7544364Z caller=module_service.go:82 msg=initialising module=query-frontend-tripperware
loki           | level=info ts=2023-06-01T06:08:33.7544695Z caller=module_service.go:82 msg=initialising module=memberlist-kv
loki           | level=info ts=2023-06-01T06:08:33.7544772Z caller=module_service.go:82 msg=initialising module=ring
loki           | level=info ts=2023-06-01T06:08:33.7546238Z caller=module_service.go:82 msg=initialising module=store
loki           | level=info ts=2023-06-01T06:08:33.7545296Z caller=ring.go:263 msg="ring doesn't exist in KV store yet"
loki           | level=info ts=2023-06-01T06:08:33.7550028Z caller=client.go:255 msg="value is nil" key=collectors/ring index=1
loki           | level=info ts=2023-06-01T06:08:33.7550159Z caller=module_service.go:82 msg=initialising module=usage-report
loki           | level=info ts=2023-06-01T06:08:33.7553057Z caller=module_service.go:82 msg=initialising module=ingester-querier
loki           | level=info ts=2023-06-01T06:08:33.7553631Z caller=module_service.go:82 msg=initialising module=query-scheduler
loki           | level=info ts=2023-06-01T06:08:33.7553681Z caller=module_service.go:82 msg=initialising module=distributor
loki           | level=info ts=2023-06-01T06:08:33.7553942Z caller=module_service.go:82 msg=initialising module=compactor
loki           | level=info ts=2023-06-01T06:08:33.7554746Z caller=ring.go:263 msg="ring doesn't exist in KV store yet"
loki           | level=info ts=2023-06-01T06:08:33.7554928Z caller=client.go:255 msg="value is nil" key=collectors/scheduler index=1
loki           | level=info ts=2023-06-01T06:08:33.7555049Z caller=module_service.go:82 msg=initialising module=ingester
loki           | level=info ts=2023-06-01T06:08:33.7555221Z caller=ingester.go:416 msg="recovering from checkpoint"
loki           | level=info ts=2023-06-01T06:08:33.7555595Z caller=recovery.go:40 msg="no checkpoint found, treating as no-op"
loki           | level=info ts=2023-06-01T06:08:33.7555919Z caller=module_service.go:82 msg=initialising module=ruler
loki           | level=info ts=2023-06-01T06:08:33.7556063Z caller=ruler.go:499 msg="ruler up and running"
loki           | level=info ts=2023-06-01T06:08:33.7556283Z caller=ingester.go:432 msg="recovered WAL checkpoint recovery finished" elapsed=117.4µs errors=false
loki           | level=info ts=2023-06-01T06:08:33.755632Z caller=lifecycler.go:547 msg="not loading tokens from file, tokens file path is empty"
loki           | level=info ts=2023-06-01T06:08:33.7556339Z caller=ingester.go:438 msg="recovering from WAL"
loki           | level=info ts=2023-06-01T06:08:33.7559892Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=d088725449eb ring=scheduler
loki           | level=info ts=2023-06-01T06:08:33.7559955Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
loki           | level=info ts=2023-06-01T06:08:33.756104Z caller=scheduler.go:616 msg="waiting until scheduler is JOINING in the ring"
loki           | level=info ts=2023-06-01T06:08:33.7561096Z caller=scheduler.go:620 msg="scheduler is JOINING in the ring"
loki           | level=info ts=2023-06-01T06:08:33.7561236Z caller=scheduler.go:630 msg="waiting until scheduler is ACTIVE in the ring"
loki           | level=info ts=2023-06-01T06:08:33.7561494Z caller=basic_lifecycler.go:261 msg="instance not found in the ring" instance=d088725449eb ring=compactor
loki           | level=info ts=2023-06-01T06:08:33.7561539Z caller=basic_lifecycler_delegates.go:63 msg="not loading tokens from file, tokens file path is empty"
loki           | level=info ts=2023-06-01T06:08:33.7562026Z caller=lifecycler.go:576 msg="instance not found in ring, adding with no tokens" ring=distributor
loki           | level=info ts=2023-06-01T06:08:33.7562336Z caller=lifecycler.go:416 msg="auto-joining cluster after timeout" ring=distributor
loki           | level=info ts=2023-06-01T06:08:33.7562377Z caller=ring.go:263 msg="ring doesn't exist in KV store yet"
loki           | level=info ts=2023-06-01T06:08:33.7562651Z caller=compactor.go:332 msg="waiting until compactor is JOINING in the ring"
loki           | level=info ts=2023-06-01T06:08:33.7565054Z caller=ingester.go:454 msg="WAL segment recovery finished" elapsed=994.2µs errors=false
loki           | level=info ts=2023-06-01T06:08:33.7565184Z caller=ingester.go:402 msg="closing recoverer"
loki           | level=info ts=2023-06-01T06:08:33.7565223Z caller=ingester.go:410 msg="WAL recovery finished" time=1.0111ms
loki           | level=info ts=2023-06-01T06:08:33.7566002Z caller=lifecycler.go:547 msg="not loading tokens from file, tokens file path is empty"
loki           | level=info ts=2023-06-01T06:08:33.7566201Z caller=lifecycler.go:576 msg="instance not found in ring, adding with no tokens" ring=ingester
loki           | level=info ts=2023-06-01T06:08:33.7567227Z caller=lifecycler.go:416 msg="auto-joining cluster after timeout" ring=ingester
loki           | level=info ts=2023-06-01T06:08:33.7567752Z caller=wal.go:156 msg=started component=wal
loki           | level=info ts=2023-06-01T06:08:33.8590963Z caller=compactor.go:336 msg="compactor is JOINING in the ring"
loki           | level=info ts=2023-06-01T06:08:33.8595882Z caller=compactor.go:346 msg="waiting until compactor is ACTIVE in the ring"
loki           | level=info ts=2023-06-01T06:08:33.8653174Z caller=scheduler.go:634 msg="scheduler is ACTIVE in the ring"
loki           | level=info ts=2023-06-01T06:08:33.8654611Z caller=module_service.go:82 msg=initialising module=querier
loki           | level=info ts=2023-06-01T06:08:33.8654965Z caller=module_service.go:82 msg=initialising module=query-frontend
loki           | level=info ts=2023-06-01T06:08:34.0419808Z caller=compactor.go:350 msg="compactor is ACTIVE in the ring"
loki           | level=info ts=2023-06-01T06:08:34.0434821Z caller=loki.go:499 msg="Loki started"
loki           | level=info ts=2023-06-01T06:08:36.8666389Z caller=scheduler.go:681 msg="this scheduler is in the ReplicationSet, will now accept requests."
loki           | level=info ts=2023-06-01T06:08:36.8813073Z caller=worker.go:209 msg="adding connection" addr=127.0.0.1:9096
loki           | level=info ts=2023-06-01T06:08:39.044454Z caller=compactor.go:411 msg="this instance has been chosen to run the compactor, starting compactor"
loki           | level=info ts=2023-06-01T06:08:39.0449099Z caller=compactor.go:440 msg="waiting 10m0s for ring to stay stable and previous compactions to finish before starting compactor"
loki           | level=info ts=2023-06-01T06:08:43.8663465Z caller=frontend_scheduler_worker.go:107 msg="adding connection to scheduler" addr=127.0.0.1:9096
otelcollector  | 2023-06-01T06:09:24.554Z       info    exporterhelper/queued_retry.go:423      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "logs", "name": "loki", "error": "HTTP 503 \"Service Unavailable\": ", "interval": "5.829304448s"}
loki           | level=info ts=2023-06-01T06:09:33.7485215Z caller=table_manager.go:134 msg="uploading tables"
loki           | level=info ts=2023-06-01T06:09:33.751003Z caller=table_manager.go:166 msg="handing over indexes to shipper"
otelcollector  | 2023-06-01T06:09:38.348Z       info    exporterhelper/queued_retry.go:423      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "logs", "name": "loki", "error": "HTTP 503 \"Service Unavailable\": ", "interval": "14.438544477s"}
otelcollector  | 2023-06-01T06:09:52.793Z       info    exporterhelper/queued_retry.go:423      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "logs", "name": "loki", "error": "HTTP 503 \"Service Unavailable\": ", "interval": "10.34424494s"}
```

rehnerik commented 5 months ago

@aoxt323 I'm experiencing the same issue. Did you solve it?

ghost commented 4 months ago

@rehnerik I was in a hurry, so I recreated the environment, and the problem has not occurred since.

Vinaum8 commented 2 months ago

+1

L-Henke commented 1 month ago

I was able to fix the same issue by limiting the batch size of the OpenTelemetry Collector. I hit it with a tenant that had a high log volume and therefore large batch sizes, while the other tenants were handled just fine.

dmfrey commented 1 month ago

@L-Henke can you be more specific about what you changed? I've started experiencing this as well.

L-Henke commented 1 month ago

I reduced the send_batch_size of the OpenTelemetry Collector batch processor from the default 8192 to something like 1000, and I set send_batch_max_size to the same value.
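For anyone else landing here, a minimal sketch of that change in the collector config (1000 is just the ballpark value mentioned above; tune it to your log volume):

```yaml
processors:
  batch:
    # default is 8192; smaller batches keep each push to Loki smaller
    send_batch_size: 1000
    # without this cap, a batch can still exceed send_batch_size when a
    # single incoming request is larger than the configured batch size
    send_batch_max_size: 1000

service:
  pipelines:
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [loki]
```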

mveitas commented 2 weeks ago

We are seeing intermittent 503s returned by the gateway and are unable to track them down. We did turn down the batch size in the collector, which seems to have helped, but we still get random bursts of 503s for 2-3 minutes at a time. There isn't much in the internal Loki logs to explain why the 503 is being returned.