grafana / loki

Like Prometheus, but for logs.
https://grafana.com/loki
GNU Affero General Public License v3.0

runtime error: slice bounds out of range #14345

Open ohtyap opened 1 week ago

ohtyap commented 1 week ago

Describe the bug When querying certain timeframes within Loki, loki-read crashes (see stacktrace output below). Sadly, I do not have more info than the stacktrace; I was not able to see any pattern in when this happens. It happens for different labels (e.g. completely different types of logs), at different times, etc. Any idea how to debug this further?

To Reproduce I am not able to deliberately reproduce the issue ... it just happens from time to time (and basically completely breaks the usage of Loki).

Expected behavior loki not crashing 😅

Environment:

Screenshots, Promtail config, or terminal output


panic: runtime error: slice bounds out of range [:1360462] with capacity 1319365

goroutine 11486 [running]:
github.com/grafana/loki/v3/pkg/chunkenc.newByteChunk({0xc039b8019a, 0x141fc5, 0x1421c5}, 0x0, 0x0, 0x0)
	/src/loki/pkg/chunkenc/memchunk.go:479 +0xd9e
github.com/grafana/loki/v3/pkg/chunkenc.NewByteChunk(...)
	/src/loki/pkg/chunkenc/memchunk.go:392
github.com/grafana/loki/v3/pkg/chunkenc.(*Facade).UnmarshalFromBuf(0xc00e86f4a0, {0xc039b8019a?, 0x35a2f00?, 0x4da0e40?})
	/src/loki/pkg/chunkenc/facade.go:64 +0x34
github.com/grafana/loki/v3/pkg/storage/chunk.(*Chunk).Decode(0xc00f757b10, 0xc00ab8dd38, {0xc039b80000, 0x14215f, 0x14235f})
	/src/loki/pkg/storage/chunk/chunk.go:359 +0x596
github.com/grafana/loki/v3/pkg/storage/chunk/client.(*client).getChunk(_, {_, _}, _, {{0x33f196330e3b21ab, {0xc00b1cf338, 0x4}, 0x19247502928, 0x1924761e382, 0x2e112952}, ...})
	/src/loki/pkg/storage/chunk/client/object_client.go:187 +0x5ca
github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks.func2()
	/src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:48 +0x1de
created by github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks in goroutine 11537
	/src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:45 +0x485
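
For context on the numbers in the panic: the decoder tried to slice to offset 1360462 in a buffer with capacity 1319365, which suggests the chunk's encoded lengths/offsets point past the end of the bytes actually fetched from object storage (e.g. a truncated or corrupted object). The following is a minimal, self-contained sketch of that failure mode; it is not Loki's newByteChunk, and the header layout is invented purely for illustration:

package main

import (
	"encoding/binary"
	"fmt"
)

// decodeChunk mimics a decoder that trusts a length field stored inside the
// buffer and slices accordingly. This is NOT Loki's implementation; it only
// illustrates why a buffer shorter than its encoded length claims panics
// with "slice bounds out of range".
func decodeChunk(b []byte) []byte {
	payloadLen := binary.BigEndian.Uint32(b[:4]) // header: claimed payload length
	// If the object was truncated, payloadLen exceeds len(b) and this slice
	// expression panics, much like memchunk.go:479 in the trace above.
	return b[4 : 4+payloadLen]
}

func main() {
	defer func() { fmt.Println("recovered:", recover()) }()
	buf := make([]byte, 8)
	binary.BigEndian.PutUint32(buf, 1360462) // header claims ~1.3 MB of payload
	_ = decodeChunk(buf)                     // only 4 payload bytes actually exist
}
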
Jayclifford345 commented 1 week ago

Hi @ohtyap, can you provide some more information for the team about your deployment:

That should help us out.

ohtyap commented 1 week ago

@Jayclifford345 thanks a lot, sure, here is the info:

Version: We encountered the problem with Loki 3.1.0 and have since upgraded to 3.1.1 (it happens there too). But I am not sure if it is really a 3.1.x problem.

Deployment Type: Simple Scalable (on AWS with S3 storage)

The query and range of data: It happens every time Loki tries to retrieve the affected log lines/chunks 🤷 Basically a {pod="xyz"} is enough to trigger the error if an "infected" timeframe is selected. In some cases we were able to narrow the affected timeframe down to a few seconds. If someone selects such a timeframe (even just a 1-second window), the crash happens.

The loki config:

loki:
  server:
    grpc_server_max_recv_msg_size: 1.048576e+08
    grpc_server_max_send_msg_size: 1.048576e+08
  storage:
    type: 's3'
    s3:
      region: XYZ
    bucketNames:
      chunks: XYZ
      ruler: XYZ
      admin: XYZ
  limits_config:
    max_line_size: 512KB
    max_query_length: 0
    ingestion_rate_mb: 24
    ingestion_burst_size_mb: 32
    reject_old_samples_max_age: 30d
    split_queries_by_interval: 1h
    query_timeout: 10m
    tsdb_max_query_parallelism: 100
    retention_period: 90d
    retention_stream:
      - selector: '{channel="XYZ"}'
        period: 372d
        priority: 100

  schemaConfig:
    configs:
      - from: 2024-04-30
        store: tsdb
        object_store: aws
        schema: v13
        index:
          prefix: index_
          period: 24h
  querier:
    max_concurrent: 16

  query_scheduler:
    max_outstanding_requests_per_tenant: 3276

  compactor:
    retention_enabled: true
    working_directory: /retention
    delete_request_store: s3
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    compaction_interval: 10m

  ingester:
    chunk_encoding: snappy
    autoforget_unhealthy: true

axelbodo commented 1 week ago

I can deterministically reproduce this, and I get the following error in all queriers right before the crash loop: panic: runtime error: slice bounds out of range [:837727] with capacity 760512

goroutine 12813 [running]:
github.com/grafana/loki/v3/pkg/chunkenc.newByteChunk({0xc02f7001f8, 0xb98c0, 0xb9ac0}, 0x0, 0x0, 0x0)
	/src/loki/pkg/chunkenc/memchunk.go:479 +0xdbe
github.com/grafana/loki/v3/pkg/chunkenc.NewByteChunk(...)
	/src/loki/pkg/chunkenc/memchunk.go:392
github.com/grafana/loki/v3/pkg/chunkenc.(*Facade).UnmarshalFromBuf(0xc02f04a2d0, {0xc02f7001f8?, 0x3260580?, 0x4944280?>
	/src/loki/pkg/chunkenc/facade.go:64 +0x34
github.com/grafana/loki/v3/pkg/storage/chunk.(*Chunk).Decode(0xc02d1b7b10, 0xc00079d570, {0xc02f700000, 0xb9ab8, 0xb9cb>
	/src/loki/pkg/storage/chunk/chunk.go:359 +0x5a2
github.com/grafana/loki/v3/pkg/storage/chunk/client.(*client).getChunk(_, {_, _}, _, {{0x4246a33a92bd0a8b, {0xc02f04c96>
	/src/loki/pkg/storage/chunk/client/object_client.go:187 +0x34f
github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks.func2()
	/src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:48 +0x1c4
created by github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks in goroutine 12716
	/src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:45 +0x485

sum(count_over_time( { service_name="logstash-loki-pipeline" , loki_log_group="k8s-pod" } | logfmt [1m])) by(namespace)

It happens on a 16-hour range and on an 8-hour range as well, but not on a 4-hour range; maybe it is not the range itself that counts, but the data in the extra range.

The version we use: b4f7181c7 (HEAD, tag: v3.0.0)
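
One way to automate the range-narrowing described above is to bisect the time window against Loki's /loki/api/v1/query_range HTTP endpoint until the smallest window that still crashes the queriers is found. A rough sketch under stated assumptions: the endpoint and its query/start/end/limit parameters are Loki's documented API, but the address, the selector, and the "treat 5xx or connection errors as a failing window" heuristic are illustrative only:

package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

const lokiAddr = "http://localhost:3100" // adjust to your query frontend / loki-read address

// queryFails reports whether a range query over [start, end) errors out,
// which we take as a sign that the window contains an affected chunk.
func queryFails(query string, start, end time.Time) bool {
	params := url.Values{}
	params.Set("query", query)
	params.Set("start", fmt.Sprintf("%d", start.UnixNano()))
	params.Set("end", fmt.Sprintf("%d", end.UnixNano()))
	params.Set("limit", "1")

	resp, err := http.Get(lokiAddr + "/loki/api/v1/query_range?" + params.Encode())
	if err != nil {
		return true // e.g. connection reset while the querier restarts
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body)
	return resp.StatusCode >= 500
}

func main() {
	query := `{pod="xyz"}` // any selector known to hit the affected chunks
	start := time.Now().Add(-16 * time.Hour)
	end := time.Now()

	// Halve the window as long as one half still fails on its own; stop at ~1s.
	for end.Sub(start) > time.Second {
		mid := start.Add(end.Sub(start) / 2)
		switch {
		case queryFails(query, start, mid):
			end = mid
		case queryFails(query, mid, end):
			start = mid
		default:
			fmt.Println("neither half fails on its own; stopping at", start, "-", end)
			return
		}
	}
	fmt.Println("smallest failing window:", start.Format(time.RFC3339Nano), "to", end.Format(time.RFC3339Nano))
}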

Jayclifford345 commented 6 days ago

Thank you all for the extra information. I will raise this at the Loki engineer call tomorrow.

axelbodo commented 3 days ago

Thank you! I narrowed the query range down to 1s and the over_time interval to 10ms, and some of the queriers produced this panic; however, when only 1 or 2 are panicking, Grafana doesn't show "context cancelled" in the UI, so it surely relates to some chunks in that interval.

ohtyap commented 2 days ago

@Jayclifford345 Is there anything we can help with, like providing additional information or checking something specific?

Jayclifford345 commented 2 days ago

Hi @ohtyap, sorry for the late reply; we sadly didn't have the engineer call this week since it's focus week. I will make sure the team is aware on Monday so they can take a look.

ohtyap commented 2 days ago

@Jayclifford345 No worries; I suspected it would be hard to reproduce. So please ping me if there is anything we can check or try, or any other way we can help out on this one. But I hope there is a solution or a fix, as otherwise Loki is sadly not usable for us (this bug makes usage quite unreliable: broken dashboards, alarms, etc.) 🤞