ohtyap opened 1 week ago
Hi @ohtyap, can you provide some more information for the team around your deployment? That should help us out.
@Jayclifford345 thanks a lot, sure, here is the info:

Version: We encountered the problem with Loki 3.1.0; we upgraded in the meantime to 3.1.1 (it happens there too). But I am not sure if it is really a 3.1.x problem.
Deployment Type: Simple Scalable (on AWS with S3 storage)
The query and range of data: It happens every time Loki wants to retrieve the affected log lines/chunks 🤷 Basically a {pod="xyz"} is enough to trigger the error if an "infected" timeframe is selected. In some cases we were able to narrow the affected timeframe down to a few seconds. If someone selects that timeframe (even just a 1s window), the crash happens.
The loki config:
```yaml
loki:
  server:
    grpc_server_max_recv_msg_size: 1.048576e+08
    grpc_server_max_send_msg_size: 1.048576e+08
  storage:
    type: 's3'
    s3:
      region: XYZ
    bucketNames:
      chunks: XYZ
      ruler: XYZ
      admin: XYZ
  limits_config:
    max_line_size: 512KB
    max_query_length: 0
    ingestion_rate_mb: 24
    ingestion_burst_size_mb: 32
    reject_old_samples_max_age: 30d
    split_queries_by_interval: 1h
    query_timeout: 10m
    tsdb_max_query_parallelism: 100
    retention_period: 90d
    retention_stream:
      - selector: '{channel="XYZ"}'
        period: 372d
        priority: 100
  schemaConfig:
    configs:
      - from: 2024-04-30
        store: tsdb
        object_store: aws
        schema: v13
        index:
          prefix: index_
          period: 24h
  querier:
    max_concurrent: 16
  query_scheduler:
    max_outstanding_requests_per_tenant: 3276
  compactor:
    retention_enabled: true
    working_directory: /retention
    delete_request_store: s3
    retention_delete_delay: 2h
    retention_delete_worker_count: 150
    compaction_interval: 10m
  ingester:
    chunk_encoding: snappy
    autoforget_unhealthy: true
```
I can deterministically reproduce this, and I get the following error in all queriers right before the crash loop:

```
panic: runtime error: slice bounds out of range [:837727] with capacity 760512

goroutine 12813 [running]:
github.com/grafana/loki/v3/pkg/chunkenc.newByteChunk({0xc02f7001f8, 0xb98c0, 0xb9ac0}, 0x0, 0x0, 0x0)
        /src/loki/pkg/chunkenc/memchunk.go:479 +0xdbe
github.com/grafana/loki/v3/pkg/chunkenc.NewByteChunk(...)
        /src/loki/pkg/chunkenc/memchunk.go:392
github.com/grafana/loki/v3/pkg/chunkenc.(*Facade).UnmarshalFromBuf(0xc02f04a2d0, {0xc02f7001f8?, 0x3260580?, 0x4944280?>
        /src/loki/pkg/chunkenc/facade.go:64 +0x34
github.com/grafana/loki/v3/pkg/storage/chunk.(*Chunk).Decode(0xc02d1b7b10, 0xc00079d570, {0xc02f700000, 0xb9ab8, 0xb9cb>
        /src/loki/pkg/storage/chunk/chunk.go:359 +0x5a2
github.com/grafana/loki/v3/pkg/storage/chunk/client.(*client).getChunk(, {, }, , {{0x4246a33a92bd0a8b, {0xc02f04c96>
        /src/loki/pkg/storage/chunk/client/object_client.go:187 +0x34f
github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks.func2()
        /src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:48 +0x1c4
created by github.com/grafana/loki/v3/pkg/storage/chunk/client/util.GetParallelChunks in goroutine 12716
        /src/loki/pkg/storage/chunk/client/util/parallel_chunk_fetch.go:45 +0x485
```
The query:

```
sum(count_over_time({ service_name="logstash-loki-pipeline", loki_log_group="k8s-pod" } | logfmt [1m])) by (namespace)
```
It happens on a 16-hour range and on an 8-hour range as well, but not on a 4-hour range. Maybe it is not the range itself that matters, but the data in the extra range.
The version we use: b4f7181c7 (HEAD, tag: v3.0.0)
Thank you all for the extra information. I will raise this at the Loki engineer call tomorrow.
Thank you! I narrowed the query range down to 1s and the over_time interval to 10ms, and some of the queriers still produced this panic. However, when only 1 or 2 queriers panic, Grafana doesn't show "context cancelled" in the UI. So it surely relates to some chunks in that interval.
@Jayclifford345 Is there anything we can help with - like providing additional information or something specific to check?
Hi @ohtyap, sorry for the late reply; we sadly didn't have the engineer call this week since it's focus week. I will make sure the team is aware on Monday and takes a look.
@Jayclifford345 No worries; I suspected it would be hard to reproduce. So please ping me in case we can check or try something, or help out in any other way. I hope there is a solution or a fix, as otherwise Loki is sadly not usable for us (this bug makes usage quite unreliable: broken dashboards, alarms, etc.) 🤞
Describe the bug
When querying certain timeframes within Loki, the `loki-read` component crashes (see stacktrace output below). Sadly, I do not have more info than the stacktrace; I was not able to see any pattern in when this happens. It happens for different labels (e.g. completely different types of logs), at different times, etc. Any idea how to debug this further?

To Reproduce
I am not able to deliberately reproduce the issue ... it just happens from time to time (and basically completely breaks the usage of Loki).

Expected behavior
Loki not crashing 😅
Environment:
Screenshots, Promtail config, or terminal output