thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io

Thanos Store Panic #6934

Open nicolastakashi opened 7 months ago

nicolastakashi commented 7 months ago

Thanos, Prometheus and Golang version used:

Thanos: 0.32.5, Prometheus: 2.48

Object Storage Provider: AWS S3

What happened:

Using Thanos Store with the time filter, it crashes from time to time with the following message.

panic: runtime error: slice bounds out of range [:16000] with capacity 1282

What you expected to happen:

The store should not crash.

How to reproduce it (as minimally and precisely as possible):

Honestly, I have no idea; this is only happening in one shard.

Full logs to relevant components:

panic: runtime error: slice bounds out of range [:16000] with capacity 1290
goroutine 507203 [running]:
github.com/thanos-io/thanos/pkg/store.(*bucketChunkReader).loadChunks(0x400261e720, {0x26b43f8, 0x407d9c7540}, {0x406284b000, 0x10c, 0x400195e870?}, {0x4102db7748, 0x1, 0x2}, 0x3968440?, ...)
    /app/pkg/store/bucket.go:3342 +0xe1c
github.com/thanos-io/thanos/pkg/store.(*bucketChunkReader).load.func4()
    /app/pkg/store/bucket.go:3270 +0x74
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x58
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 505525
    /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:72 +0x98

Anything else we need to know:

yeya24 commented 7 months ago

@nicolastakashi Can you share the actual panic stacktrace? I think I saw something similar last week

MichaHoffmann commented 7 months ago

We have:

    bufPooled, err := r.block.chunkPool.Get(r.block.estimatedMaxChunkSize)
    if err == nil {
        buf = *bufPooled
    } else {
        buf = make([]byte, r.block.estimatedMaxChunkSize)
    }

and the crash happens here, a few lines later:

        chunkLen = r.block.estimatedMaxChunkSize
        if i+1 < len(pIdxs) {
            if diff = pIdxs[i+1].offset - pIdx.offset; int(diff) < chunkLen {
                chunkLen = int(diff)
            }
        }
->          cb := buf[:chunkLen]
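
For context, the panic message matches Go's runtime check on slice expressions: re-slicing a buffer beyond its capacity is a runtime error. A minimal standalone sketch of that failure mode, using the numbers from the report rather than any Thanos code:

    package main

    func main() {
        // Pretend the pool handed back a buffer with less capacity than estimatedMaxChunkSize.
        buf := make([]byte, 0, 1282) // capacity seen in the reported panic
        chunkLen := 16000            // requested length from the reported panic

        // Re-slicing past the capacity panics:
        // panic: runtime error: slice bounds out of range [:16000] with capacity 1282
        cb := buf[:chunkLen]
        _ = cb
    }

So either chunkLen overshoots the real chunk size, or the buffer coming out of the pool is smaller than expected.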
MichaHoffmann commented 7 months ago

Is the pool returning slices that are too small?
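
If that is the suspicion, a cheap way to guard against it, wherever the undersized buffer comes from, is to verify the capacity right after taking a buffer from the pool. A purely illustrative sketch with a generic sync.Pool (not the Thanos chunk pool):

    package main

    import (
        "fmt"
        "sync"
    )

    func main() {
        want := 16000 // bytes the caller expects to be able to slice off the buffer

        pool := sync.Pool{New: func() any { b := make([]byte, 0, want); return &b }}

        // Somewhere else, a smaller buffer finds its way back into the pool.
        small := make([]byte, 0, 1282)
        pool.Put(&small)

        buf := *pool.Get().(*[]byte)
        if cap(buf) < want {
            // Defensive: never trust the pooled buffer's capacity blindly.
            buf = make([]byte, 0, want)
        }
        fmt.Println(len(buf[:want])) // safe: 16000
    }

Whether the real chunk pool can actually hand back such a buffer is exactly what needs confirming here.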

nicolastakashi commented 7 months ago

I rolled back to 0.31 and it doesn't seem to happen on that version.

jon-rei commented 7 months ago

This is also happening to us. Thanos: v0.32.5 (deployed through the Bitnami Helm chart), Prometheus: v2.45.1

The problem only occurs on queries for somewhat older data, which rely solely on the store gateway's S3 data.

panic: runtime error: slice bounds out of range [:16000] with capacity 2422
goroutine 2641055 [running]:
github.com/thanos-io/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc24d3caae0, {0x2d0fc20, 0xc338675680}, {0xc362b24000, 0x9, 0x0?}, {0xc26a0f8f80, 0x1, 0x2}, 0x1, ...)
/bitnami/blacksmith-sandox/thanos-0.32.5/src/github.com/thanos-io/thanos/pkg/store/bucket.go:3342 +0x11e5
github.com/thanos-io/thanos/pkg/store.(*bucketChunkReader).load.func4()
/bitnami/blacksmith-sandox/thanos-0.32.5/src/github.com/thanos-io/thanos/pkg/store/bucket.go:3270 +0xff
golang.org/x/sync/errgroup.(*Group).Go.func1()
/bitnami/blacksmith-sandox/thanos-0.32.5/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
/bitnami/blacksmith-sandox/thanos-0.32.5/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:72 +0xa5

I can also provide traces if that would be helpful.

MichaHoffmann commented 7 months ago

How is the store gateway configured?

nicolastakashi commented 7 months ago

@MichaHoffmann This is mine.

- store
- '--log.level=warn'
- '--log.format=json'
- '--grpc-address=0.0.0.0:10901'
- '--http-address=0.0.0.0:10902'
- '--data-dir=/data'
- '--objstore.config-file=/conf/objstore.yml'
- '--index-cache.config-file=/conf/index-cache.yml'
- '--max-time=-48h'
- '--min-time=-720h'
- '--grpc-grace-period=5s'
- '--store.enable-index-header-lazy-reader'
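
For readers following along, those two time flags are the time filter mentioned in the report. If I read them correctly, this shard serves only blocks roughly between 2 and 30 days old, which lines up with the observation above that only object-store-backed queries are affected:

    - '--max-time=-48h'   # newest data served by this shard: ~2 days old
    - '--min-time=-720h'  # oldest data served by this shard: ~30 days old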
jon-rei commented 7 months ago

This is our config:

- store
- '--log.level=info'
- '--log.format=logfmt'
- '--grpc-address=0.0.0.0:10901'
- '--http-address=0.0.0.0:10902'
- '--data-dir=/data'
- '--objstore.config-file=/conf/objstore.yml'
- |
  --tracing.config=type: OTLP
  config:
    client_type: grpc
    service_name: "thanos-storegateway"
    endpoint: 127.0.0.1:4317
    insecure: true
    compression: gzip

pauloconnor commented 3 months ago

We're also seeing this issue on one of our six store gateway sets.

Prometheus: v2.49.1, Thanos sidecar: v0.28.1, Thanos: v0.34.0

panic: runtime error: slice bounds out of range [:1027] with capacity 1024

goroutine 303830 [running]:
github.com/thanos-io/thanos/pkg/store.(*bucketChunkReader).loadChunks(0xc06bb0d980, {0x35ee848, 0xc1421085a0}, {0xc1d4e00000, 0x1849, 0x100000001?}, {0xc07006e2f8, 0x1, 0x2}, 0x2, ...)
    /bitnami/blacksmith-sandox/thanos-0.34.0/src/github.com/thanos-io/thanos/pkg/store/bucket.go:3532 +0x1118
github.com/thanos-io/thanos/pkg/store.(*bucketChunkReader).load.func4()
    /bitnami/blacksmith-sandox/thanos-0.34.0/src/github.com/thanos-io/thanos/pkg/store/bucket.go:3466 +0x11b
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /bitnami/blacksmith-sandox/thanos-0.34.0/pkg/mod/golang.org/x/sync@v0.5.0/errgroup/errgroup.go:75 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 295848
    /bitnami/blacksmith-sandox/thanos-0.34.0/pkg/mod/golang.org/x/sync@v0.5.0/errgroup/errgroup.go:72 +0x96
- name: storegateway
  image: docker.io/bitnami/thanos:0.34.0-debian-11-r0
  args:
    - store
    - '--log.level=info'
    - '--log.format=logfmt'
    - '--grpc-address=0.0.0.0:10901'
    - '--http-address=0.0.0.0:10902'
    - '--data-dir=/data'
    - '--objstore.config-file=/conf/objstore.yml'

nicolastakashi commented 3 months ago

Still seeing this issue on 0.33.0:

panic: runtime error: slice bounds out of range [:16000] with capacity 1249
goroutine 15326510 [running]:
github.com/thanos-io/thanos/pkg/store.(*bucketChunkReader).loadChunks(0x40032faae0, {0x26c6c28, 0x40248fcf00}, {0x4023fe8000, 0x94, 0x2bf1bc?}, {0x41ced43ba8, 0x1, 0x2}, 0x3997b40?, ...)
    /app/pkg/store/bucket.go:3504 +0xd18
github.com/thanos-io/thanos/pkg/store.(*bucketChunkReader).load.func4()
    /app/pkg/store/bucket.go:3438 +0x84
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:75 +0x58
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 15326102
    /go/pkg/mod/golang.org/x/sync@v0.3.0/errgroup/errgroup.go:72 +0x98

martinfrycs1 commented 3 months ago

So far we have fixed it ourselves in pkg/store/bucket.go, around line 3532:

        chunkLen = r.block.estimatedMaxChunkSize
        if i+1 < len(pIdxs) {
            if diff = pIdxs[i+1].offset - pIdx.offset; int(diff) < chunkLen {
                chunkLen = int(diff)
            }
        }

        // Fix: If we are about to read a chunk that is bigger than the buffer capacity,
        // we need to make sure we have enough space in the buffer.
        if cap(buf) < chunkLen {
            // Put the current buffer back to the pool.
            r.block.chunkPool.Put(&buf)

            // Get a new, bigger, buffer from the pool.
            bufPooled, err = r.block.chunkPool.Get(chunkLen)
            if err == nil {
                buf = *bufPooled
            } else {
                buf = make([]byte, chunkLen)
            }
        }
        // Fix: end

        cb := buf[:chunkLen]
        n, err = io.ReadFull(bufReader, cb)

From what I was able to debug, the source of the problem is the chunk size recorded in the block's meta.json. For some reason it states a smaller size than the real one, so the requested buffer is too small and the store panics. I guess the source of this is the compactor or the receiver. We also have "historic stores" serving data created by Thanos v0.30 and older, and there are no such panics there, so I assume it's a combination of the store optimization and something in the receiver or compactor code not filling these values correctly in recent versions.

This will not fix the source of the issue, but it at least fixes the store code, which doesn't correctly validate its inputs.
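
To check whether a suspect block really ships an understated chunk size, it may help to dump the index stats that newer Thanos versions write into the block's meta.json. The field names below are my reading of recent meta.json files ("thanos" -> "index_stats") and may differ between versions:

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    // Minimal meta.json reader; only the fields relevant to this issue are modelled,
    // and the layout is assumed from recent Thanos block metadata.
    type blockMeta struct {
        ULID   string `json:"ulid"`
        Thanos struct {
            IndexStats struct {
                SeriesMaxSize int64 `json:"series_max_size"`
                ChunkMaxSize  int64 `json:"chunk_max_size"`
            } `json:"index_stats"`
        } `json:"thanos"`
    }

    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: metadump <path-to-meta.json>")
            os.Exit(1)
        }
        f, err := os.Open(os.Args[1])
        if err != nil {
            panic(err)
        }
        defer f.Close()

        var m blockMeta
        if err := json.NewDecoder(f).Decode(&m); err != nil {
            panic(err)
        }
        // If chunk_max_size is smaller than the largest chunk actually stored in the block,
        // a buffer sized from estimatedMaxChunkSize will be too small for that chunk.
        fmt.Printf("block=%s series_max_size=%d chunk_max_size=%d\n",
            m.ULID, m.Thanos.IndexStats.SeriesMaxSize, m.Thanos.IndexStats.ChunkMaxSize)
    }

Comparing that value against the chunks in a block that triggers the panic could confirm (or rule out) the compactor/receiver hypothesis above.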