thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Receive panic during compact #6196

Closed jnyi closed 1 year ago

jnyi commented 1 year ago

Thanos, Prometheus and Golang version used: Thanos: v0.31.0-rc.0 Golang: go1.19.6

Object Storage Provider: MinIO for mock

What happened: The Thanos ingestor panics under a load of roughly 100 query requests per second.

Receive runs in ingestor-only mode with the following args:

      receive
      --debug.name=thanos-ingestor
      --log.format=logfmt
      --log.level=info
      --http-address=0.0.0.0:10902
      --http-grace-period=5m
      --grpc-address=0.0.0.0:10901
      --grpc-grace-period=5m
      --hash-func=SHA256
      --label replica="$(NAME)"
      --receive.default-tenant-id=default
      --receive.grpc-compression=snappy
      --remote-write.address=0.0.0.0:19291
      --receive.local-endpoint=$(NAME).thanos-ingestor-svc:10901
      --objstore.config-file=/secrets/thanos.yml
      --tsdb.path=/var/thanos/data
      --tsdb.retention=3d
      --tsdb.min-block-duration=2h
      --tsdb.max-block-duration=2h
      --tsdb.out-of-order.time-window=15m
      --tsdb.out-of-order.cap-max=16

Querier Args:

      query
      --debug.name=thanos-querier
      --log.format=logfmt
      --log.level=error
      --http-address=0.0.0.0:10902
      --http-grace-period=5m
      --grpc-address=0.0.0.0:10901
      --grpc-grace-period=5m
      --enable-feature=queryPushdown
      --grpc-compression=snappy
      --query.default-step=30s
      --query.lookback-delta=5m
      --query.max-concurrent=64
      --query.promql-engine=thanos
      --query.timeout=5m
      --query.replica-label=replica
      --store.response-timeout=1m
      --store=dnssrv+thanos-store-gateway:10901

What you expected to happen: Receive keeps running smoothly under this query load.

How to reproduce it (as minimally and precisely as possible): Not sure why this happened; we just started feeding it query load and the panic occurred.

Full logs to relevant components:

level=warn name=thanos-ingestor ts=2023-03-09T22:01:36.754164408Z caller=writer.go:210 component=receive component=receive-writer tenant=RwaDaemon msg="Error on ingesting samples with different value but same timestamp" numDropped=1
level=warn name=thanos-ingestor ts=2023-03-09T22:01:37.982209361Z caller=writer.go:210 component=receive component=receive-writer tenant=RwaDaemon msg="Error on ingesting samples with different value but same timestamp" numDropped=1
level=warn name=thanos-ingestor ts=2023-03-09T22:01:37.983128186Z caller=writer.go:210 component=receive component=receive-writer tenant=RwaDaemon msg="Error on ingesting samples with different value but same timestamp" numDropped=1
level=warn name=thanos-ingestor ts=2023-03-09T22:01:44.877165707Z caller=writer.go:210 component=receive component=receive-writer tenant=RwaDaemon msg="Error on ingesting samples with different value but same timestamp" numDropped=1
level=warn name=thanos-ingestor ts=2023-03-09T22:01:44.883470908Z caller=writer.go:210 component=receive component=receive-writer tenant=RwaDaemon msg="Error on ingesting samples with different value but same timestamp" numDropped=1
level=info name=thanos-ingestor ts=2023-03-09T22:01:56.986606953Z caller=compact.go:460 component=receive component=multi-tsdb tenant=RwaGlobal msg="compact blocks" count=4 mint=1678377600000 maxt=1678392000000 ulid=01GV465P7C6BFM7BMASAFM946N sources="[01GV3VN6S0E3WF6FH3TVWAQ3SK 01GV460JFC8N65A1EXG133Z177 01GV462P2QAASA3APHV3QB8WYS 01GV45XZYNPMCGMXBC27HBKQRZ]" duration=2m30.286546921s
unexpected fault address 0x7f6fa05485fc
fatal error: fault
[signal SIGSEGV: segmentation violation code=0x1 addr=0x7f6fa05485fc pc=0x46f0e1]

goroutine 686167694 [running]:
runtime.throw({0x25e51ec?, 0x0?})
    /usr/local/go/src/runtime/panic.go:1047 +0x5d fp=0xc083bfe7b8 sp=0xc083bfe788 pc=0x439ebd
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:842 +0x2c5 fp=0xc083bfe808 sp=0xc083bfe7b8 pc=0x450725
runtime.memmove()
    /usr/local/go/src/runtime/memmove_amd64.s:184 +0x141 fp=0xc083bfe810 sp=0xc083bfe808 pc=0x46f0e1
github.com/thanos-io/thanos/pkg/store/storepb.(*Chunk).MarshalToSizedBuffer(0xca6d0369c0, {0xc8c7c4a000, 0x64?, 0x1000003c0?})
    /app/pkg/store/storepb/types.pb.go:358 +0xce fp=0xc083bfe848 sp=0xc083bfe810 pc=0x14bab4e
github.com/thanos-io/thanos/pkg/store/storepb.(*AggrChunk).MarshalToSizedBuffer(0xcdd3f5fd80, {0xc8c7c4a000, 0x28e4, 0x2b83})
    /app/pkg/store/storepb/types.pb.go:504 +0x476 fp=0xc083bfe8a0 sp=0xc083bfe848 pc=0x14bb896
github.com/thanos-io/thanos/pkg/store/storepb.(*Series).MarshalToSizedBuffer(0xc7a90d9d40, {0xc8c7c4a000, 0x2b83, 0x2b83})
    /app/pkg/store/storepb/types.pb.go:394 +0x2c5 fp=0xc083bfe920 sp=0xc083bfe8a0 pc=0x14bb1a5
github.com/thanos-io/thanos/pkg/store/storepb.(*SeriesResponse_Series).MarshalToSizedBuffer(0x0?, {0xc8c7c4a000, 0x2b83, 0x14aeb8e?})
    /app/pkg/store/storepb/rpc.pb.go:1821 +0x3c fp=0xc083bfe950 sp=0xc083bfe920 pc=0x14abe1c
github.com/thanos-io/thanos/pkg/store/storepb.(*SeriesResponse_Series).MarshalTo(0xc083bfe9b0?, {0xc8c7c4a000, 0x2b83?, 0x2b83})
    /app/pkg/store/storepb/rpc.pb.go:1814 +0x47 fp=0xc083bfe988 sp=0xc083bfe950 pc=0x14abd87
github.com/thanos-io/thanos/pkg/store/storepb.(*SeriesResponse).MarshalToSizedBuffer(0xc972837600, {0xc8c7c4a000, 0x2b83, 0x2b83})
    /app/pkg/store/storepb/rpc.pb.go:1804 +0x99 fp=0xc083bfe9c0 sp=0xc083bfe988 pc=0x14abcb9
github.com/thanos-io/thanos/pkg/store/storepb.(*SeriesResponse).Marshal(0xc972837610?)
    /app/pkg/store/storepb/rpc.pb.go:1783 +0x56 fp=0xc083bfea08 sp=0xc083bfe9c0 pc=0x14abb16
google.golang.org/protobuf/internal/impl.legacyMarshal({{}, {0x2b75348, 0xc972837610}, {0x0, 0x0, 0x0}, 0x0})
    /go/pkg/mod/google.golang.org/protobuf@v1.28.1/internal/impl/legacy_message.go:402 +0xa2 fp=0xc083bfea90 sp=0xc083bfea08 pc=0x84cf42
google.golang.org/protobuf/proto.MarshalOptions.marshal({{}, 0x68?, 0x0, 0x0}, {0x0, 0x0, 0x0}, {0x2b75348, 0xc972837610})
    /go/pkg/mod/google.golang.org/protobuf@v1.28.1/proto/encode.go:166 +0x27b fp=0xc083bfeb30 sp=0xc083bfea90 pc=0x7e825b
google.golang.org/protobuf/proto.MarshalOptions.MarshalAppend({{}, 0xe0?, 0x39?, 0x53?}, {0x0, 0x0, 0x0}, {0x2b46060?, 0xc972837610?})
    /go/pkg/mod/google.golang.org/protobuf@v1.28.1/proto/encode.go:125 +0x79 fp=0xc083bfeb78 sp=0xc083bfeb30 pc=0x7e7e99
github.com/golang/protobuf/proto.marshalAppend({0x0, 0x0, 0x0}, {0x7f71043f8530?, 0xc972837600?}, 0x10?)
    /go/pkg/mod/github.com/golang/protobuf@v1.5.2/proto/wire.go:40 +0xa5 fp=0xc083bfebf8 sp=0xc083bfeb78 pc=0x879de5
github.com/golang/protobuf/proto.Marshal(...)
    /go/pkg/mod/github.com/golang/protobuf@v1.5.2/proto/wire.go:23
google.golang.org/grpc/encoding/proto.codec.Marshal({}, {0x25339e0, 0xc972837600})
    /go/pkg/mod/google.golang.org/grpc@v1.45.0/encoding/proto/proto.go:45 +0x4e fp=0xc083bfec48 sp=0xc083bfebf8 pc=0xf7e38e
google.golang.org/grpc/encoding/proto.(*codec).Marshal(0x100000000000002?, {0x25339e0?, 0xc972837600?})
    <autogenerated>:1 +0x37 fp=0xc083bfec68 sp=0xc083bfec48 pc=0xf7e577
google.golang.org/grpc.encode({0x7f7103956ac8?, 0x4057e40?}, {0x25339e0?, 0xc972837600?})
    /go/pkg/mod/google.golang.org/grpc@v1.45.0/rpc_util.go:594 +0x44 fp=0xc083bfecb8 sp=0xc083bfec68 pc=0xf96ea4
google.golang.org/grpc.prepareMsg({0x25339e0?, 0xc972837600?}, {0x7f7103956ac8?, 0x4057e40?}, {0x0, 0x0}, {0x2b557d0, 0xc000630370})
    /go/pkg/mod/google.golang.org/grpc@v1.45.0/stream.go:1610 +0xd2 fp=0xc083bfed30 sp=0xc083bfecb8 pc=0xfae1b2
google.golang.org/grpc.(*serverStream).SendMsg(0xc1720d95c0, {0x25339e0?, 0xc972837600})
    /go/pkg/mod/google.golang.org/grpc@v1.45.0/stream.go:1503 +0xf3 fp=0xc083bfee80 sp=0xc083bfed30 pc=0xfacfb3
github.com/grpc-ecosystem/go-grpc-prometheus.(*monitoredServerStream).SendMsg(0xc50414d248, {0x25339e0?, 0xc972837600?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:156 +0x33 fp=0xc083bfeeb8 sp=0xc083bfee80 pc=0x1c880f3
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.(*monitoredServerStream).SendMsg(0xca58188c90, {0x25339e0, 0xc972837600})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/server.go:64 +0x5a fp=0xc083bfef20 sp=0xc083bfeeb8 pc=0x13c33da
github.com/grpc-ecosystem/go-grpc-middleware/v2.(*WrappedServerStream).SendMsg(0x6?, {0x25339e0?, 0xc972837600?})
    <autogenerated>:1 +0x34 fp=0xc083bfef48 sp=0xc083bfef20 pc=0x14d0514
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.(*monitoredServerStream).SendMsg(0xca58188e10, {0x25339e0, 0xc972837600})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/server.go:64 +0x5a fp=0xc083bfefb0 sp=0xc083bfef48 pc=0x13c33da
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.(*monitoredServerStream).SendMsg(0xca58188e70, {0x25339e0, 0xc972837600})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/server.go:64 +0x5a fp=0xc083bff018 sp=0xc083bfefb0 pc=0x13c33da
github.com/thanos-io/thanos/pkg/store/storepb.(*storeSeriesServer).Send(0x2b5e7e0?, 0xc04e7c7b50?)
    /app/pkg/store/storepb/rpc.pb.go:1113 +0x2b fp=0xc083bff040 sp=0xc083bff018 pc=0x14a7fab
github.com/thanos-io/thanos/pkg/store.(*limitedServer).Send(0xca581891d0, 0xc972837600)
    /app/pkg/store/limiter.go:178 +0xd2 fp=0xc083bff090 sp=0xc083bff040 pc=0x19f7f52
github.com/thanos-io/thanos/pkg/store.(*instrumentedServer).Send(0xc7d313d9c0, 0xc972837600)
    /app/pkg/store/telemetry.go:138 +0x32 fp=0xc083bff0b0 sp=0xc083bff090 pc=0x1a0cc92
github.com/thanos-io/thanos/pkg/store.(*ProxyStore).Series(0xc00071b3b0, 0xca9cf290e0, {0x2b68950, 0xc7d313d9c0})
    /app/pkg/store/proxy.go:323 +0x1075 fp=0xc083bff388 sp=0xc083bff0b0 pc=0x1a03135
github.com/thanos-io/thanos/pkg/store.(*instrumentedStoreServer).Series(0xc0008a4ab0, 0xc0d85482a0?, {0x2b689a0?, 0xca581891d0})
    /app/pkg/store/telemetry.go:117 +0x9d fp=0xc083bff3d0 sp=0xc083bff388 pc=0x1a0cbbd
github.com/thanos-io/thanos/pkg/store.(*limitedStoreServer).Series(0xc0008a4b70, 0x20e36a0?, {0x2b68ae0?, 0xcaa1149d90})
    /app/pkg/store/limiter.go:141 +0x18a fp=0xc083bff430 sp=0xc083bff3d0 pc=0x19f7e2a
github.com/thanos-io/thanos/pkg/store.(*ReadWriteTSDBStore).Series(0x10?, 0x7f8bae3f31d8?, {0x2b68ae0?, 0xcaa1149d90?})
    <autogenerated>:1 +0x34 fp=0xc083bff460 sp=0xc083bff430 pc=0x1a13894
github.com/thanos-io/thanos/pkg/store.(*recoverableStoreServer).Series(0xca58188e70?, 0x24ef7c0?, {0x2b68ae0?, 0xcaa1149d90?})
    /app/pkg/store/recover.go:28 +0x83 fp=0xc083bff4d0 sp=0xc083bff460 pc=0x1a0ba03
github.com/thanos-io/thanos/pkg/store/storepb._Store_Series_Handler({0x2304200?, 0xc0000c9f60}, {0x2b65440, 0xca58188e70})
    /app/pkg/store/storepb/rpc.pb.go:1100 +0xd0 fp=0xc083bff510 sp=0xc083bff4d0 pc=0x14a7f30
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.StreamServerInterceptor.func1({0x2304200, 0xc0000c9f60}, {0x2b65440, 0xca58188e10}, 0xc50414d230, 0x26d7750)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/server.go:35 +0x2ac fp=0xc083bff640 sp=0xc083bff510 pc=0x13c328c
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainStreamServer.func1.1.1({0x2304200?, 0xc0000c9f60?}, {0x2b65440?, 0xca58188e10?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:51 +0x3a fp=0xc083bff680 sp=0xc083bff640 pc=0x14cf8fa
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.StreamServerInterceptor.func1({0x2304200, 0xc0000c9f60}, {0x2b653b0, 0xc7d313d940}, 0xc50414d230, 0xc7d313d8a0)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/server.go:35 +0x2ac fp=0xc083bff7b0 sp=0xc083bff680 pc=0x13c328c
github.com/thanos-io/thanos/pkg/tracing.StreamServerInterceptor.func1({0x2304200, 0xc0000c9f60}, {0x2b65440?, 0xca58188c90?}, 0x30?, 0xc85e1c29a0?)
    /app/pkg/tracing/grpc.go:42 +0x19b fp=0xc083bff840 sp=0xc083bff7b0 pc=0x150d9fb
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainStreamServer.func1.1.1({0x2304200?, 0xc0000c9f60?}, {0x2b65440?, 0xca58188c90?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:51 +0x3a fp=0xc083bff880 sp=0xc083bff840 pc=0x14cf8fa
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors.StreamServerInterceptor.func1({0x2304200, 0xc0000c9f60}, {0x2b654d0, 0xc50414d248}, 0xc50414d230, 0xc7d313d8c0)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/server.go:35 +0x2ac fp=0xc083bff9b0 sp=0xc083bff880 pc=0x13c328c
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainStreamServer.func1.1.1({0x2304200?, 0xc0000c9f60?}, {0x2b654d0?, 0xc50414d248?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:51 +0x3a fp=0xc083bff9f0 sp=0xc083bff9b0 pc=0x14cf8fa
github.com/grpc-ecosystem/go-grpc-prometheus.(*ServerMetrics).StreamServerInterceptor.func1({0x2304200, 0xc0000c9f60}, {0x2b65ea8?, 0xc1720d95c0}, 0xc85e1c2ad8?, 0xc7d313d8e0)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-prometheus@v1.2.0/server_metrics.go:121 +0x109 fp=0xc083bffa40 sp=0xc083bff9f0 pc=0x1c879c9
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainStreamServer.func1.1.1({0x2304200?, 0xc0000c9f60?}, {0x2b65ea8?, 0xc1720d95c0?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:51 +0x3a fp=0xc083bffa80 sp=0xc083bffa40 pc=0x14cf8fa
github.com/grpc-ecosystem/go-grpc-middleware/v2/interceptors/recovery.StreamServerInterceptor.func1({0x2304200?, 0xc0000c9f60?}, {0x2b65ea8?, 0xc1720d95c0?}, 0x7f7106b55808?, 0xc50414d230?)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/interceptors/recovery/interceptors.go:45 +0x9e fp=0xc083bffaf8 sp=0xc083bffa80 pc=0x1c8c7be
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainStreamServer.func1.1.1({0x2304200?, 0xc0000c9f60?}, {0x2b65ea8?, 0xc1720d95c0?})
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:51 +0x3a fp=0xc083bffb38 sp=0xc083bffaf8 pc=0x14cf8fa
github.com/grpc-ecosystem/go-grpc-middleware/v2.ChainStreamServer.func1({0x2304200, 0xc0000c9f60}, {0x2b65ea8, 0xc1720d95c0}, 0x207bf60?, 0xcaa1149bf0?)
    /go/pkg/mod/github.com/grpc-ecosystem/go-grpc-middleware/v2@v2.0.0-rc.2.0.20201207153454-9f6bf00c00a7/chain.go:60 +0xbe fp=0xc083bffb90 sp=0xc083bffb38 pc=0x14cf79e
google.golang.org/grpc.(*Server).processStreamingRPC(0xc0007d2380, {0x2b6cea0, 0xcb1495d040}, 0xcaa1122c60, 0xc0008a52c0, 0x3fecbc0, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1548 +0xf1b fp=0xc083bffe48 sp=0xc083bffb90 pc=0xfa0b3b
google.golang.org/grpc.(*Server).handleStream(0xc0007d2380, {0x2b6cea0, 0xcb1495d040}, 0xcaa1122c60, 0x0)
    /go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:1623 +0x9ea fp=0xc083bfff68 sp=0xc083bffe48 pc=0xfa234a
google.golang.org/grpc.(*Server).serveStreams.func1.2()
    /go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:921 +0x98 fp=0xc083bfffe0 sp=0xc083bfff68 pc=0xf9b8b8
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1594 +0x1 fp=0xc083bfffe8 sp=0xc083bfffe0 pc=0x46e0e1
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /go/pkg/mod/google.golang.org/grpc@v1.45.0/server.go:919 +0x28a
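
Reading the trace: the fault is raised in `runtime.memmove` while `(*Chunk).MarshalToSizedBuffer` copies the chunk bytes into the outgoing buffer. The fault address `0x7f6fa05485fc` lies outside the `0xc0…` range where Go on linux/amd64 usually places heap objects. That would be consistent with the chunk's byte slice pointing at memory the Go heap does not manage (for example, a memory-mapped file) that was no longer valid at send time. This is an assumption, not a confirmed diagnosis. The sketch below does not use any Thanos code: `chunk`, `aliasChunk` and `copyChunk` are hypothetical names, and an ordinary reused buffer stands in for the invalidated memory, so the aliased data is silently overwritten instead of faulting.

```go
package main

import "fmt"

// chunk is a hypothetical stand-in for a message that carries raw bytes.
type chunk struct {
	Data []byte
}

// aliasChunk returns a chunk whose Data shares memory with buf.
func aliasChunk(buf []byte) chunk {
	return chunk{Data: buf}
}

// copyChunk returns a chunk that owns a private copy of buf.
func copyChunk(buf []byte) chunk {
	owned := make([]byte, len(buf))
	copy(owned, buf)
	return chunk{Data: owned}
}

func main() {
	buf := []byte("sample-1")
	aliased := aliasChunk(buf)
	copied := copyChunk(buf)

	// Simulate the owner reusing the buffer before the chunk is serialized.
	copy(buf, "XXXXXXXX")

	fmt.Printf("aliased=%s copied=%s\n", aliased.Data, copied.Data)
}
```

The aliased chunk observes the overwrite while the copied one does not. With unmapped memory the same aliasing would surface as a fault at read time instead of as changed bytes.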

Anything else we need to know:

GiedriusS commented 1 year ago

Is this the full log? Looks like it is missing some data.

fpetkovski commented 1 year ago

Potentially related: https://github.com/thanos-io/thanos/issues/6190. However, we run around 10,000 RPS against our receivers and rarely see panics like this one.

Did you run Thanos Receive prior to v0.31.0-rc.0, and do you see any correlation with head compaction times? Also, as Giedrius mentioned, I am not sure that the relevant error is present in the logs. The goroutine that caused the panic is the first one in the stack trace, so we would need to see the beginning of the logs.

jnyi commented 1 year ago

We tried v0.30.2 and switched to v0.31.0-rc.0 because of the fix for the other panic, which occurs when the receiver starts: https://github.com/thanos-io/thanos/pull/6067

You are right: somehow k8s didn't print the full logs from the previously terminated pod. This time I was able to capture the logs in time, and I have updated the description.

fpetkovski commented 1 year ago

@jnyi I've published v0.31.0-rc.1. Could you try it out to see if it fixes the issue for you?

jnyi commented 1 year ago

Yep, I've incorporated your change, and so far we are no longer seeing the issue.

fpetkovski commented 1 year ago

Awesome, thanks for confirming. I'll close this issue for now; feel free to reopen it if it happens again.