grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Ingester SIGBUS when queried against, and PV gets filled up #3212

Open rytswd opened 2 years ago

rytswd commented 2 years ago

Describe the bug

I have been testing Mimir with Istio in the same cluster, but Ingester pods get killed when queried, which seems to happen only after Mimir has been running for a while (around 24 hours). After the SIGBUS, the Ingester fails to come back up due to no disk space on the PV (it seems as if the PV gets completely filled up when the SIGBUS happens?).

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir v2.3.1 (with the Helm chart v3.1.0, and with Istio in the cluster)
  2. Let a Prometheus instance generate enough metrics (I happened to let it run for about 24 hours on the 3 occasions when I could reproduce this)
  3. Check the Mimir Ingester state -> healthy, WAL written with no errors
  4. Query some metrics from Mimir (I used Grafana Explore against mimir-nginx)
  5. Check the Mimir Ingester state -> panic with SIGBUS

At this point, the Ingester's PVs somehow become completely full, and the Ingester cannot start up anymore, failing with a "no disk space" error. It looks as if the SIGBUS error also causes the PV to be filled up completely.

Note that simply deploying Mimir and querying shortly after works just fine. This error is reproducible only after ingesting metrics for a while.

Expected behavior

Ingester should keep running healthily.

Environment

Additional Context

I'm testing Mimir with Istio in the same cluster. There are a few modifications I made to the Helm chart, such as port names, but nothing major. Based on the logs below I don't believe Istio is causing the error, but any help is appreciated!

unexpected fault address 0x7f5d14bc0000
fatal error: fault
[signal SIGBUS: bus error code=0x2 addr=0x7f5d14bc0000 pc=0x18d94a7]

goroutine 4375839 [running]:
runtime.throw({0x1f7a8a1?, 0x19e9db8?})
    /usr/local/go/src/runtime/panic.go:992 +0x71 fp=0xc00a9b92e8 sp=0xc00a9b92b8 pc=0x438771
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:815 +0x125 fp=0xc00a9b9338 sp=0xc00a9b92e8 pc=0x44e965
encoding/binary.bigEndian.PutUint64(...)
    /usr/local/go/src/encoding/binary/binary.go:139
github.com/grafana/mimir/pkg/util/activitytracker.(*ActivityTracker).Insert(0xc000997270, 0xc00a9b9420)
    /__w/mimir/mimir/pkg/util/activitytracker/activity_tracker.go:120 +0x147 fp=0xc00a9b93c8 sp=0xc00a9b9338 pc=0x18d94a7
github.com/grafana/mimir/pkg/ingester.(*ActivityTrackerWrapper).LabelNames(0xc000a409e0, {0x2598168, 0xc013f17980}, 0xc01a0389f0)
    /__w/mimir/mimir/pkg/ingester/ingester_activity.go:70 +0xaa fp=0xc00a9b9458 sp=0xc00a9b93c8 pc=0x19e9c2a
github.com/grafana/mimir/pkg/ingester/client._Ingester_LabelNames_Handler.func1({0x2598168, 0xc013f17980}, {0x1eef8c0?, 0xc01a0389f0})
    /__w/mimir/mimir/pkg/ingester/client/ingester.pb.go:3802 +0x78 fp=0xc00a9b9498 sp=0xc00a9b9458 pc=0x175efd8
github.com/grafana/mimir/pkg/mimir.ThanosTracerUnaryInterceptor({0x2598168?, 0xc013f17950?}, {0x1eef8c0, 0xc01a0389f0}, 0xc0140062f0?, 0xc01a038a20)
    /__w/mimir/mimir/pkg/mimir/tracing.go:25 +0x7a fp=0xc00a9b94d8 sp=0xc00a9b9498 pc=0x1a5acba
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x2598168?, 0xc013f17950?}, {0x1eef8c0?, 0xc01a0389f0?})
    /__w/mimir/mimir/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a fp=0xc00a9b9518 sp=0xc00a9b94d8 pc=0x1077eba
github.com/weaveworks/common/middleware.ServerUserHeaderInterceptor({0x2598168?, 0xc013f178f0?}, {0x1eef8c0, 0xc01a0389f0}, 0xc00a9b95e0?, 0xc0097330c0)
    /__w/mimir/mimir/vendor/github.com/weaveworks/common/middleware/grpc_auth.go:38 +0x65 fp=0xc00a9b9548 sp=0xc00a9b9518 pc=0x10b9c25
github.com/grafana/mimir/pkg/util/noauth.SetupAuthMiddleware.func1({0x2598168, 0xc013f178f0}, {0x1eef8c0, 0xc01a0389f0}, 0xc0097330a0, 0xc0097330c0)
    /__w/mimir/mimir/pkg/util/noauth/no_auth.go:32 +0xa7 fp=0xc00a9b9588 sp=0xc00a9b9548 pc=0x1a43dc7
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x2598168?, 0xc013f178f0?}, {0x1eef8c0?, 0xc01a0389f0?})
    /__w/mimir/mimir/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a fp=0xc00a9b95c8 sp=0xc00a9b9588 pc=0x1077eba
github.com/weaveworks/common/middleware.UnaryServerInstrumentInterceptor.func1({0x2598168, 0xc013f178f0}, {0x1eef8c0, 0xc01a0389f0}, 0xc0097330a0, 0xc0097330e0)
    /__w/mimir/mimir/vendor/github.com/weaveworks/common/middleware/grpc_instrumentation.go:35 +0xa2 fp=0xc00a9b9658 sp=0xc00a9b95c8 pc=0x10ba142
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x2598168?, 0xc013f178f0?}, {0x1eef8c0?, 0xc01a0389f0?})
    /__w/mimir/mimir/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a fp=0xc00a9b9698 sp=0xc00a9b9658 pc=0x1077eba
github.com/opentracing-contrib/go-grpc.OpenTracingServerInterceptor.func1({0x2598168, 0xc013f17860}, {0x1eef8c0, 0xc01a0389f0}, 0xc0097330a0, 0xc009733100)
    /__w/mimir/mimir/vendor/github.com/opentracing-contrib/go-grpc/server.go:57 +0x402 fp=0xc00a9b98c8 sp=0xc00a9b9698 pc=0x107b3e2
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x2598168?, 0xc013f17860?}, {0x1eef8c0?, 0xc01a0389f0?})
    /__w/mimir/mimir/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a fp=0xc00a9b9908 sp=0xc00a9b98c8 pc=0x1077eba
github.com/weaveworks/common/middleware.GRPCServerLog.UnaryServerInterceptor({{0x25ab7b0?, 0xc00071e8b0?}, 0x0?, 0x4?}, {0x2598168, 0xc013f17860}, {0x1eef8c0, 0xc01a0389f0}, 0xc0097330a0, 0xc009733120)
    /__w/mimir/mimir/vendor/github.com/weaveworks/common/middleware/grpc_logging.go:30 +0xbb fp=0xc00a9b99c0 sp=0xc00a9b9908 pc=0x10bb33b
github.com/weaveworks/common/middleware.GRPCServerLog.UnaryServerInterceptor-fm({0x2598168?, 0xc013f17860?}, {0x1eef8c0?, 0xc01a0389f0?}, 0x7f5d121291c0?, 0xc0097330a0?)
    <autogenerated>:1 +0x71 fp=0xc00a9b9a18 sp=0xc00a9b99c0 pc=0x10c7d31
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1.1.1({0x2598168?, 0xc013f17860?}, {0x1eef8c0?, 0xc01a0389f0?})
    /__w/mimir/mimir/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:25 +0x3a fp=0xc00a9b9a58 sp=0xc00a9b9a18 pc=0x1077eba
github.com/grpc-ecosystem/go-grpc-middleware.ChainUnaryServer.func1({0x2598168, 0xc013f17860}, {0x1eef8c0, 0xc01a0389f0}, 0xc00f941af8?, 0x1c6e340?)
    /__w/mimir/mimir/vendor/github.com/grpc-ecosystem/go-grpc-middleware/chain.go:34 +0xbf fp=0xc00a9b9ab0 sp=0xc00a9b9a58 pc=0x1077d5f
github.com/grafana/mimir/pkg/ingester/client._Ingester_LabelNames_Handler({0x1ea2000?, 0xc000a409e0}, {0x2598168, 0xc013f17860}, 0xc01bc458c0, 0xc0009e35c0)
    /__w/mimir/mimir/pkg/ingester/client/ingester.pb.go:3804 +0x138 fp=0xc00a9b9b08 sp=0xc00a9b9ab0 pc=0x175ee98
google.golang.org/grpc.(*Server).processUnaryRPC(0xc000a25dc0, {0x25aaa08, 0xc0007a4820}, 0xc0179d3440, 0xc000b178f0, 0x3489e48, 0x0)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:1282 +0xccf fp=0xc00a9b9e48 sp=0xc00a9b9b08 pc=0xa6032f
google.golang.org/grpc.(*Server).handleStream(0xc000a25dc0, {0x25aaa08, 0xc0007a4820}, 0xc0179d3440, 0x0)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:1619 +0xa1b fp=0xc00a9b9f68 sp=0xc00a9b9e48 pc=0xa6495b
google.golang.org/grpc.(*Server).serveStreams.func1.2()
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:921 +0x98 fp=0xc00a9b9fe0 sp=0xc00a9b9f68 pc=0xa5de98
runtime.goexit()
    /usr/local/go/src/runtime/asm_amd64.s:1571 +0x1 fp=0xc00a9b9fe8 sp=0xc00a9b9fe0 pc=0x46b841
created by google.golang.org/grpc.(*Server).serveStreams.func1
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:919 +0x28a

goroutine 1 [select, 1141 minutes]:
github.com/grafana/dskit/services.(*Manager).AwaitStopped(0xc00009c480, {0x25980f8, 0xc000066050})
    /__w/mimir/mimir/vendor/github.com/grafana/dskit/services/manager.go:145 +0x6d
github.com/grafana/mimir/pkg/mimir.(*Mimir).Run(0xc0009e4000)
    /__w/mimir/mimir/pkg/mimir/mimir.go:796 +0x7ce
main.main()
    /__w/mimir/mimir/cmd/mimir/main.go:212 +0xb30

goroutine 30 [select]:
go.opencensus.io/stats/view.(*worker).start(0xc0001ad500)
    /__w/mimir/mimir/vendor/go.opencensus.io/stats/view/worker.go:276 +0xad
created by go.opencensus.io/stats/view.init.0
    /__w/mimir/mimir/vendor/go.opencensus.io/stats/view/worker.go:34 +0x8d

goroutine 10 [chan receive, 1141 minutes]:
github.com/grafana/mimir/pkg/alertmanager.init.0.func1()
    /__w/mimir/mimir/pkg/alertmanager/alertmanager.go:133 +0x45
created by github.com/grafana/mimir/pkg/alertmanager.init.0
    /__w/mimir/mimir/pkg/alertmanager/alertmanager.go:129 +0x25

goroutine 202 [chan receive, 1141 minutes]:
github.com/grafana/dskit/services.(*BasicService).AddListener.func1()
    /__w/mimir/mimir/vendor/github.com/grafana/dskit/services/basic_service.go:344 +0x66
created by github.com/grafana/dskit/services.(*BasicService).AddListener
github.com/prometheus/prometheus/tsdb/chunks.(*writeJobQueue).pop(0xc000b2ab40)
    /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunks/queue.go:115 +0xc6
github.com/prometheus/prometheus/tsdb/chunks.(*chunkWriteQueue).start.func1()
    /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunks/chunk_write_queue.go:121 +0xc5
created by github.com/prometheus/prometheus/tsdb/chunks.(*chunkWriteQueue).start
    /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/chunks/chunk_write_queue.go:117 +0x6d

goroutine 820 [select, 644 minutes]:
github.com/prometheus/prometheus/tsdb/wal.(*WAL).run(0xc000bd07e0)
    /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/wal/wal.go:342 +0xa5
created by github.com/prometheus/prometheus/tsdb/wal.NewSize
    /__w/mimir/mimir/vendor/github.com/prometheus/prometheus/tsdb/wal/wal.go:311 +0x30a

goroutine 4366351 [select, 1 minutes]:
google.golang.org/grpc/internal/transport.(*http2Server).keepalive(0xc0007a4680)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:1125 +0x233
created by google.golang.org/grpc/internal/transport.NewServerTransport
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:335 +0x1878

goroutine 4366352 [IO wait]:
internal/poll.runtime_pollWait(0x7f5d14ed5868, 0x72)
    /usr/local/go/src/runtime/netpoll.go:302 +0x89
internal/poll.(*pollDesc).wait(0xc01069ee80?, 0xc00bb68000?, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:83 +0x32
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:88
internal/poll.(*FD).Read(0xc01069ee80, {0xc00bb68000, 0x8000, 0x8000})
    /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc01069ee80, {0xc00bb68000?, 0x1060100000000?, 0x8?})
    /usr/local/go/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc00ecb0508, {0xc00bb68000?, 0xc01a0386d8?, 0x0?})
    /usr/local/go/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc01584e120, {0xc000ebe200, 0x9, 0x3e4a79b27290?})
    /usr/local/go/src/bufio/bufio.go:236 +0x1b4
io.ReadAtLeast({0x257ca40, 0xc01584e120}, {0xc000ebe200, 0x9, 0x9}, 0x9)
    /usr/local/go/src/io/io.go:331 +0x9a
io.ReadFull(...)
    /usr/local/go/src/io/io.go:350
golang.org/x/net/http2.readFrameHeader({0xc000ebe200?, 0x9?, 0xc0173fe000?}, {0x257ca40?, 0xc01584e120?})
    /__w/mimir/mimir/vendor/golang.org/x/net/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc000ebe1c0)
    /__w/mimir/mimir/vendor/golang.org/x/net/http2/frame.go:498 +0x95
google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc0007a4680, 0x2575220?, 0x34dd801?)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:605 +0xb2
google.golang.org/grpc.(*Server).serveStreams(0xc000a25dc0, {0x25aaa08?, 0xc0007a4680})
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:905 +0x142
google.golang.org/grpc.(*Server).handleRawConn.func1()
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:847 +0x46
created by google.golang.org/grpc.(*Server).handleRawConn
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:846 +0x185

goroutine 4370111 [select, 1 minutes]:
google.golang.org/grpc/internal/transport.(*http2Server).keepalive(0xc0007a4820)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:1125 +0x233
created by google.golang.org/grpc/internal/transport.NewServerTransport
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:335 +0x1878

goroutine 4370112 [IO wait]:
internal/poll.runtime_pollWait(0x7f5d14ed5688, 0x72)
    /usr/local/go/src/runtime/netpoll.go:302 +0x89
internal/poll.(*pollDesc).wait(0xc0137ea880?, 0xc00e336000?, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:83 +0x32
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:88
internal/poll.(*FD).Read(0xc0137ea880, {0xc00e336000, 0x8000, 0x8000})
    /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc0137ea880, {0xc00e336000?, 0x1000100000b59?, 0xb5900000015?})
    /usr/local/go/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc01c989b90, {0xc00e336000?, 0xc01a0389c0?, 0x0?})
    /usr/local/go/src/net/net.go:183 +0x45
bufio.(*Reader).Read(0xc018dd16e0, {0xc00024a740, 0x9, 0xc01a0389c0?})
    /usr/local/go/src/bufio/bufio.go:236 +0x1b4
io.ReadAtLeast({0x257ca40, 0xc018dd16e0}, {0xc00024a740, 0x9, 0x9}, 0x9)
    /usr/local/go/src/io/io.go:331 +0x9a
io.ReadFull(...)
    /usr/local/go/src/io/io.go:350
golang.org/x/net/http2.readFrameHeader({0xc00024a740?, 0x9?, 0xc000068080?}, {0x257ca40?, 0xc018dd16e0?})
    /__w/mimir/mimir/vendor/golang.org/x/net/http2/frame.go:237 +0x6e
golang.org/x/net/http2.(*Framer).ReadFrame(0xc00024a700)
    /__w/mimir/mimir/vendor/golang.org/x/net/http2/frame.go:498 +0x95
google.golang.org/grpc/internal/transport.(*http2Server).HandleStreams(0xc0007a4820, 0x258c740?, 0x34dd870?)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:605 +0xb2
google.golang.org/grpc.(*Server).serveStreams(0xc000a25dc0, {0x25aaa08?, 0xc0007a4820})
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:905 +0x142
google.golang.org/grpc.(*Server).handleRawConn.func1()
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:847 +0x46
created by google.golang.org/grpc.(*Server).handleRawConn
    /__w/mimir/mimir/vendor/google.golang.org/grpc/server.go:846 +0x185

goroutine 4366350 [select]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc013e13f90, 0x1)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:407 +0x115
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc01584e720)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:534 +0x85
google.golang.org/grpc/internal/transport.NewServerTransport.func2()
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:326 +0xc6
created by google.golang.org/grpc/internal/transport.NewServerTransport
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:323 +0x1833

goroutine 4370110 [runnable]:
google.golang.org/grpc/internal/transport.(*controlBuffer).get(0xc0188f8190, 0x1)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:407 +0x115
google.golang.org/grpc/internal/transport.(*loopyWriter).run(0xc018dd18c0)
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/controlbuf.go:534 +0x85
google.golang.org/grpc/internal/transport.NewServerTransport.func2()
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:326 +0xc6
created by google.golang.org/grpc/internal/transport.NewServerTransport
    /__w/mimir/mimir/vendor/google.golang.org/grpc/internal/transport/http2_server.go:323 +0x1833

goroutine 4375824 [IO wait]:
internal/poll.runtime_pollWait(0x7f5d14ed51d8, 0x72)
    /usr/local/go/src/runtime/netpoll.go:302 +0x89
internal/poll.(*pollDesc).wait(0xc013e5d680?, 0xc016846240?, 0x0)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:83 +0x32
internal/poll.(*pollDesc).waitRead(...)
    /usr/local/go/src/internal/poll/fd_poll_runtime.go:88
internal/poll.(*FD).Read(0xc013e5d680, {0xc016846240, 0x1, 0x1})
    /usr/local/go/src/internal/poll/fd_unix.go:167 +0x25a
net.(*netFD).Read(0xc013e5d680, {0xc016846240?, 0x100c022212b60?, 0x40a50d?})
    /usr/local/go/src/net/fd_posix.go:55 +0x29
net.(*conn).Read(0xc022f52608, {0xc016846240?, 0xc000be0558?, 0x40b416?})
    /usr/local/go/src/net/net.go:183 +0x45
io.ReadAtLeast({0x25841a0, 0xc022f52608}, {0xc016846240, 0x1, 0x1}, 0x1)
    /usr/local/go/src/io/io.go:331 +0x9a
io.ReadFull(...)
    /usr/local/go/src/io/io.go:350
github.com/grafana/dskit/kv/memberlist.(*TCPTransport).handleConnection(0xc0000001e0, {0x25a9b88, 0xc022f52608})
    /__w/mimir/mimir/vendor/github.com/grafana/dskit/kv/memberlist/tcp_transport.go:250 +0x1e5
created by github.com/grafana/dskit/kv/memberlist.(*TCPTransport).tcpListen
    /__w/mimir/mimir/vendor/github.com/grafana/dskit/kv/memberlist/tcp_transport.go:225 +0x295

pracucci commented 2 years ago

Based on your report, I think there are two distinct issues to investigate:

  1. Why does the process crash with SIGBUS?
  2. Why is the disk full at startup?

1. Why does the process crash with SIGBUS?

According to the stack trace you shared, the SIGBUS is triggered by this call:

github.com/grafana/mimir/pkg/util/activitytracker.(*ActivityTracker).Insert(0xc000997270, 0xc00a9b9420)

What we do here is write to the activity tracker log file. The file is mmap-ed: it's a file on disk that gets mapped into memory. We write to it each time a query runs.
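
To illustrate, here is a minimal standalone sketch of the failure mode (not Mimir's actual code): the write into the mapping is a plain memory store with no error return, so when the filesystem cannot allocate the faulted page, the kernel answers with SIGBUS.

// Sketch only: why a write through an mmap-ed file can surface as SIGBUS
// rather than as an ordinary error return.
package main

import (
	"encoding/binary"
	"os"
	"syscall"
)

func main() {
	const size = 1 << 20 // hypothetical 1 MiB activity file

	f, err := os.OpenFile("/data/activity.log", os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Truncate only reserves the size logically; on most filesystems the
	// blocks are allocated lazily, when a page is first written.
	if err := f.Truncate(size); err != nil {
		panic(err)
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, size,
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		panic(err)
	}
	defer syscall.Munmap(data)

	// Equivalent to the PutUint64 call in the stack trace above: a plain
	// store into the mapping. On a full disk, this is where SIGBUS fires.
	binary.BigEndian.PutUint64(data[:8], 42)
}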

You mentioned that the issue doesn't happen if Mimir was started just recently. Mimir keeps the last 24h of data on the ingesters' disk (by default), which makes me think you have actually exhausted the disk space by the time this issue happens, so I would start investigating from (2).

2. Why is the disk full at startup?

Have you had a chance to look inside the disk and see what is actually taking up the space there?

rytswd commented 2 years ago

@pracucci Thanks for the detailed explanation! I took a further look into the disk usage - it was in fact the WAL taking up much more space than I originally anticipated. I have two follow-up questions on this:

  1. Should Mimir panic when there is no disk space for the activity tracker log file?
  2. What are the flags to be adjusted for data retention setup?

1. Should Mimir panic when there is no disk space for the activity tracker logs?

Although I can understand why it crashed, I would have hoped Mimir would gracefully handle the full-disk error, similar to how the WAL does. In my setup, once an ingester panics, it won't come back up due to the insufficient disk space, and it was a bit cumbersome to investigate the disk usage. The panic log was a clear indication that something went completely wrong, but I feel a panic may be a bit too extreme.

If this could be a potential enhancement, I'd be keen to take a stab at the actual implementation as well ☺️
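
For example, something along these lines (all names are purely illustrative, not Mimir's actual API): pre-allocate the activity file up front so a full disk shows up as a normal error, and fall back to a no-op tracker instead of panicking later.

// Illustrative sketch of the graceful-degradation idea; none of these
// names are Mimir's real API.
package activitysketch

import (
	"log"
	"os"
	"syscall"

	"golang.org/x/sys/unix"
)

// tracker is a stand-in for the real ActivityTracker; a nil data slice
// means the tracker has been disabled.
type tracker struct {
	data []byte
}

func newTracker(path string, size int64) *tracker {
	f, err := os.OpenFile(path, os.O_RDWR|os.O_CREATE, 0o644)
	if err != nil {
		log.Printf("activity tracker disabled: %v", err)
		return &tracker{}
	}
	defer f.Close()

	// Allocate real blocks now: on a full disk this fails with ENOSPC
	// here, instead of a SIGBUS on some later page fault.
	if err := unix.Fallocate(int(f.Fd()), 0, 0, size); err != nil {
		log.Printf("activity tracker disabled: %v", err)
		return &tracker{}
	}

	data, err := syscall.Mmap(int(f.Fd()), 0, int(size),
		syscall.PROT_READ|syscall.PROT_WRITE, syscall.MAP_SHARED)
	if err != nil {
		log.Printf("activity tracker disabled: %v", err)
		return &tracker{}
	}
	return &tracker{data: data}
}

// insert records an entry, silently becoming a no-op when disabled.
func (t *tracker) insert(entry []byte) {
	if t.data == nil {
		return
	}
	copy(t.data, entry)
}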

2. What are the flags to be adjusted for data retention setup?

I can see blocks_storage.tsdb.retention_period is set to 24h by default, but is this the right one, and are there other flags I should be adjusting? For my use case, I would like to send the data to S3 or another object storage rather quickly, but I couldn't figure out from the docs which set of configurations should be updated... It would be great if you could provide some pointers on what to look for / be aware of 🙏
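
For context, here is roughly the relevant part of my config (bucket and endpoint values are placeholders):

blocks_storage:
  backend: s3
  s3:
    endpoint: s3.us-east-1.amazonaws.com
    bucket_name: example-mimir-blocks
  tsdb:
    retention_period: 24h  # the default I mentioned above

If I read the docs correctly, blocks seem to be shipped to object storage when they are cut (every 2h by default, per blocks_storage.tsdb.block_ranges_period), independently of retention_period, but I may well be wrong about that.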

pracucci commented 2 years ago

  1. Should Mimir panic when there is no disk space for the activity tracker logs?

Ideally, no. The activity tracker is not an essential feature, so it shouldn't crash the process if there's no space left on disk. Because of this, I'm open to a fix, unless it significantly complicates the code (see below).

That being said, is disk exhaustion really different from an out-of-memory issue? Generally speaking, the system has exhausted a non-compressible resource (e.g. memory or disk) and the process can't work as expected anymore. The process crashes if memory is exhausted, and I personally prefer processes that make it very loud when an essential resource has been exhausted, so I don't even think it's that bad.

  2. What are the flags to be adjusted for data retention setup?

The short answer is to reduce blocks_storage.tsdb.retention_period to a value not lower than 13h.

If you want to reduce it to a value lower than 13h, there are other configuration values to fine-tune to get the system working correctly on the read path (otherwise you will see gaps in your queries, because by default we query only the last 12h of data from ingesters). We generally strongly recommend running Mimir with the default config (which has been tuned based on Grafana Labs' experience) and just increasing the disk size if possible :)
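
For reference, here is how those defaults interlock (parameter names as of Mimir v2.3; please double-check against the reference docs for your version):

blocks_storage:
  tsdb:
    retention_period: 24h      # default; keep it >= querier.query_ingesters_within

querier:
  query_store_after: 12h       # data older than this is fetched from object storage
  query_ingesters_within: 13h  # ingesters are only queried within this window

If retention_period drops below 13h, there is a window of data that ingesters have already deleted but that queriers do not yet fetch from object storage, which is where the gaps come from.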

rytswd commented 2 years ago

Sorry for my delayed response, thanks for all the details!

I agree that the disk space error is similar to OOM, and the error being loud and clear makes sense. While I do understand services failing to become "ready", failing with a panic here seems a bit too crude IMHO. It was quite difficult to pinpoint the cause of the error when it panicked the first time, and I could only figure out the reproduction steps days later 😅

For the retention period, I appreciate those details; I wouldn't have known if you hadn't pointed them out. I may have missed it, but is there some documentation on which values can be tweaked without affecting other dependent settings? While the configuration flexibility gives us a lot of power, I am probably missing a lot of the nuances of each field made available for user configuration...

For this specific use case, though, I am opting to go with the disk size increase as you suggested, so please go ahead and close the ticket if no further actions are needed based on our conversation. If there is anything I can help with around error/panic handling and/or documentation, please let me know, I'd be happy to contribute back 🥰

pracucci commented 1 year ago

Thanks for your follow-up (and sorry for this very late reply).

I may have missed, but is there some documentation as to what values can be tweaked without affecting other dependent setup?

I'm not sure I understand this question. If you're referring to the fact that some configuration parameters need to be set on multiple components, then we recommend configuring Mimir via a YAML file rather than CLI flags, and sharing the same YAML config across all components.
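
For example (illustrative):

mimir -config.file=mimir.yaml -target=ingester
mimir -config.file=mimir.yaml -target=querier
mimir -config.file=mimir.yaml -target=store-gateway

Each component reads the same file and only acts on the sections relevant to its own target.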