thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

why can't thanos fail properly on errors (like everyone else)? #7496

Open calestyo opened 2 months ago

calestyo commented 2 months ago

Thanos, Prometheus and Golang version used:

# thanos --version
thanos, version 0.34.1 (branch: HEAD, revision: 4cf1559998bf6d8db3f9ca0fde2a00d217d4e23e)
  build user:       root@61db75277a55
  build date:       20240219-17:13:48
  go version:       go1.21.7
  platform:         linux/amd64
  tags:             netgo

Object Storage Provider: filesystem

What happened: I've already seen this before with compact (#7197), but it seems to be a rather general, critical design problem in Thanos, which has now also shown up at least in alertmanager and receive.

Thanos seems incapable of simply failing in case of an error; instead it keeps running forever without ever actually retrying to continue.

In the current case two services were affected. First, receive:

(more of the following)
Jun 22 22:31:12 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T20:31:12.473601185Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020198: no space left on device" msg="internal server error"
Jun 22 22:31:12 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T20:31:12.766539396Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020198: no space left on device" msg="internal server error"
Jun 22 22:31:13 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T20:31:13.370390121Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020198: no space left on device" msg="internal server error"
Jun 22 22:46:18 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T20:46:18.142270357Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020199: no space left on device" msg="internal server error"
Jun 22 22:46:18 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T20:46:18.72264366Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020199: no space left on device" msg="internal server error"
Jun 22 22:46:19 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T20:46:19.39288321Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020199: no space left on device" msg="internal server error"
Jun 22 22:46:19 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T20:46:19.867254808Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020199: no space left on device" msg="internal server error"
Jun 22 22:46:20 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T20:46:20.491156935Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020199: no space left on device" msg="internal server error"
Jun 22 22:46:21 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T20:46:21.125543223Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020199: no space left on device" msg="internal server error"
Jun 22 23:00:28 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T21:00:28.019016426Z caller=db.go:999 level=error component=receive component=multi-tsdb tenant=etp.physik.lmu.de msg="compaction failed" err="WAL truncation in Compact: create checkpoint: read segments: corruption in segment /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020072 at 105286437: unexpected full record"
Jun 22 23:01:28 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T21:01:28.851875565Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020201: no space left on device" msg="internal server error"
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]: ts=2024-06-22T21:01:29.057743344Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : commit samples: write to WAL: log samples: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020201: no space left on device" msg="internal server error"
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]: panic: write /data/main/thanos/receive/tsdb/etp.physik.lmu.de/chunks_head/005393: no space left on device
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]: goroutine 1812 [running]:
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]: github.com/prometheus/prometheus/tsdb.handleChunkWriteError({0x35f2840, 0xc0306da660})
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]:         /go/pkg/mod/github.com/prometheus/prometheus@v0.48.1-0.20231212213830-d0c2d9c0b9cc/tsdb/head_append.go:1491 +0x5c
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]: github.com/prometheus/prometheus/tsdb/chunks.(*ChunkDiskMapper).WriteChunk(0xc0007624b0, 0xc000854e58?, 0x10000000044ede6?, 0xffffffffffffffff?, {0x36180a0, 0xc026646940}, 0x98?, 0x30b4488)
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]:         /go/pkg/mod/github.com/prometheus/prometheus@v0.48.1-0.20231212213830-d0c2d9c0b9cc/tsdb/chunks/head_chunks.go:443 +0x14f
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]: github.com/prometheus/prometheus/tsdb.(*memSeries).mmapChunks(0xc0078d0b40, 0x408f7e?)
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]:         /go/pkg/mod/github.com/prometheus/prometheus@v0.48.1-0.20231212213830-d0c2d9c0b9cc/tsdb/head_append.go:1473 +0xba
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]: github.com/prometheus/prometheus/tsdb.(*Head).mmapHeadChunks(0xc000700400)
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]:         /go/pkg/mod/github.com/prometheus/prometheus@v0.48.1-0.20231212213830-d0c2d9c0b9cc/tsdb/head.go:1703 +0x1d2
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]: github.com/prometheus/prometheus/tsdb.(*DB).run(0xc0009ead00, {0x3611728, 0xc00098ee10})
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]:         /go/pkg/mod/github.com/prometheus/prometheus@v0.48.1-0.20231212213830-d0c2d9c0b9cc/tsdb/db.go:992 +0x372
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]: created by github.com/prometheus/prometheus/tsdb.open in goroutine 145
Jun 22 23:01:29 lcg-lrz-monitor-ext thanos@receive[1130]:         /go/pkg/mod/github.com/prometheus/prometheus@v0.48.1-0.20231212213830-d0c2d9c0b9cc/tsdb/db.go:926 +0x1571
Jun 22 23:01:31 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:31.059900432Z caller=intrumentation.go:67 level=warn component=receive msg="changing probe status" status=not-ready reason="hashring has changed; server is not ready to receive requests"
Jun 22 23:01:31 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:31.421109282Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : get appender: TSDB not ready" msg="internal server error"
Jun 22 23:01:33 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:33.353567249Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : get appender: TSDB not ready" msg="internal server error"
Jun 22 23:01:37 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:37.202569028Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : get appender: TSDB not ready" msg="internal server error"
Jun 22 23:01:42 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:42.209377803Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : get appender: TSDB not ready" msg="internal server error"
Jun 22 23:01:43 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:43.519402114Z caller=db.go:913 level=warn component=receive component=multi-tsdb tenant=etp.physik.lmu.de msg="Encountered WAL read error, attempting repair" err="read records: corruption in segment /data/main/thanos/receive/tsdb/etp.physik.lmu.de/wal/00020072 at 105286437: unexpected full record"
Jun 22 23:01:43 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:43.519469388Z caller=wlog.go:428 level=warn component=receive component=multi-tsdb tenant=etp.physik.lmu.de msg="Starting corruption repair" segment=20072 offset=105286437
Jun 22 23:01:43 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:43.519585196Z caller=wlog.go:436 level=warn component=receive component=multi-tsdb tenant=etp.physik.lmu.de msg="Deleting all segments newer than corrupted segment" segment=20072
Jun 22 23:01:47 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:47.217663018Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : get appender: TSDB not ready" msg="internal server error"
Jun 22 23:01:52 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:52.222077618Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : get appender: TSDB not ready" msg="internal server error"
Jun 22 23:01:54 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:54.156891857Z caller=wlog.go:458 level=warn component=receive component=multi-tsdb tenant=etp.physik.lmu.de msg="Rewrite corrupted segment" segment=20072
Jun 22 23:01:57 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:01:57.230255881Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : get appender: TSDB not ready" msg="internal server error"
Jun 22 23:02:02 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:02:02.239486872Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : get appender: TSDB not ready" msg="internal server error"
Jun 22 23:02:07 lcg-lrz-monitor-ext thanos@receive[166890]: ts=2024-06-22T21:02:07.247209849Z caller=handler.go:600 level=error component=receive component=receive-handler tenant=etp.physik.lmu.de err="store locally for endpoint : get appender: TSDB not ready" msg="internal server error"

What happened is that there was a network outage in our supercomputing centre, affecting the system this is running on. I'm not really sure why it thinks there is no space left, as there was definitely always plenty of space (several TB) available. I guess the real cause was simply that the network filesystem was gone.
Fine. But the filesystem came back, systemd correctly remounted everything, and the network filesystem was usable again.

Yet receive still claimed that the TSDB was not ready, and/or it simply never retried (the latter seems likely, as the last log entry was from the 22nd of June).
When I restarted the service (without doing anything else; as I've said, systemd had long since re-mounted the remote filesystem), it immediately worked again.
So the problem is really that it (Thanos) gets into some weird state and just never tries again, but never fails either.
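Just to make concrete what I'd consider sane behaviour, here is a rough, generic Go sketch (under my own assumptions, not actual Thanos code; flushWAL is a made-up placeholder for any transiently failing operation): either the operation eventually succeeds after retries, or the process gives up and exits non-zero, so that a supervisor like systemd can restart it once the underlying problem (here: the re-mounted network filesystem) is resolved:

// Generic sketch, not Thanos code: retry a transiently failing operation with
// backoff and, if it keeps failing, exit non-zero instead of lingering in a
// half-dead state.
package main

import (
    "fmt"
    "log"
    "os"
    "time"
)

// flushWAL is a hypothetical stand-in for any operation that may fail
// transiently, e.g. a write to a network filesystem that has gone away.
func flushWAL() error {
    return fmt.Errorf("write WAL segment: no space left on device")
}

func main() {
    const maxAttempts = 5
    backoff := time.Second

    for attempt := 1; attempt <= maxAttempts; attempt++ {
        err := flushWAL()
        if err == nil {
            log.Println("flush succeeded, continuing normally")
            return
        }
        log.Printf("attempt %d/%d failed: %v", attempt, maxAttempts, err)
        time.Sleep(backoff)
        backoff *= 2 // exponential backoff between attempts
    }

    // Neither hang forever nor silently give up: fail visibly, so that
    // systemd (Restart=on-failure) or Kubernetes can restart the process
    // once the underlying problem is gone.
    log.Println("giving up after repeated failures, exiting non-zero")
    os.Exit(1)
}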

Related to that, query noticed that something was wrong:

Jul 01 02:50:47 lcg-lrz-monitor-ext thanos@query[980]: ts=2024-07-01T00:50:47.889315154Z caller=endpointset.go:392 level=warn component=endpointset msg="new endpoint creation failed" err="dialing connection: context deadline exceeded: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.1.1:10907: connect: connection refused\"" address=lcg-lrz-monitor-ext.grid.lrz.de:10907
Jul 01 02:50:52 lcg-lrz-monitor-ext thanos@query[980]: ts=2024-07-01T00:50:52.890427761Z caller=endpointset.go:392 level=warn component=endpointset msg="new endpoint creation failed" err="dialing connection: context deadline exceeded: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.1.1:10907: connect: connection refused\"" address=lcg-lrz-monitor-ext.grid.lrz.de:10907
Jul 01 02:50:57 lcg-lrz-monitor-ext thanos@query[980]: ts=2024-07-01T00:50:57.891779069Z caller=endpointset.go:392 level=warn component=endpointset msg="new endpoint creation failed" err="dialing connection: context deadline exceeded: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.1.1:10907: connect: connection refused\"" address=lcg-lrz-monitor-ext.grid.lrz.de:10907
Jul 01 02:51:02 lcg-lrz-monitor-ext thanos@query[980]: ts=2024-07-01T00:51:02.893327788Z caller=endpointset.go:392 level=warn component=endpointset msg="new endpoint creation failed" err="dialing connection: context deadline exceeded: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.1.1:10907: connect: connection refused\"" address=lcg-lrz-monitor-ext.grid.lrz.de:10907
Jul 01 02:51:07 lcg-lrz-monitor-ext thanos@query[980]: ts=2024-07-01T00:51:07.894199021Z caller=endpointset.go:392 level=warn component=endpointset msg="new endpoint creation failed" err="dialing connection: context deadline exceeded: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.1.1:10907: connect: connection refused\"" address=lcg-lrz-monitor-ext.grid.lrz.de:10907
Jul 01 02:51:12 lcg-lrz-monitor-ext thanos@query[980]: ts=2024-07-01T00:51:12.894948585Z caller=endpointset.go:392 level=warn component=endpointset msg="new endpoint creation failed" err="dialing connection: context deadline exceeded: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.1.1:10907: connect: connection refused\"" address=lcg-lrz-monitor-ext.grid.lrz.de:10907
Jul 01 02:51:17 lcg-lrz-monitor-ext thanos@query[980]: ts=2024-07-01T00:51:17.895746362Z caller=endpointset.go:392 level=warn component=endpointset msg="new endpoint creation failed" err="dialing connection: context deadline exceeded: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.1.1:10907: connect: connection refused\"" address=lcg-lrz-monitor-ext.grid.lrz.de:10907
(gazillions of these)

I agree that it makes sense in this particular case to retry (even forever), as the problematic service is not query itself, but rather another one (receive).

These errors caused some strange behaviour in the up metric for the thanos job: [Screenshot from 2024-07-01 03-29-18]

It went up for a few seconds and then down again (not sure why it ever went up again; it rather looks as if it shouldn't have). I do have a very special alert to notice such single scrape failures, which in fact did notice the problem and would have sent me a warning...

... unless, of course, alertmanager (yes, I know, it's not Thanos) has the same strange design and can get into some weird internal state but simply decides to keep running without ever doing anything again:

Jun 21 11:44:51 lcg-lrz-monitor prometheus-alertmanager[618]: ts=2024-06-21T09:44:51.658Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jun 21 11:44:51 lcg-lrz-monitor prometheus-alertmanager[618]: ts=2024-06-21T09:44:51.661Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Jun 21 11:44:52 lcg-lrz-monitor prometheus-alertmanager[914]: ts=2024-06-21T09:44:52.011Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jun 21 11:44:52 lcg-lrz-monitor prometheus-alertmanager[914]: ts=2024-06-21T09:44:52.012Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Jun 21 11:44:52 lcg-lrz-monitor prometheus-alertmanager[1171]: ts=2024-06-21T09:44:52.259Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jun 21 11:44:52 lcg-lrz-monitor prometheus-alertmanager[1171]: ts=2024-06-21T09:44:52.260Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Jun 21 11:44:52 lcg-lrz-monitor prometheus-alertmanager[1198]: ts=2024-06-21T09:44:52.500Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jun 21 11:44:52 lcg-lrz-monitor prometheus-alertmanager[1198]: ts=2024-06-21T09:44:52.501Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
Jun 21 11:44:52 lcg-lrz-monitor prometheus-alertmanager[1284]: ts=2024-06-21T09:44:52.763Z caller=cluster.go:178 level=warn component=cluster err="couldn't deduce an advertise address: no private IP found, explicit advertise addr not provided"
Jun 21 11:44:52 lcg-lrz-monitor prometheus-alertmanager[1284]: ts=2024-06-21T09:44:52.763Z caller=main.go:273 level=error msg="unable to initialize gossip mesh" err="create memberlist: Failed to get final advertise address: No private IP address found, and explicit IP not provided"
(here I restarted)
Jul 01 02:32:38 lcg-lrz-monitor prometheus-alertmanager[3204058]: ts=2024-07-01T00:32:38.070Z caller=main.go:498 level=warn component=configuration msg="repeat_interval is greater than the data retention period. It can lead to notifications being repeated more often than expected." repeat_interval=672h0m0s retention=120h0m0s route="{}/{instance=~\"(?i)^lcg-lrz-(?:admin|konsole)\\\\.grid\\\\.lrz\\\\.de$\"}/{alertname=\"node_upgrades_non-security_apt\"}"

I'm not really sure what caused that, but I presume it is in fact also related to the networking issues our computing centre had.
Yes, I know, there's one day in between, but while it looked as if our particular nodes weren't affected much when the actual issue happened, they still restarted all VMs (which I think was at the later time, when receive went nuts).

So it claims something like no IP found... I'm not really sure why; the IPs are statically configured, so even if networking was down there should have been an IP.
Also, as before, I didn't "fix" anything with respect to that, but when I noticed the whole mess just now, all IPs were bound, all links were up, and all networking worked.

So it's the same case as with Thanos, just this time in Prometheus' Alertmanager: it stumbles over something (which per se is of course fine), but thinks it would be better to keep running without ever trying anything again.

The dead alertmanager, in turn, swallowed any alerts (including those for the scrape failures of the thanos job from above).

What you expected to happen:

TBH (and don't take this as a rant or anything), I don't know how one is supposed to use Prometheus/Thanos in production when such issues seem to exist in numerous places. Consider that one wants to monitor e.g. RAIDs with Prometheus/Thanos, and broken drives go unnoticed because either Prometheus or Thanos gets into such a state where it is dysfunctional, doesn't retry, and doesn't fail either (failing would at least allow people, e.g. via systemd, to notice it).
While the two have knocked themselves out, further drives fail until data gets lost.

So really, how is one expected to counter such issues and run the two in a useful way?

Environment:

Cheers, Chris

wiardvanrij commented 1 month ago

Hello Chris,

The Thanos project is open source, meaning anyone can contribute to improve it. Many maintainers and contributors work on it in their own time, and some may receive partial sponsorship (i.e. some time during working hours) from their companies if the company benefits from the software.

We love our community and everyone who uses our software. We aim to create great software, but there is only so much we can do. Instead of asking numerous questions about other projects and assuming the obvious solution, why not help us make the project better for everyone? I would love to see that mindset instead. I do understand your frustration, but some parts of your message are, quite frankly, unconstructive and not great to read for those who invest their time and effort into the project. Let's be positive and constructive. Thank you.

Regarding your question: about 99.5% of Thanos users run it on Kubernetes, often in HA across different nodes, zones, or even regions. Monitoring of the monitoring is in place, and a dead man's switch can be implemented as a last resort. Given the widespread use of Kubernetes, there's less focus on running Thanos as a plain daemon, so improvements could certainly be made in that area. For example, Kubernetes deployments typically rely on readiness and liveness checks. It may well be that we should put some love into that part, which would benefit everyone.
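To sketch the liveness-check idea (purely illustrative Go under stated assumptions, not Thanos's actual implementation; the endpoint path and port here are placeholders): expose a health endpoint that starts failing once the component has hit a condition it cannot recover from, so that a Kubernetes liveness probe (or a systemd watchdog) restarts the process instead of leaving it half-dead:

// Purely illustrative sketch, not Thanos's actual code: a health endpoint that
// starts failing once the component has hit an unrecoverable error, so an
// external supervisor can restart the process.
package main

import (
    "log"
    "net/http"
    "sync/atomic"
)

// healthy is flipped to false once the component detects a fatal condition
// it cannot recover from on its own.
var healthy atomic.Bool

func main() {
    healthy.Store(true)

    http.HandleFunc("/-/healthy", func(w http.ResponseWriter, r *http.Request) {
        if healthy.Load() {
            w.WriteHeader(http.StatusOK)
            return
        }
        // A non-2xx response makes a Kubernetes liveness probe fail, which
        // in turn gets the pod restarted.
        w.WriteHeader(http.StatusServiceUnavailable)
    })

    // Elsewhere, code that detects a persistent fatal error (e.g. "no space
    // left on device" beyond a retry budget) would call: healthy.Store(false)

    log.Fatal(http.ListenAndServe(":10902", nil))
}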

Additionally, a panic should ideally return an exit code 1, which is worth investigating and potentially changing as well. So, thanks for the logs.
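For context, a minimal sketch of what an unrecovered panic already does today (standard Go runtime behaviour, not Thanos-specific code): it terminates the whole process with a non-zero exit status, which a supervisor configured to restart on failure can act on; the harder cases are the ones where the process keeps running in a degraded state without panicking:

// Minimal sketch of standard Go runtime behaviour: an unrecovered panic in any
// goroutine terminates the whole process with a non-zero exit status.
package main

import (
    "fmt"
    "time"
)

func main() {
    go func() {
        // Simulate a fatal storage error surfacing as a panic, similar to
        // the chunks_head write failure in the logs above.
        panic(fmt.Errorf("write chunks_head: no space left on device"))
    }()

    // The unrecovered panic above crashes the whole program long before this
    // sleep finishes; the process exits with a non-zero status instead of
    // lingering in a half-dead state.
    time.Sleep(5 * time.Second)
}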