calestyo opened 2 months ago
Hello Chris,
The Thanos project is open-source, meaning anyone can contribute to improve it. Many maintainers and contributors often work on it in their own time, and some may receive partial sponsorship (i.e. some time during working hours) from their companies if the company benefits from the software.
We love our community and everyone who uses our software. We aim to create great software, but there is only so much we can do. Instead of asking numerous questions about other projects and assuming the solution is obvious, why not help us make the project better for everyone? I would love to see that mindset instead. I do understand your frustration, but some parts of your message are, quite frankly, unconstructive and not great reading for those who invest their time and effort in the project. Let's be positive and constructive. Thank you.
Regarding your question: about 99.5% of Thanos users run it on Kubernetes, often in HA across different nodes, zones, or even regions. Monitoring of the monitoring is in place, and a dead man's switch can be implemented as a last resort. Given the widespread use of Kubernetes, there has been less focus on running Thanos as a plain daemon, so improvements could be made in that area. For example, Kubernetes deployments typically include health and liveness checks (see the sketch below). We should definitely put some love into that part; it would benefit everyone.
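For illustration, here is a minimal sketch of such probes, assuming a component that exposes the usual `/-/healthy` and `/-/ready` endpoints on the default HTTP port 10902 (adjust to your deployment):

```yaml
# Container probe sketch for a Thanos component on Kubernetes.
# Port and thresholds are assumptions; tune them to your setup.
livenessProbe:
  httpGet:
    path: /-/healthy
    port: 10902
  periodSeconds: 30
  failureThreshold: 4
readinessProbe:
  httpGet:
    path: /-/ready
    port: 10902
  periodSeconds: 5
  failureThreshold: 20
```

Note that a liveness probe only catches a process that stops answering HTTP; a component that is wedged internally but still serves `/-/healthy` would not be restarted, which is exactly the kind of gap described below.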
Additionally, a panic should ideally make the process exit with code 1, which is worth investigating and potentially changing as well. So, thanks for the logs.
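For what it's worth, a minimal Go sketch of that idea (`run` is a hypothetical stand-in, not Thanos's actual code): Go exits with status 2 on an unrecovered panic, so turning a top-level panic into exit code 1 needs an explicit recover:

```go
package main

import (
	"fmt"
	"os"
)

// run stands in for a component's main loop (hypothetical, not Thanos code).
func run() error {
	panic("TSDB not ready")
}

func main() {
	// Recover panics on the main goroutine and turn them into exit code 1 so
	// supervisors like systemd or Kubernetes see an ordinary failure. Panics
	// raised on other goroutines are not caught by this deferred recover.
	defer func() {
		if r := recover(); r != nil {
			fmt.Fprintf(os.Stderr, "panic: %v\n", r)
			os.Exit(1)
		}
	}()
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```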
Thanos, Prometheus and Golang version used:
Object Storage Provider: filesystem
What happened:
I've seen this already before with `compact` (#7197), but it seems to be rather a general, critical design problem in Thanos, which has now also occurred at least in `alertmanager` and `receive`. Thanos seems incapable of simply failing in case of an error; it keeps running forever without ever actually retrying.
In the current case, two services were affected. First, `receive`:
What happened is that there was a network outage in our supercomputing centre, affecting the system this runs on. I'm not really sure why it thinks `no space left`, as there was definitely always plenty of space left (several TB); I guess the real cause was simply that the network filesystem was gone.
Fine. But the filesystem came back, systemd correctly remounted everything, and the network filesystem was usable again.
Yet `receive` still claims `TSDB not ready` and/or it simply never retried (the latter seems likely, as the last log entry was from the 22nd of June). When I restarted the service (without doing anything else; as I've said, systemd had long ago re-mounted the remote fs), it immediately worked again.
So the problem is really that it (Thanos) gets into some weird state and just never tries again, but never fails either.
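For illustration, a minimal Go sketch of the behaviour being asked for here (retry with backoff for a bounded time, then fail hard so a supervisor can restart the process); `reopenStorage` is a hypothetical stand-in, not a real Thanos API:

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// reopenStorage stands in for whatever operation got stuck, e.g. reopening
// the TSDB after the network filesystem came back (hypothetical).
func reopenStorage() error {
	return fmt.Errorf("no space left on device")
}

func main() {
	backoff := time.Second
	deadline := time.Now().Add(15 * time.Minute)
	for {
		err := reopenStorage()
		if err == nil {
			break // recovered, carry on serving
		}
		if time.Now().After(deadline) {
			// Exit non-zero instead of lingering in a broken state, so that
			// systemd or Kubernetes can notice and restart the process.
			fmt.Fprintf(os.Stderr, "giving up after retries: %v\n", err)
			os.Exit(1)
		}
		time.Sleep(backoff)
		if backoff < time.Minute {
			backoff *= 2 // exponential backoff, capped at one minute
		}
	}
}
```

Either outcome (recovery or a clean non-zero exit) is observable from the outside, unlike the stuck-but-running state described above.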
Related to that, `query` noticed that something was wrong:
I agree that it makes sense in this particular case to retry (even forever), as the problematic service is not `query` itself, but rather another one (`receive`).
These errors caused some strange behaviour in the `up` metric for the `thanos` job:
It went up for a few seconds and then down again (not sure why it ever went up again; it rather looks as if it shouldn't). I do have a very special alert to notice such single scrape failures (see the sketch below), which in fact did notice the problem and would have sent me a warning...
... unless, of course, `alertmanager` (yes I know, it's not Thanos) has the same strange design and can get into some weird internal state, but simply decides to keep running without ever doing anything again:
Not really sure what caused that, but I presume it's in fact also related to the networking issues our computing centre had.
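For reference, a hypothetical reconstruction of such a single-scrape-failure alert (the actual rule is not shown in this issue; the job name and window are assumptions):

```yaml
# Fires if any single scrape of the thanos job failed within the last
# 5 minutes; no `for:` delay, so even one failed scrape triggers it.
groups:
  - name: scrape-health
    rules:
      - alert: SingleScrapeFailure
        expr: min_over_time(up{job="thanos"}[5m]) == 0
        labels:
          severity: warning
        annotations:
          summary: "At least one scrape of {{ $labels.instance }} failed in the last 5 minutes"
```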
Yes I know, there's one day in between, but whereas it looked as if our particular nodes weren't affected much when the actual issue happened, they still restarted all VMs (which I think was at the later time, when `receive` went nuts).
So it claims something like "no IP found"... not really sure why; the IPs are statically configured, so even if networking was down, there should have been an IP.
Also, as before, I didn't "fix" anything with respect to that, but when I noticed the whole mess just now, all IPs were bound, all links were up, and all networking worked.
So it's the same case as with Thanos, just this time in Prometheus: it stumbles over something (which per se is of course fine), but thinks it would be better to keep running without ever trying anything again.
The dead `alertmanager`, in turn, swallowed any alerts (including the scrape failures for the `thanos` job from above).
What you expected to happen:
TBH (and don't take this as a rant or so), I don't know how one should be able to use Prometheus/Thanos in production when such issues seem to exist in numerous places. Consider that one wants to monitor e.g. RAIDs with Prometheus/Thanos, and broken drives go unnoticed because either Prometheus or Thanos gets into such a state where it is dysfunctional, doesn't retry, and doesn't fail either (failing would at least allow people, e.g. via systemd, to notice it).
While the two knocked themselves out, further drives fail until data gets lost.
So really, how is one expected to counter such issues and run the two in a useful way?
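One partial mitigation, sketched below as a systemd unit fragment (paths and flags are hypothetical): `Restart=` lets systemd restart a failed daemon, but only if the process actually exits non-zero, which is precisely the fail-fast behaviour this issue asks for; a wedged-but-still-running process remains invisible to systemd unless it also implements a watchdog.

```ini
[Service]
ExecStart=/usr/local/bin/thanos receive --tsdb.path=/var/lib/thanos/receive
Restart=on-failure
RestartSec=10s
```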
Environment:
Kernel (e.g. `uname -a`): 6.1.90-1

Cheers, Chris