Ruler: Evaluations stop without errors or logs

PabloPie commented 4 months ago

Thanos, Prometheus and Golang version used: Thanos v0.35.1, Prometheus v2.51.0

Object Storage Provider: GCS

What happened: After upgrading Thanos components from v0.32.4 to v0.35.1, Thanos Ruler progressively stops evaluating all recording rules. Evaluations in the UI don't show any errors, and Thanos Ruler doesn't log any either. The time since the last evaluation keeps growing, with no evaluations running, until Thanos Ruler is restarted. After rolling all components back to v0.32.4, we noticed that the issue occurs even when only Thanos Query is upgraded to v0.35.1.

What you expected to happen: Thanos Ruler respects the evaluation interval, or logs an error explaining why evaluation stopped.

How to reproduce it (as minimally and precisely as possible): We are running Thanos using the sidecar approach. Running Thanos Ruler against a Thanos Query v0.35.1 through a Thanos Query frontend (any version) reproduces the issue. It takes an arbitrary amount of time before Thanos Ruler stops evaluating (we have seen anywhere from 5 minutes to 3 hours).

Full logs to relevant components: No relevant logs

Anything else we need to know: We run evaluations in abort mode when there is a warning. Since the upgrade of Thanos Query, we get the new warning about counters having a non-standard name, which makes me wonder whether this is related to https://github.com/thanos-io/thanos/issues/7354

ffilippopoulos commented 2 months ago

We have also run into this behaviour. We are running Thanos v0.36.0 with the sidecar approach, and we observed that rules may stop evaluating for certain groups after the ruler has been running for a while. The only workaround is to restart the pods.

MichaHoffmann commented 2 months ago

cc @verejoel - did you solve that issue? I faintly remember that we discussed it on Slack.

MichaHoffmann commented 2 months ago

I think our ruler HTTP client is missing a timeout.

asiyani commented 2 months ago

If it helps, we managed to capture debug/pprof/goroutine output when the ThanosRuleNoEvaluationFor10Intervals alert was triggered.

09-09-24-thanos-rule-pprof-goroutine.v2.txt
09-09-24-thanos-rule-pprof-goroutine-debug-2.v2.txt

Build Information

version    0.36.0
revision   cfff5518d37756715a1cee43c42b21f4afccfbf2
branch     HEAD
buildUser  root@317f3d9783e7
buildDate  20240731-15:23:35
goVersion  go1.21.12
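For anyone who wants to capture a similar dump, here is a minimal sketch of fetching a full goroutine stack dump over HTTP. It assumes the process exposes Go's standard net/http/pprof handlers under /debug/pprof/. The helper names fetchGoroutineDump and newPprofServer are hypothetical, and the local test server only stands in for a real ruler HTTP address.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
	"time"
)

// fetchGoroutineDump downloads the text-format goroutine dump (debug=2)
// from a process that serves the standard pprof handlers.
func fetchGoroutineDump(baseURL string) (string, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(baseURL + "/debug/pprof/goroutine?debug=2")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status: %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

// newPprofServer starts a local server that only serves pprof.
// It is a stand-in for the real component, used here for demonstration.
func newPprofServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index)
	return httptest.NewServer(mux)
}

func main() {
	srv := newPprofServer()
	defer srv.Close()

	dump, err := fetchGoroutineDump(srv.URL)
	if err != nil {
		fmt.Println("fetch failed:", err)
		return
	}
	fmt.Printf("captured %d bytes of goroutine stacks\n", len(dump))
}
```

In a hang like the one described here, the interesting part of such a dump would be goroutines that have been blocked on a network read for a long time.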

ffilippopoulos commented 2 months ago

> I think our ruler http client is missing a timeout

@MichaHoffmann from a quick look through the code, would this mean adding a default overall request timeout to the HTTP client here: https://github.com/thanos-io/thanos/blob/main/pkg/clientconfig/http.go#L271? If the change is that simple, maybe we can help by opening a PR, since this issue is heavily affecting us right now in a very busy Kubernetes cluster.

MichaHoffmann commented 2 months ago

Yeah, I think this should at least turn it into an error instead of a deadlock. We should take the timeout from config, though. Technically this is also a breaking change, I guess, but I think it's for the better. A PR is very welcome, thank you for offering.

verejoel commented 3 weeks ago

Upgrading to 0.36.1 fixes this issue for us. We no longer see rules that hang. However, I do agree that setting a reasonable timeout makes sense in this case. As discussed with @MichaHoffmann, setting it to the rule group evaluation interval might make sense.

thanos-io / thanos

Ruler: Evaluations stop without errors or logs #7536