Open PabloPie opened 4 months ago
We also run into this behaviour. We are running thanos v0.36.0 following sidecar approach and we observed that rules may stop evaluating for certain groups after running the ruler for a while. The only work around is to restart pods.
cc @verejoel - did you solve that issue? I faintly remember that we discussed it on slack
I think our ruler http client is missing a timeout
If it helps we managed to capture debug/pprof/goroutine
output when ThanosRuleNoEvaluationFor10Intervals alert was triggered.
09-09-24-thanos-rule-pprof-goroutine.v2.txt 09-09-24-thanos-rule-pprof-goroutine-debug-2.v2.txt Build Information
version 0.36.0
revision cfff5518d37756715a1cee43c42b21f4afccfbf2
branch HEAD
buildUser root@317f3d9783e7
buildDate 20240731-15:23:35
goVersion go1.21.12
I think our ruler http client is missing a timeout
@MichaHoffmann having a quick look through the code, this will mean adding a default overall request timeout to the http client here: https://github.com/thanos-io/thanos/blob/main/pkg/clientconfig/http.go#L271? If the change is that simple, maybe we can help PRing since this issue is heavily affecting us now in a very busy Kubernetes cluster.
Yeah I think that this at least should make this an error instead of a deadlock. We should take a timeout from config though, technically this is also a breaking change I guess but I think for the better. PR is very welcome, thank you for offering
Upgrading to 0.36.1 fixes this issue for us. We no longer see rules that hang. However, do agree that setting a reasonable timeout makes sense in this case. As discussed with @MichaHoffmann setting it to the rule group evaluation interval might make sense.
Thanos, Prometheus and Golang version used: Thanos v0.35.1, Prometheus v2.51.0
Object Storage Provider: GCS
What happened: After upgrading Thanos components from v0.32.4 to v0.35.1, Thanos Ruler progressively stops evaluating all recording rules. Evaluations in the UI don't show any errors, and Thanos Ruler doesn't log any errors either. Last evaluation time keeps increasing without any evaluations being done until Thanos Ruler is restarted. After a rollback of all components to v0.32.4, we noticed that the issue happens even when you only upgrade Thanos Query to v0.35.1.
What you expected to happen: Thanos Ruler respects the evaluations interval or logs an error explaining why the evaluation stopped.
How to reproduce it (as minimally and precisely as possible): We are running Thanos using the sidecar approach. Running Thanos Ruler against a Thanos Query v0.35.1 through a Thanos Query frontend (any version) reproduces the issue. It takes an arbitrary amount of time before Thanos Ruler stops doing evaluations (we have seen 5 minutes up to 3 hours until it stops).
Full logs to relevant components: No relevant logs
Anything else we need to know: We run evaluations in abort mode when there is a warning and since the upgrade of Thanos Query we get the new warning about the counters having a non-standard name, which makes me wonder if it's related to https://github.com/thanos-io/thanos/issues/7354