thanos-io / thanos

Highly available Prometheus setup with long term storage capabilities. A CNCF Incubating project.
https://thanos.io
Apache License 2.0

Ruler evaluation warning false alarm caused by engine warnings #7354

Open yeya24 opened 1 month ago

yeya24 commented 1 month ago

Problem

In Thanos, query response warnings were used to propagate partial response information from Store APIs.

https://thanos.io/tip/components/rule.md/#must-have-essential-ruler-alerts recommends alerting on the thanos_rule_evaluation_with_warnings_total metric, and we ship this alert in the Thanos mixins as well.

> thanos_rule_evaluation_with_warnings_total. If you choose to use Rules and Alerts with [partial response strategy’s](https://thanos.io/tip/components/rule.md/#partial-response) value as “warn”, this metric will tell you how many evaluation ended up with some kind of warning. To see the actual warnings see WARN log level. This might suggest that those evaluations return partial response and might not be accurate.

However, this metric has been broken since Prometheus started propagating warnings from the PromQL engine in https://github.com/prometheus/prometheus/pull/12152. For example, a metric name that doesn't end with _total results in a warning, which increases thanos_rule_evaluation_with_warnings_total and triggers the alert even though no partial response occurred.
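
To make the failure mode concrete, here is a minimal Go sketch of a counter that is bumped whenever an evaluation carries any warning. It is not the Ruler's actual code: `evalResult`, `countEvaluationsWithWarnings`, `sampleResults`, and the warning strings are all hypothetical names made up for illustration.

```go
package main

import "fmt"

// evalResult is a hypothetical stand-in for the response of one rule evaluation.
type evalResult struct {
	warnings []string
}

// countEvaluationsWithWarnings mimics a counter that increases for every
// evaluation with at least one warning, regardless of where it came from.
func countEvaluationsWithWarnings(results []evalResult) int {
	n := 0
	for _, r := range results {
		if len(r.warnings) > 0 {
			n++
		}
	}
	return n
}

// sampleResults returns illustrative data; the message texts are assumptions.
func sampleResults() []evalResult {
	return []evalResult{
		{warnings: nil}, // clean evaluation
		{warnings: []string{"partial response: store unreachable"}},
		{warnings: []string{"engine: metric name does not end in _total"}},
	}
}

func main() {
	// Only one evaluation hit a partial response, yet two are counted.
	fmt.Println(countEvaluationsWithWarnings(sampleResults())) // prints 2
}
```

The third evaluation returned complete data, but it is counted all the same, which is the false alarm described above.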

Proposal

yeya24 commented 1 month ago

Any ideas on how to fix this issue? My current thinking is to move the partial response warning metric to Thanos Querier and remove it from Ruler. Thanos Querier can tell whether a warning comes from the storage layer or from the engine, so it can emit the metric for partial responses only.
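
A rough sketch of that idea in Go, assuming the Querier can tag each warning with its origin before merging them. The types `warningSource`, `taggedWarning`, and `partialResponseCounter` are hypothetical and do not exist in Thanos.

```go
package main

import "fmt"

// warningSource records where a warning originated (hypothetical type).
type warningSource int

const (
	sourceStorage warningSource = iota // e.g. a Store API partial response
	sourceEngine                       // e.g. a PromQL engine warning
)

// taggedWarning pairs a warning message with its origin.
type taggedWarning struct {
	source warningSource
	msg    string
}

// partialResponseCounter stands in for a metric that should only increase
// when a query saw a storage-layer warning.
type partialResponseCounter struct {
	total int
}

// observe increments the counter at most once per query, and only if at
// least one warning came from the storage layer.
func (c *partialResponseCounter) observe(warnings []taggedWarning) {
	for _, w := range warnings {
		if w.source == sourceStorage {
			c.total++
			return
		}
	}
}

func main() {
	c := &partialResponseCounter{}
	c.observe([]taggedWarning{{sourceEngine, "engine warning"}})
	c.observe([]taggedWarning{{sourceStorage, "store timed out"}, {sourceEngine, "engine warning"}})
	fmt.Println(c.total) // prints 1
}
```

With this shape the engine warnings are still returned to the caller, they just no longer feed the partial response metric.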

yeya24 commented 1 week ago

https://github.com/prometheus/prometheus/issues/14135

Umm, actually this doesn't help solve the problem: there are still warnings from the PromQL engine. I think what we might be able to do is check whether a warning is a PromQL warning or not, as in https://github.com/prometheus/prometheus/pull/14327/files#diff-d8edc673795ebafcce64036e85a031d0fd3d09cbcaea9bb1cc140373542afc92R91, and handle those separately.
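
One possible shape for that check, sketched in Go under the assumption that engine warnings wrap a recognizable sentinel error that can be matched with `errors.Is`. `errPromQLWarning` and `errPromQLInfo` are local stand-ins defined here for illustration; the real identifiers in Prometheus may differ, and the warning texts are made up.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for whatever the engine exposes.
var (
	errPromQLWarning = errors.New("PromQL warning")
	errPromQLInfo    = errors.New("PromQL info")
)

// isEngineWarning reports whether err wraps one of the engine sentinels.
func isEngineWarning(err error) bool {
	return errors.Is(err, errPromQLWarning) || errors.Is(err, errPromQLInfo)
}

// splitWarnings separates engine warnings from everything else, so that
// only the remainder is treated as a possible partial response.
func splitWarnings(warnings []error) (engine, other []error) {
	for _, w := range warnings {
		if isEngineWarning(w) {
			engine = append(engine, w)
		} else {
			other = append(other, w)
		}
	}
	return engine, other
}

// sampleWarnings returns illustrative warnings; the texts are assumptions.
func sampleWarnings() []error {
	return []error{
		fmt.Errorf("%w: metric might not be a counter", errPromQLInfo),
		fmt.Errorf("%w: something engine related", errPromQLWarning),
		errors.New("partial response: store timed out"),
	}
}

func main() {
	engine, other := splitWarnings(sampleWarnings())
	fmt.Println(len(engine), len(other)) // prints 2 1
}
```

The Ruler could then increase thanos_rule_evaluation_with_warnings_total only when `other` is non-empty, and log the engine warnings without alerting on them.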