Open yeya24 opened 1 month ago
Any idea how to fix this issue? My current thinking is to move the partial response warning metric to Thanos Querier and remove it from Ruler. Thanos Querier can tell whether a warning comes from the storage layer or from the engine, so it can emit the metric for partial responses only.
https://github.com/prometheus/prometheus/issues/14135
Umm, actually this doesn't help solve the problem. There are still warnings from the PromQL engine. I think what we might be able to do is check whether the warning is a PromQL warning or not (https://github.com/prometheus/prometheus/pull/14327/files#diff-d8edc673795ebafcce64036e85a031d0fd3d09cbcaea9bb1cc140373542afc92R91) and separate the two kinds.
Problem
Query response warnings were used in Thanos to propagate partial response information of Store APIs.
https://thanos.io/tip/components/rule.md/#must-have-essential-ruler-alerts recommends alerting on the thanos_rule_evaluation_with_warnings_total metric, and we have this alert in the Thanos mixins as well. However, this metric has been broken since Prometheus started to propagate warnings from the engine (https://github.com/prometheus/prometheus/pull/12152). For example, a metric name that doesn't end with _total will result in a warning, causing the thanos_rule_evaluation_with_warnings_total metric to increase and trigger the alarm.
Proposal
thanos_rule_evaluation_with_warnings_total, let's include warnings from partial response only or