Open rgordill opened 7 months ago
@rgordill how long after enabling user-workload-monitoring do you see this alert?
Cc: @sthaha
Less than one day.
@rgordill can you follow these steps and share the diagnosis? https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ThanosRuleRuleEvaluationLatencyHigh.md
@rgordill To add to the above, could you also please share the following details:
version of the operator deployed (I am guessing 0.9.2)
size of the cluster
the rule that failed to evaluate from the thanos rules logs (or even the logs)
the UWM config
thanos-query.log thanos-ruler.log
Operator Version:
oc get ClusterServiceVersion -n openshift-operators |grep kepler
kepler-operator.v0.9.2 Kepler 0.9.2 kepler-operator.v0.9.0 Succeeded
Number of nodes:
oc get nodes --no-headers |wc -l
6
Number of pods:
oc get pods --no-headers -A |wc -l
345
Number of containers:
oc get pods -A -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].name}" |tr -s '[[:space:]]' '\n' |wc -l
785
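The three counts above can be collected in one go. This is only a sketch: the function names `count_names` and `cluster_size` are made up for illustration, and it assumes a logged-in `oc` on the PATH. It counts non-empty lines with `grep -c .` instead of `wc -l`, so a last name without a trailing newline is still counted.

```shell
# Count whitespace-separated names read from stdin.
count_names() {
  tr -s '[:space:]' '\n' | grep -c .
}

# Print node, pod and container counts for the current cluster.
# Assumes `oc` is logged in with permission to list nodes and pods.
cluster_size() {
  echo "nodes: $(oc get nodes --no-headers | grep -c .)"
  echo "pods: $(oc get pods --no-headers -A | grep -c .)"
  echo "containers: $(oc get pods -A \
    -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].name}" \
    | count_names)"
}
```

Calling `cluster_size` after sourcing the snippet prints the three figures on separate lines.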
The offending rule is clear from the alert:
[FIRING:2] openshift-user-workload-monitoring (ThanosRuleRuleEvaluationLatencyHigh thanos-ruler platform openshift-monitoring/k8s /thanos/data/.tmp-rules/ABORT/etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/openshift-kepler-operator-kepler-exporter-prom-rules-ed6d3a57-9819-4af5-a557-97df184cd242.yaml;kepler.rules warning)
The UWM config is very simple:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      retention: 7d
      resources:
        requests:
          cpu: 200m
          memory: 2Gi
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
I would just like to add that we have also been experiencing this alert for the last few weeks.
Here are our details as per rgordill's post.
Operator:
oc get ClusterServiceVersion -n openshift-operators |grep kepler
kepler-operator.v0.9.2 Kepler 0.9.2 kepler-operator.v0.9.0 Succeeded
oc get nodes --no-headers |wc -l
6
oc get pods --no-headers -A |wc -l
6164
oc get pods -A -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].name}" |tr -s '[[:space:]]' '\n' |wc -l
12007
Alert is clear and the same as above: ThanosRuleRuleEvaluationLatencyHigh Thanos Rule XXX.XXX.XXX.XXX:9092 in Namespace [openshift-user-workload-monitoring] has higher evaluation latency than interval for /thanos/data/.tmp-rules/ABORT/etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/openshift-kepler-operator-kepler-exporter-prom-rules.yaml;kepler.rules.
Is there any way to adjust the timeout value so that the query can succeed, please?
Is there any way to adjust the timeout value so that the query can succeed, please?
Unfortunately not. The rule is deployed by the operator, and any changes to it will be reverted by the operator.
@TitaniumBoy @rgordill, since the rule evaluation takes so long, could you please help us understand how many time series are active?
What does prometheus_tsdb_head_series give you?
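If the console is not handy, the value can probably be fetched from the CLI as well. The sketch below assumes the cluster exposes the `thanos-querier` route in `openshift-monitoring` and that the logged-in user has a token allowed to query it; `promql_query` is a hypothetical helper name, and whether the series for the user-workload Prometheus shows up there depends on the cluster.

```shell
# Run an instant PromQL query through the thanos-querier route.
# Assumptions: `oc` is logged in, `curl` is installed, and the route
# exists. -k skips TLS verification and is only for a quick check.
promql_query() {
  local host token
  host=$(oc -n openshift-monitoring get route thanos-querier \
    -o jsonpath='{.spec.host}')
  token=$(oc whoami -t)
  curl -sS -k -G \
    -H "Authorization: Bearer ${token}" \
    --data-urlencode "query=$1" \
    "https://${host}/api/v1/query"
}
```

Usage would be `promql_query 'prometheus_tsdb_head_series'`, which returns the JSON response of the Prometheus HTTP API with one result per Prometheus instance that reports the metric.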
Sorry for the late response. Here are the current values from prometheus_tsdb_head_series
What happened?
After deploying the kepler operator (default instance), the thanos ruler eventually takes too long to evaluate the rules and starts raising this alert:
Thanos Rule 10.128.2.207:9092 in Namespace NS openshift-user-workload-monitoring has higher evaluation latency than interval for /thanos/data/.tmp-rules/ABORT/etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/openshift-kepler-operator-kepler-exporter-prom-rules-ed6d3a57-9819-4af5-a557-97df184cd242.yaml;kepler.rules.
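For context, "higher evaluation latency than interval" means the last evaluation of the rule group took longer than the interval at which it is scheduled. Assuming the alert follows the upstream Thanos mixin, it compares `prometheus_rule_group_last_duration_seconds` with `prometheus_rule_group_interval_seconds` per rule group. The sketch below shows that comparison; the helper name `latency_high`, the `rule_group` matcher in the query, and the numbers in the usage note are illustrative assumptions, not values from this cluster.

```shell
# PromQL (assumed from the upstream mixin) giving, per rule group, how
# many seconds the last evaluation overran its interval. Positive
# values correspond to the alert condition.
OVERRUN_QUERY='prometheus_rule_group_last_duration_seconds{rule_group=~".*kepler.rules"} - prometheus_rule_group_interval_seconds{rule_group=~".*kepler.rules"}'

# Succeeds (exit 0) when the evaluation duration exceeds the interval.
# $1 = last evaluation duration in seconds, $2 = interval in seconds.
latency_high() {
  awk -v d="$1" -v i="$2" 'BEGIN { exit !(d + 0 > i + 0) }'
}
```

With made-up numbers, `latency_high 45.2 30` succeeds (the group overran its interval) while `latency_high 12 30` fails.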
What did you expect to happen?
No alerts raised.
How can we reproduce it (as minimally and precisely as possible)?
Fresh new 4.14 OpenShift installation (Bare Metal with OpenShift Data Foundation)
Anything else we need to know?
No response
Kepler image tag
Kubernetes version
Cloud provider or bare metal
OS version
Install tools
Kepler deployment config
Container runtime (CRI) and version (if applicable)
Related plugins (CNI, CSI, ...) and versions (if applicable)