sustainable-computing-io / kepler

Kepler (Kubernetes-based Efficient Power Level Exporter) uses eBPF to probe performance counters and other system stats, uses ML models to estimate workload energy consumption based on these stats, and exports the results as Prometheus metrics.
https://sustainable-computing.io
Apache License 2.0

Cluster is raising "ThanosRuleRuleEvaluationLatencyHigh" Warning in OpenShift 4.14 #1105

Open rgordill opened 7 months ago

rgordill commented 7 months ago

What happened?

When deploying the kepler operator (default instance), after some time the Thanos Ruler takes too long to evaluate the rules and starts raising this alert:

Thanos Rule 10.128.2.207:9092 in Namespace NS openshift-user-workload-monitoring has higher evaluation latency than interval for /thanos/data/.tmp-rules/ABORT/etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/openshift-kepler-operator-kepler-exporter-prom-rules-ed6d3a57-9819-4af5-a557-97df184cd242.yaml;kepler.rules.

What did you expect to happen?

No alerts being raised.

How can we reproduce it (as minimally and precisely as possible)?

Fresh new 4.14 OpenShift installation (Bare Metal with OpenShift Data Foundation)

Anything else we need to know?

No response

Kepler image tag

quay.io/sustainable_computing_io/kepler:release-0.6.1
quay.io/sustainable_computing_io/kepler@sha256:314949285b3be103bf26ac4f2dd0a3301cac0f3d0a36ffe97f62dbc0bd5a3f99

Kubernetes version

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.4", GitCommit:"9b1e0d27df3cf7b2ea878cd668ce709cc4e4c41a", GitTreeState:"clean", BuildDate:"2023-11-03T06:26:26Z", GoVersion:"go1.20.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.6+f67aeb3", GitCommit:"f3ec0ed759cde48849b6e3117c091b7db90c95fa", GitTreeState:"clean", BuildDate:"2023-10-20T22:20:44Z", GoVersion:"go1.20.10 X:strictfipsruntime", Compiler:"gc", Platform:"linux/amd64"}
```

Cloud provider or bare metal

Bare Metal servers

OS version

```console
# On Linux:
$ cat /etc/os-release
# paste output here
$ uname -a
# paste output here

# On Windows:
C:\> wmic os get Caption, Version, BuildNumber, OSArchitecture
# paste output here
```

Install tools

Kepler deployment config

For Kubernetes:

```console
$ KEPLER_NAMESPACE=kepler

# provide kepler configmap
$ kubectl get configmap kepler-cfm -n ${KEPLER_NAMESPACE}
# paste output here

# provide kepler deployment description
$ kubectl describe deployment kepler-exporter -n ${KEPLER_NAMESPACE}
```

For standalone: put your Kepler command arguments here.

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

vimalk78 commented 7 months ago

@rgordill, can you mention how long after enabling user-workload-monitoring you see this alert?

vimalk78 commented 7 months ago

Cc: @sthaha

rgordill commented 7 months ago

Less than one day.

rootfs commented 7 months ago

@rgordill can you follow these steps and share the diagnosis? https://github.com/openshift/runbooks/blob/master/alerts/cluster-monitoring-operator/ThanosRuleRuleEvaluationLatencyHigh.md
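
In essence, the alert fires when a rule group's last evaluation takes longer than its configured interval. As a starting point for the diagnosis, the two can be compared for the kepler.rules group; the sketch below assumes the default thanos-querier route in openshift-monitoring and the standard Prometheus rule-manager metrics, so adjust names to your cluster:

```console
$ TOKEN=$(oc whoami -t)
$ HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')

# How long the kepler.rules group took on its last evaluation...
$ curl -skG -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=prometheus_rule_group_last_duration_seconds{rule_group=~".*kepler.rules"}'

# ...versus the interval it is expected to complete within.
$ curl -skG -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=prometheus_rule_group_interval_seconds{rule_group=~".*kepler.rules"}'
```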

sthaha commented 7 months ago

@rgordill To add to the above, could you also please share the following details:

rgordill commented 7 months ago

thanos-query.log thanos-ruler.log

Operator Version:

oc get ClusterServiceVersion -n openshift-operators |grep kepler
kepler-operator.v0.9.2              Kepler                                                    0.9.2          kepler-operator.v0.9.0              Succeeded

Number of nodes:

oc get nodes --no-headers |wc -l
6

Number of pods:

oc get pods --no-headers -A |wc -l
345

Number of containers:

oc get pods -A -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].name}" |tr -s '[[:space:]]' '\n' |wc -l
785

The offending rule group is clear from the alert:

[FIRING:2] openshift-user-workload-monitoring (ThanosRuleRuleEvaluationLatencyHigh thanos-ruler platform openshift-monitoring/k8s /thanos/data/.tmp-rules/ABORT/etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/openshift-kepler-operator-kepler-exporter-prom-rules-ed6d3a57-9819-4af5-a557-97df184cd242.yaml;kepler.rules warning)
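
The rule-file path in the alert suggests the group comes from a PrometheusRule named kepler-exporter-prom-rules in the openshift-kepler-operator namespace (both names are inferred from the path above, so treat them as assumptions). It can be inspected with something like:

```console
# Names inferred from the alert's rule-file path; adjust if your install differs.
$ oc get prometheusrule -A | grep kepler
$ oc get prometheusrule kepler-exporter-prom-rules -n openshift-kepler-operator -o yaml
```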

The UWM configuration is quite simple:

apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus: 
      retention: 7d 
      resources:
        requests:
          cpu: 200m 
          memory: 2Gi 
      volumeClaimTemplate:
        spec:
          resources:
            requests:
              storage: 20Gi
    alertmanager:
      enabled: true
      enableAlertmanagerConfig: true
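
For completeness, the ConfigMap above is applied like any other manifest and is picked up by the user-workload-monitoring stack; a minimal sketch, assuming it is saved as user-workload-monitoring-config.yaml:

```console
$ oc apply -f user-workload-monitoring-config.yaml
$ oc -n openshift-user-workload-monitoring get pods
```
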
TitaniumBoy commented 7 months ago

I would just like to add that we have also been experiencing this alert for the last few weeks.

Here are our details as per rgordill's post.

Operator:

oc get ClusterServiceVersion -n openshift-operators |grep kepler
kepler-operator.v0.9.2 Kepler 0.9.2 kepler-operator.v0.9.0 Succeeded

oc get nodes --no-headers |wc -l
6

oc get pods --no-headers -A |wc -l
6164

oc get pods -A -o jsonpath="{.items[*].spec['initContainers', 'containers'][*].name}" |tr -s '[[:space:]]' '\n' |wc -l
12007

Alert is clear and the same as above: ThanosRuleRuleEvaluationLatencyHigh Thanos Rule XXX.XXX.XXX.XXX:9092 in Namespace [openshift-user-workload-monitoring] has higher evaluation latency than interval for /thanos/data/.tmp-rules/ABORT/etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/openshift-kepler-operator-kepler-exporter-prom-rules.yaml;kepler.rules.

Is there any way to adjust the timeout value for the query to succeed, please?

sthaha commented 7 months ago

> Is there any way to adjust the timeout value for the query to succeed, please?

Unfortunately not. The rule is deployed by the operator, and any changes to the rules will be reverted by the operator.

@TitaniumBoy @rgordill, since the rule evaluation is taking so long, could you please help us understand how many time series are active?

What does prometheus_tsdb_head_series give you?
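
For example, something along these lines against the UWM Prometheus, reusing the thanos-querier route approach from the earlier sketch (the namespace selector is an assumption; adjust it to your cluster):

```console
$ TOKEN=$(oc whoami -t)
$ HOST=$(oc get route thanos-querier -n openshift-monitoring -o jsonpath='{.spec.host}')
$ curl -skG -H "Authorization: Bearer $TOKEN" "https://$HOST/api/v1/query" \
    --data-urlencode 'query=prometheus_tsdb_head_series{namespace="openshift-user-workload-monitoring"}'
```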

TitaniumBoy commented 6 months ago

Sorry for the late response. Here are the current values from prometheus_tsdb_head_series

prometheus_tsdb