kedacore/charts

Helm charts for KEDA
Apache License 2.0

Can you provide Prometheus alert rules? #297

Open · xgengsjc2021 opened 2 years ago

xgengsjc2021 commented 2 years ago

I found there are zero rules in values.yaml. Can you provide some rules for Prometheus monitoring purposes? Is the one below from values.yaml enough, or can you provide more?

# - alert: KedaScalerErrors
#   annotations:
#     description: Keda scaledObject {{ $labels.scaledObject }} is experiencing errors with {{ $labels.scaler }} scaler
#     summary: Keda Scaler {{ $labels.scaler }} Errors
#   expr: sum by (scaledObject, scaler) (rate(keda_metrics_adapter_scaler_errors[2m])) > 0
#   for: 2m
#   labels:

tomkerkhove commented 2 years ago

We don't actively add them given this is up to the end-user; we don't want to enforce things on end-users.

However, if you have suggestions, we do welcome PRs that add them, commented out.
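
For example, a PR could extend the commented block above with a rule for the ScaledObject-level error counter mentioned later in this thread. This is a sketch only: metric names and labels vary between KEDA versions (check your own /metrics output first), and the severity value is a placeholder.

# - alert: KedaScaledObjectErrors
#   annotations:
#     description: Keda ScaledObject {{ $labels.scaledObject }} is experiencing errors
#     summary: Keda ScaledObject {{ $labels.scaledObject }} Errors
#   expr: sum by (scaledObject) (rate(keda_metrics_adapter_scaled_object_error_totals[2m])) > 0
#   for: 2m
#   labels:
#     severity: critical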

xgengsjc2021 commented 2 years ago

@tomkerkhove Where can I see all of the metrics KEDA provides? I could not find any metrics on the KEDA website, so I don't know which metrics I can use to set up alerts.

tomkerkhove commented 2 years ago

The overview is available at https://keda.sh/docs/2.7/operate/prometheus/

xgengsjc2021 commented 2 years ago

@tomkerkhove Thanks. When I check http://127.0.0.1:9022/metrics, I only get one metric, while the document lists 4. Please see my screenshot.

[screenshot: /metrics output]

Why does KEDA only list one metric here? Btw, I am using the latest version of KEDA (2.7.2).

From the screenshot, you can see the metric name is keda_metrics_adapter_scaler_errors_total (in the document, the metric name is keda_metrics_adapter_scaler_error_totals). Querying my metric returns a result, but querying keda_metrics_adapter_scaler_error_totals returns nothing.

Besides this, I also tried to query the three other metrics below in Prometheus and did not get any result. (A check of the registered names follows the list.)

keda_metrics_adapter_scaled_object_error_totals
keda_metrics_adapter_scaler_errors
keda_metrics_adapter_scaler_metrics_value
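
To double-check which names my build actually registers (assuming the endpoint is still reachable at 127.0.0.1:9022 as above):

# Metric sample lines start with the metric name; HELP/TYPE comments start with '#'
curl -s http://127.0.0.1:9022/metrics | grep '^keda_'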

My KEDA settings for Prometheus:

prometheus:
  metricServer:
    enabled: true
    port: 9022
    portName: metrics
    path: /metrics
    podMonitor:
      # Enables PodMonitor creation for the Prometheus Operator
      enabled: true
      interval:
      scrapeTimeout:
      namespace: monitoring
      additionalLabels:
        release: kube-prometheus-stack
      relabelings: []
  operator:
    enabled: true
    port: 8080
    path: /metrics
    podMonitor:
      # Enables PodMonitor creation for the Prometheus Operator
      enabled: true
      interval:
      scrapeTimeout:
      namespace: monitoring
      additionalLabels:
        release: kube-prometheus-stack

Is there anything I need to modify? Please help me check. Thanks!
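
In the meantime, one check I can run is whether the PodMonitors were created with the release label that kube-prometheus-stack selects on (a sketch; the resource names are whatever the chart generated):

kubectl -n monitoring get podmonitors
kubectl -n monitoring get podmonitors -o yaml | grep 'release: kube-prometheus-stack'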

xgengsjc2021 commented 2 years ago

[screenshot]

xgengsjc2021 commented 2 years ago

@tomkerkhove Hi, do you have any thoughts on what I commented above?

tomkerkhove commented 2 years ago

This might be a silly question, but does your cluster have any ScaledObject resources? Because if it does not, then that might explain why they are missing.

(sorry for the slow response)

xgengsjc2021 commented 2 years ago

@tomkerkhove Thanks, but the reply confused me. I do have ScaledObject resources in my environment.

[screenshot: ScaledObject resources]

tomkerkhove commented 2 years ago

That is odd. Can this be related to https://github.com/kedacore/keda/issues/3554, @JorTurFer?

JorTurFer commented 2 years ago

I don't think so; in that issue the metric is registered with 0 as its value, but it is registered. You are checking the metrics server (not the operator) on port 9022, right?

xgengsjc2021 commented 2 years ago

@JorTurFer @tomkerkhove Thanks for the response. I did check the metrics server (not the operator). Btw, from the output below, in my keda namespace I only see one service, keda-operator-metrics-apiserver:

kubectl -n keda get svc
NAME                              TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                   AGE
keda-operator-metrics-apiserver   ClusterIP   172.20.76.175   <none>        443/TCP,80/TCP,9022/TCP   147d

Then I ran port-forward to check the metrics on the metrics-apiserver:

kubectl -n keda port-forward svc/keda-operator-metrics-apiserver 9022
Forwarding from 127.0.0.1:9022 -> 9022
Forwarding from [::1]:9022 -> 9022
Handling connection for 9022
Handling connection for 9022

JorTurFer commented 2 years ago

Hmm, weird... It's possible that after a metrics server restart the counters don't exist, because they are created on first access. Do you have more than 1 metrics server instance? If you restart the metrics server and wait a few minutes (with ScaledObjects present), do you still not get any metrics?

xgengsjc2021 commented 2 years ago

@JorTurFer I only have 1 metrics server pod, as you can see below:

kubectl -n keda get pods
NAME                                               READY   STATUS    RESTARTS   AGE
keda-operator-848b9f56f7-5szlp                     1/1     Running   0          27d
keda-operator-metrics-apiserver-5cb9fd7947-gv47p   1/1     Running   0          19m

Based on your suggestion, I restarted the metrics pod, then checked http://127.0.0.1:9022/metrics and got the same result:

# HELP keda_metrics_adapter_scaler_errors_total Total number of errors for all scalers
# TYPE keda_metrics_adapter_scaler_errors_total counter
keda_metrics_adapter_scaler_errors_total 0

JorTurFer commented 2 years ago

Really weird... Could you query some metric manually to ensure that at least one trigger is executed? I have just seen in your picture that you are using the CPU trigger; that trigger is processed by the Kubernetes metrics server (not by the KEDA metrics server), and that's why you can't see any other metric. Do you have any trigger which is not CPU/memory? Could you query it manually?
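
For reference, a manual query goes through the Kubernetes external metrics API; a sketch with placeholders to replace with your own values:

# Ask the KEDA metrics server for one metric, scoped to a ScaledObject
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/<namespace>/<metric-name>?labelSelector=scaledobject.keda.sh/name=<scaledobject-name>"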

xgengsjc2021 commented 2 years ago

@JorTurFer At this moment we only monitor CPU and memory; I don't have any other triggers for now. Actually, we are using the combination of CPU + memory together as the trigger in our environment.

Here is a question I hope you can answer: once the condition is matched in KEDA, will KEDA scale the pods up to the maximum number at once? I noticed one time that my CPU usage was not too high (though it was above the threshold), but it scaled up to the maximum number of pods immediately, which we don't like.

JorTurFer commented 2 years ago

> @JorTurFer At this moment we only monitor CPU and memory; I don't have any other triggers for now. Actually, we are using the combination of CPU + memory together as the trigger in our environment.

That's why you can't see any other metric: they haven't been generated yet because the KEDA metrics server hasn't received any query; all the requests go to the Kubernetes metrics server. When you use the CPU/memory scaler, KEDA basically creates a "regular" HPA that hits the "regular" metrics server (that's why the Kubernetes metrics server is needed).
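
You can see that generated HPA directly; a sketch (KEDA names it keda-hpa-<scaledobject-name>, and the namespace placeholder is wherever the ScaledObject lives):

kubectl get hpa -n <namespace>
kubectl describe hpa keda-hpa-<scaledobject-name> -n <namespace>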

> Here is a question I hope you can answer: once the condition is matched in KEDA, will KEDA scale the pods up to the maximum number at once? I noticed one time that my CPU usage was not too high (though it was above the threshold), but it scaled up to the maximum number of pods immediately, which we don't like.

KEDA creates the HPA and exposes the metrics (except CPU and memory), and it is the HPA controller that manages the autoscaling, so basically we don't change anything there. Why do you think that the CPU usage was low? I mean, do you have all the usage monitored? Another important thing is that the threshold is not a boundary; it's the desired value. I mean, the HPA controller will try to stay as close as possible to that value, not scale out/in automatically when the value changes. Remember also that the HPA controller is really aggressive scaling out and very conservative scaling in: a small peak can trigger a scale-out, and several minutes are needed to scale in. Using KEDA, you can customize this default behaviour using the advanced section in the ScaledObject.
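
For illustration, a minimal sketch of what that advanced section looks like (the names, replica counts, and policy numbers below are placeholders, not recommendations; the behavior block follows the standard HPA v2 scaling-policies schema):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: example-scaledobject        # placeholder name
spec:
  scaleTargetRef:
    name: example-deployment        # placeholder workload
  minReplicaCount: 2
  maxReplicaCount: 10
  triggers:
    - type: cpu
      metricType: Utilization
      metadata:
        value: "75"                 # desired average utilization, not a hard boundary
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 60
          policies:
            - type: Pods
              value: 2              # add at most 2 pods per period...
              periodSeconds: 60     # ...so a spike cannot jump straight to max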