grafana / helm-charts

[tempo-distributed] fix: add autoscaling for tempo-distributed metrics-generator #3430

Open msvechla opened 1 week ago

msvechla commented 1 week ago

This adds autoscaling via HPA and KEDA for the tempo-distributed metrics-generator. The implementation is analogous to the existing autoscaling options for other components, e.g. the compactor.

HPA Example

helm template tempo . --set metricsGenerator.enabled=true --set metricsGenerator.autoscaling.enabled=true --set metricsGenerator.autoscaling.hpa.enabled=true --show-only templates/metrics-generator/hpa.yaml
---
# Source: tempo-distributed/templates/metrics-generator/hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tempo-metrics-generator
  namespace: default
  labels:
    helm.sh/chart: tempo-distributed-1.22.0
    app.kubernetes.io/name: tempo
    app.kubernetes.io/instance: tempo
    app.kubernetes.io/component: metrics-generator
    app.kubernetes.io/version: "2.6.0"
    app.kubernetes.io/managed-by: Helm
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tempo-metrics-generator
  minReplicas: 1
  maxReplicas: 3
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 100
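
The same three flags can also be kept in a values file instead of being passed via `--set`. The following is a minimal sketch that assumes only the keys used in the command above; the file name `values-hpa.yaml` is hypothetical, and any further tuning keys are omitted because they are not shown here.

```yaml
# values-hpa.yaml (hypothetical file name)
# Mirrors the --set flags from the HPA example; no other keys are assumed.
metricsGenerator:
  enabled: true        # deploy the metrics-generator component
  autoscaling:
    enabled: true      # turn on autoscaling for the metrics-generator
    hpa:
      enabled: true    # render the HorizontalPodAutoscaler template
```

Rendering with `helm template tempo . -f values-hpa.yaml --show-only templates/metrics-generator/hpa.yaml` should then be equivalent to the command above.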

KEDA Example

helm template tempo . --set metricsGenerator.enabled=true --set metricsGenerator.autoscaling.enabled=true <...> --show-only templates/metrics-generator/keda-scaled-object.yaml
# Source: tempo-distributed/templates/metrics-generator/keda-scaled-object.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: tempo-metrics-generator
  namespace: default
  labels:
    helm.sh/chart: tempo-distributed-1.22.0
    app.kubernetes.io/name: tempo
    app.kubernetes.io/instance: tempo
    app.kubernetes.io/component: metrics-generator
    app.kubernetes.io/version: "2.6.0"
    app.kubernetes.io/managed-by: Helm
spec:
  minReplicaCount: 1
  maxReplicaCount: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tempo-metrics-generator
  triggers:
  - type: "prometheus"
    metadata:
      serverAddress: http://<prometheus-host>:9090
      threshold: "250"
      query: |
        sum(prometheus_remote_storage_shards_desired{job="default/metrics-generator"} /
        prometheus_remote_storage_shards_max{job="default/metrics-generator"})by(job)

Let me know if this can be merged or if further adjustments are required. Thanks!

msvechla commented 4 days ago

I rebased the branch.