grafana / helm-charts

[loki-distributed] Add configurable scaling behaviour and KEDA autoscaler #2126

Open zaldnoay opened 1 year ago

zaldnoay commented 1 year ago

Loki's documentation recommends using KEDA to autoscale the querier based on Prometheus metrics. In addition, the default HPA scaling behaviour reacts too frequently for Loki's components. I suggest adding a configurable scaling behaviour to the values and templates to make deployments more stable and flexible. Here are some examples I wrote:

values.yaml:

querier:
  autoscaling:
    scaler: native # native or keda
    behavior: {}
    # Configure KEDA Prometheus trigger.
    # See also: https://keda.sh/docs/latest/scalers/prometheus/
    targetMetricsConfigure:
      query: sum(max_over_time(cortex_query_scheduler_inflight_requests{namespace="loki-cluster", quantile="0.75"}[2m]))
      serverAddress: http://prometheus.default:9090/prometheus
      threshold: 4
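
For illustration, here is a sketch of what a user might put in the proposed `behavior` value when using the native scaler. The field names come from the Kubernetes `autoscaling/v2` HPA API; the numbers are placeholders chosen for the example, not tested recommendations for Loki.

```yaml
querier:
  autoscaling:
    enabled: true
    scaler: native
    behavior:
      scaleUp:
        # Wait 60s of sustained load before adding pods (illustrative value).
        stabilizationWindowSeconds: 60
        policies:
          # Add at most 2 pods per minute.
          - type: Pods
            value: 2
            periodSeconds: 60
      scaleDown:
        # Use the highest recommendation of the last 5 minutes before removing pods.
        stabilizationWindowSeconds: 300
        policies:
          # Remove at most 10% of the current pods every 2 minutes.
          - type: Percent
            value: 10
            periodSeconds: 120
```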

templates:

# hpa.yaml
{{- if .Values.querier.autoscaling.enabled }}
{{- if eq .Values.querier.autoscaling.scaler "native" }}
{{- $apiVersion := include "loki.hpa.apiVersion" . -}}
apiVersion: {{ $apiVersion }}
kind: HorizontalPodAutoscaler
# ...
spec:
# ...
  {{- if (eq $apiVersion "autoscaling/v2") }}
  {{- with .Values.querier.autoscaling.behavior }}
  behavior:
    {{- toYaml . | nindent 4 }}
  {{- end }}
  {{- end }}
{{- else if eq .Values.querier.autoscaling.scaler "keda" }}
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
# ...
spec:
# ...
  {{- with .Values.querier.autoscaling.behavior }}
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        {{- toYaml . | nindent 8 }}
  {{- end }}
  triggers:
    {{- with .Values.querier.autoscaling.targetCPUUtilizationPercentage }}
    - type: cpu
      metricType: Utilization
      metadata:
        value: {{ . | quote }}
    {{- end }}
    # ...
    {{- with .Values.querier.autoscaling.targetMetricsConfigure }}
    - metadata:
        metricName: querier_autoscaling_metric
        query: {{ .query | quote }}
        serverAddress: {{ .serverAddress | quote }}
        threshold: {{ .threshold | quote }}
      type: prometheus
    {{- end }}
{{- end }}
{{- end }}
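
To make the KEDA branch easier to picture, here is a hand-written `ScaledObject` roughly equivalent to what the template above is aiming for. It is only a sketch: the resource name, the target Deployment name, the replica bounds and the `scaleDown` window are assumptions made for the example, while the trigger settings are copied from the values shown earlier. KEDA trigger metadata values are strings, so the threshold is quoted.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: loki-querier              # hypothetical name
  namespace: loki-cluster
spec:
  scaleTargetRef:
    name: loki-querier            # hypothetical Deployment name
  minReplicaCount: 2              # illustrative bounds
  maxReplicaCount: 10
  advanced:
    horizontalPodAutoscalerConfig:
      # Passed through to the HPA that KEDA manages.
      behavior:
        scaleDown:
          stabilizationWindowSeconds: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.default:9090/prometheus
        query: 'sum(max_over_time(cortex_query_scheduler_inflight_requests{namespace="loki-cluster", quantile="0.75"}[2m]))'
        threshold: "4"
```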

Questions are welcome.

KEDA documentation: https://keda.sh/docs/latest/concepts/scaling-deployments/

Kubernetes HPA documentation: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#default-behavior

jfusterm commented 1 year ago

We had exactly the same issue.

We wanted Loki to scale up and down more steadily by tuning both the behavior.scaleUp and behavior.scaleDown policies, but that isn't possible with the HPA resources the chart provides, so we rolled out our own manifests on top of the chart.

One problem we ran into: unless we enable the chart's HPA with autoscaling.enabled: true, which we don't want to do because we use our own HPA manifests, the chart always sets the replicas of each component.

spec:
{{- if not .Values.distributor.autoscaling.enabled }}
  replicas: {{ .Values.distributor.replicas }}
{{- end }}

That's a problem when using a GitOps operator like Argo CD: once the HPA tries to scale, Argo CD reconciles the state back to whatever value the replicas option holds, which prevents any scale-up.

We worked around it by ignoring that field in Argo CD, but it would be nice to be able to use custom HPA configurations or KEDA objects and still avoid defining replicas in the templates.

    ignoreDifferences:
      - group: apps
        kind: Deployment
        name: loki-distributor
        namespace: loki
        jsonPointers:
          - /spec/replicas
      - group: apps
        kind: StatefulSet
        name: loki-ingester
        namespace: loki
        jsonPointers:
          - /spec/replicas
      - group: apps
        kind: Deployment
        name: loki-querier
        namespace: loki
        jsonPointers:
          - /spec/replicas
      - group: apps
        kind: Deployment
        name: loki-query-frontend
        namespace: loki
        jsonPointers:
          - /spec/replicas
    syncPolicy:
      syncOptions:
        - RespectIgnoreDifferences=true
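
One possible way to support this in the chart, sketched here as an assumption rather than something the chart does today, is to also skip the replicas field when the value is explicitly set to null. Helm removes a key when user values override it with `null`, and Sprig's `kindIs "invalid"` is true for a missing value, so the Deployment would render without `spec.replicas` and an external HPA or KEDA object could own the replica count.

```yaml
# Hypothetical variant of the distributor Deployment template.
spec:
{{- if and (not .Values.distributor.autoscaling.enabled) (not (kindIs "invalid" .Values.distributor.replicas)) }}
  replicas: {{ .Values.distributor.replicas }}
{{- end }}
```

With a template like this, user values such as the following would leave the replica count unmanaged by the chart:

```yaml
distributor:
  replicas: null        # omit spec.replicas from the rendered Deployment
  autoscaling:
    enabled: false      # the chart's own HPA stays disabled
```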