Red-GV opened this issue 2 years ago
As this issue is dedicated to exploring autoscaling per path from a t-shirt size (e.g. `1x.small`) to a target (e.g. `1x.medium`), I would like to expand the discussion on what autoscaling based on t-shirt size could look like. The following focuses on the query path only for now.

Horizontal autoscaling of the query path from `1x.small` to `1x.medium` currently affects only the querier deployment. However, comparing the two t-shirt sizes, we get two different deployment specs regarding resource requests and replicas:
- `1x.small` requests 2 replicas vs. `1x.medium` requesting 3 replicas.
- `1x.small` requests 4 cores vs. `1x.medium` requesting 6 cores per replica.
- `1x.small` requests 4Gi vs. `1x.medium` requesting 10Gi per replica.

In addition, lifting the deployment spec implies a background configuration change: the `max_concurrent` setting moves from 4 to 6. This means the querier will spin up more frontend-processor goroutines to pick queries from the two query-frontends; in detail, each querier will dedicate 3 processors per frontend replica. (The per-size values are summarized in the sketch below.)

In summary, a literal t-shirt-size autoscaling means scaling in two ways at a time: first scale up resources, then scale out horizontally. This is not intuitive for users if we talk about horizontal autoscaling here. Thus we should provide a distinct description of what we mean by horizontal autoscaling of the query path from `1x.small` to `1x.medium`.
We have two options:
**Option 1: Scale out only**

In this case, autoscaling up to `1x.medium` means scaling out the querier deployment up to the max replica count defined by `1x.medium`. This stays in line with what horizontal autoscaling means, i.e. more replicas of the same deployment spec. In detail, the user experience looks like this:
- Each additional querier replica uses the `1x.small` spec. The total requests rise as expected, i.e. another 4 vCPUs and 4Gi of memory need to be available on a schedulable node.
- `max_concurrent` remains the same, which means we don't need to re-populate a new version of the config map.

**Option 2: Scale up, then scale out**

In this case, autoscaling consists of two distinct scaling steps: first, a vertical scale-up of the querier resources; second, a horizontal scale-out from 2 min replicas up to 3 max replicas. The user experience will look like this:
- `max_concurrent` requires a change from 4 to 6. This is required to leverage the new vertical scale, i.e. to turn the extra 2 vCPUs per querier into 50% more concurrency per querier.

In general, both options sufficiently describe what autoscaling of the query path from `1x.small` to `1x.medium` means. The user experience remains simple for both, but the caveats lie in the details. On the one hand, the first option stays on the orthodox path of horizontal pod autoscaling, meeting the user's expectation of a higher replica count of queriers of the same size. However, the user needs to be informed that `1x.medium` is only a soft target. On the other hand, the second option expresses a scale bump to `1x.medium` more literally. Nevertheless, it is not true horizontal scale-out, because it includes a vertical scale-up of CPU/memory resources first.
Considering the above, I would like to propose going with the first option, for a couple of reasons:

- As long as we talk about horizontal autoscaling towards a target size (i.e. `1x.medium`), we should keep the user experience as simple as possible. HPA means scale out, and the first option stays true to this statement.
- There is no need to change `max_concurrent`. This simplifies operations a lot: we still get more concurrency, but only through the new querier replica. The remaining replicas keep using the same max-concurrency setting for pulling query jobs from the frontend as before.

Based on my manual experiments with prometheus-adapter's custom/external metrics support, we have the following metrics at our disposal for HPA. I am going to list the prometheus-adapter config for demonstration below:
- `loki_request_duration_seconds_count`: This can be used on a per-pod level as a surrogate for an HTTP requests-per-second rate metric.
- `loki_querier_worker_inflight_queries`: This metric is still unreleased; as per #5124 it is only available in `main`. Nevertheless, the metric can be used on a querier-pod level, too, to inspect queue saturation.
- `cortex_query_scheduler_inflight_requests`: This metric is still unreleased; as per #5658 it is only available in `main`. It can be used as an external metric because it is a counter on a different component, i.e. the query scheduler. In turn, this requires introducing that component into the LokiStack deployment, too.

Examples:
```yaml
rules:
- seriesQuery: 'loki_request_duration_seconds_count{namespace!="",pod!="",route=~"loki_api_v1_query|loki_api_v1_query_range|loki_api_v1_label|loki_api_v1_label_name_values"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      service: {resource: "service"}
  name:
    as: "loki_query_requests_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
```
```yaml
rules:
- seriesQuery: 'loki_querier_worker_inflight_queries{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      service: {resource: "service"}
  name:
    as: "loki_query_path_saturation"
  metricsQuery: '<<.Series>>{<<.LabelMatchers>>} / loki_querier_worker_concurrency'
```
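To illustrate how the first option could consume such a custom metric, here is a minimal HPA sketch for the querier deployment. The deployment name, namespace, and target value are assumptions for the example; the min/max replicas follow the `1x.small`/`1x.medium` counts discussed above:

```yaml
# Sketch only: an autoscaling/v2beta2 HPA consuming the custom metric exposed above.
# Deployment name, namespace, and target value are assumptions for illustration.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: lokistack-sample-querier
  namespace: lokistack-dev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lokistack-sample-querier
  minReplicas: 2            # 1x.small replica count
  maxReplicas: 3            # 1x.medium replica count (soft target, see option 1)
  metrics:
  - type: Pods
    pods:
      metric:
        name: loki_query_requests_per_second
      target:
        type: AverageValue
        averageValue: "10"  # assumed threshold, to be tuned
```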
(cc @xperimental @dannykopping @Red-GV)
Based on my manual experiments with prometheus-adapter's custom/external metrics support, we have the following metrics at our disposal for HPA on the ingest path. I am going to list the prometheus-adapter config for demonstration below:

- `loki_request_duration_seconds_count`: This can be used on a per-pod level as a surrogate for an HTTP requests-per-second rate metric.
- `loki_inflight_requests`: This metric can be used to inspect the distributor saturation.

Examples:
```yaml
rules:
- seriesQuery: 'loki_request_duration_seconds_count{namespace!="",pod!="",route=~"loki_api_v1_push"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      service: {resource: "service"}
  name:
    as: "loki_ingest_requests_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
```
```yaml
rules:
- seriesQuery: 'loki_inflight_requests{namespace!="",pod!="",route="loki_api_v1_push"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      service: {resource: "service"}
  name:
    as: "loki_ingest_path_saturation"
  metricsQuery: '<<.Series>>{<<.LabelMatchers>>} / cortex_ring_members{name="ingester",state="ACTIVE"}'
```
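The ingest path could be wired up analogously; a minimal sketch targeting the distributor deployment (deployment name, namespace, replica bounds, and threshold are assumptions):

```yaml
# Sketch only: an analogous HPA for the ingest path, consuming the saturation metric above.
# Deployment name, namespace, replica bounds, and target value are assumptions.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: lokistack-sample-distributor
  namespace: lokistack-dev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lokistack-sample-distributor
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: loki_ingest_path_saturation
      target:
        type: AverageValue
        averageValue: "0.75"  # assumed threshold, to be tuned
```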
**Is your feature request related to a problem? Please describe.**
During times of traffic surge, the Loki Operator requires manual intervention in order to scale the Loki cluster up and down.

**Describe the solution you'd like**
The Loki Operator should be able to scale the Loki cluster up and down through the use of the Kubernetes Horizontal Pod Autoscaler (HPA). The `autoscaling/v2beta2` HPA API has been available since Kubernetes v1.12 (and is the recommended version before v1.23).
**Additional context**
This feature would be enabled via an optional feature flag: `--enabled-horizontal-autoscaling`
The Loki Operator API would be expanded to allow the human operator to control how many pods may be added to or removed from the cluster by the scaling operations. The maximum number of replicas the HPA will spin up is determined by the max number of replicas supported by the largest size (`1x.medium`). The minimum follows a similar path (`1x.small`).

Since the HPA would affect the `ingester` and `querier` objects, the `replica` field will not be updated while the HPA is in use, to prevent conflicts during the reconcile loop. The `flushOnShutdown` field in the `ingester_config` should be set to true to ensure that the `ingester` flushes all chunks in memory and the Write-Ahead Log (WAL) directory before being removed from the cluster. As a `deployment`, the querier should not require any special procedure to scale up or down.

The HPA can be configured to scale up the number of pods when the average utilization of the pods' memory has reached 70%. In testing, the HPA actually scales the cluster up about 10% above the target value (80%), and scales down in a similar fashion (60%). Stabilization windows will be set for both behaviors to ensure that the HPA avoids flapping.
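As an illustration of that policy, a memory-based HPA with stabilization windows could look roughly like this. The deployment name, namespace, and window lengths are assumptions for the sketch; the 70% target follows the description above:

```yaml
# Sketch only: memory-utilization-based HPA with stabilization windows.
# Names and window lengths are assumptions; the 70% target follows the description above.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: lokistack-sample-querier
  namespace: lokistack-dev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lokistack-sample-querier
  minReplicas: 2   # 1x.small replica count
  maxReplicas: 3   # 1x.medium replica count
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # assumed; smooths out short spikes
    scaleDown:
      stabilizationWindowSeconds: 300   # assumed
```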
**Suggested API Changes**
**Issues**

- Kubernetes raises an alert when the HPA has been running at the `maxReplica` count for over 15 minutes. Since there is only a small difference in the replica count between `1x.small` and `1x.medium`, this alert can trigger easily during high-load times. To avoid this, it might be better to let the `maxReplica` be the `replica` value from the `LokiComponentSpec` if it is larger than the default max.