Red-GV opened this issue 2 years ago
As this issue is dedicated to exploring autoscaling per path from a t-shirt size (e.g. `1x.small`) to a target (e.g. `1x.medium`), I would like to expand the discussion on what autoscaling based on t-shirt size could look like. The following focuses on the query path only for now.

Horizontal autoscaling of the query path from `1x.small` to `1x.medium` currently affects only the querier deployment. However, comparing the two t-shirt sizes, we get two different deployment specs regarding resource requests and replicas:
- `1x.small` requests 2 replicas vs. `1x.medium` requesting 3 replicas.
- `1x.small` requests 4 cores vs. `1x.medium` requesting 6 cores per replica.
- `1x.small` requests 4Gi vs. `1x.medium` requesting 10Gi per replica.

In addition, lifting the deployment spec implies a background configuration change: the `max_concurrent` setting moves from 4 to 6. This means the querier will spin up more frontend-processor goroutines to pick queries from the two query-frontends; in detail, each querier will dedicate 3 processors per frontend replica. (The per-size values are summarized in the sketch below.)

In summary, a literal t-shirt-size autoscaling means scaling in two ways at a time: first scale up resources, then scale out horizontally. This is not intuitive for users if we talk about horizontal autoscaling here. Thus we should provide a distinct description of what we mean by horizontal autoscaling of the query path from `1x.small` to `1x.medium`.
We have two options:
**Option 1: Scale out only**

In this case, autoscaling up to `1x.medium` means scaling out the querier deployment up to the max replica count defined by `1x.medium`. This stays in line with what horizontal autoscaling means, i.e. more replicas of the same deployment spec. In detail, the user experience looks like this:
- Each additional querier replica uses the `1x.small` spec. The total requests rise as expected, i.e. another 4 vCPUs and 4Gi of memory need to be available on a schedulable node.
- `max_concurrent` remains the same, which means we don't need to re-populate a new version of the config map.

**Option 2: Scale up, then scale out**

In this case, autoscaling consists of two distinct scaling steps: first, a vertical scale-up of the querier resources; second, a horizontal scale-out from 2 min replicas up to 3 max replicas. The user experience will look like this:
- `max_concurrent` requires a change from 4 to 6. This is required to leverage the new vertical scale, i.e. to turn the extra 2 vCPUs per querier into 50% more concurrency per querier.

In general, both options sufficiently describe what autoscaling of the query path from `1x.small` to `1x.medium` means. The user experience remains simple for both, but the caveats lie in the details. On the one hand, the first option stays on the orthodox path of horizontal pod autoscaling, meeting the user's expectation of a higher replica count of queriers of the same size. However, the user needs to be informed that `1x.medium` is only a soft target. On the other hand, the second option expresses a scale bump to `1x.medium` more literally. Nevertheless, it is not true horizontal scale-out, because it includes a vertical scale-up of CPU/memory resources first.
Considering the above, I would like to propose going with the first option, for a couple of reasons:

- As long as we talk about horizontal autoscaling towards a target size (i.e. `1x.medium`), we should keep the user experience as simple as possible. HPA means scale out, and the first option stays true to this statement.
- There is no need to change `max_concurrent`. This simplifies operations a lot: we still get more concurrency, but only through the new querier replica. The remaining replicas keep using the same max-concurrency setting for pulling query jobs from the frontend as before.

Based on my manual experiments with prometheus-adapter's custom/external metrics support, we have the following metrics at our disposal for HPA. I am going to list the prometheus-adapter config for demonstration below:
- `loki_request_duration_seconds_count`: This can be used on a per-pod level as a surrogate for an HTTP requests-per-second rate metric.
- `loki_querier_worker_inflight_queries`: This metric is still unreleased; as per #5124 it is only available in `main`. Nevertheless, the metric can be used on a querier-pod level, too, to inspect queue saturation.
- `cortex_query_scheduler_inflight_requests`: This metric is still unreleased; as per #5658 it is only available in `main`. It can be used as an external metric because it is a counter on a different component, i.e. the query scheduler. In turn, this requires introducing that component into the LokiStack deployment, too.

Examples:
```yaml
rules:
- seriesQuery: 'loki_request_duration_seconds_count{namespace!="",pod!="",route=~"loki_api_v1_query|loki_api_v1_query_range|loki_api_v1_label|loki_api_v1_label_name_values"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      service: {resource: "service"}
  name:
    as: "loki_query_requests_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
```
```yaml
rules:
- seriesQuery: 'loki_querier_worker_inflight_queries{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      service: {resource: "service"}
  name:
    as: "loki_query_path_saturation"
  metricsQuery: '<<.Series>>{<<.LabelMatchers>>} / loki_querier_worker_concurrency'
```
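To illustrate how the first option could consume such a custom metric, here is a minimal HPA sketch for the querier deployment. The deployment name, namespace, and target value are assumptions for the example; the min/max replicas follow the `1x.small`/`1x.medium` counts discussed above:

```yaml
# Sketch only: an autoscaling/v2beta2 HPA consuming the custom metric exposed above.
# Deployment name, namespace, and target value are assumptions for illustration.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: lokistack-sample-querier
  namespace: lokistack-dev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lokistack-sample-querier
  minReplicas: 2            # 1x.small replica count
  maxReplicas: 3            # 1x.medium replica count (soft target, see option 1)
  metrics:
  - type: Pods
    pods:
      metric:
        name: loki_query_requests_per_second
      target:
        type: AverageValue
        averageValue: "10"  # assumed threshold, to be tuned
```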
(cc @xperimental @dannykopping @Red-GV)
Based on my manual experiments with prometheus-adapter's custom/external metrics support, we have the following metrics at our disposal for HPA on the ingest path. I am going to list the prometheus-adapter config for demonstration below:

- `loki_request_duration_seconds_count`: This can be used on a per-pod level as a surrogate for an HTTP requests-per-second rate metric.
- `loki_inflight_requests`: This metric can be used to inspect the distributor saturation.

Examples:
```yaml
rules:
- seriesQuery: 'loki_request_duration_seconds_count{namespace!="",pod!="",route=~"loki_api_v1_push"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      service: {resource: "service"}
  name:
    as: "loki_ingest_requests_per_second"
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[1m])) by (<<.GroupBy>>)'
```
```yaml
rules:
- seriesQuery: 'loki_inflight_requests{namespace!="",pod!="",route="loki_api_v1_push"}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
      service: {resource: "service"}
  name:
    as: "loki_ingest_path_saturation"
  metricsQuery: '<<.Series>>{<<.LabelMatchers>>} / cortex_ring_members{name="ingester",state="ACTIVE"}'
```
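The ingest path could be wired up analogously; a minimal sketch targeting the distributor deployment (deployment name, namespace, replica bounds, and threshold are assumptions):

```yaml
# Sketch only: an analogous HPA for the ingest path, consuming the saturation metric above.
# Deployment name, namespace, replica bounds, and target value are assumptions.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: lokistack-sample-distributor
  namespace: lokistack-dev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lokistack-sample-distributor
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: loki_ingest_path_saturation
      target:
        type: AverageValue
        averageValue: "0.75"  # assumed threshold, to be tuned
```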
**Is your feature request related to a problem? Please describe.**
During times of traffic surge, the Loki Operator requires manual intervention in order to scale the Loki cluster up and down.

**Describe the solution you'd like**
The Loki Operator should be able to scale the Loki cluster up and down through the use of the Kubernetes Horizontal Pod Autoscaler (HPA). The `autoscaling/v2beta2` HPA API has been available since Kubernetes v1.12 (and is the recommended version before v1.23).
**Additional context**
This feature would be enabled via an optional feature flag: `--enabled-horizontal-autoscaling`
The Loki Operator API would be expanded to allow the human operator to control how many pods may be added to or removed from the cluster by the scaling operations. The maximum number of replicas the HPA will spin up is determined by the max number of replicas supported by the largest size (`1x.medium`). The minimum follows a similar path (`1x.small`).

Since the HPA would affect the `ingester` and `querier` objects, the `replica` field will not be updated while the HPA is in use, to prevent conflicts during the reconcile loop. The `flushOnShutdown` field in the `ingester_config` should be set to true to ensure that the `ingester` flushes all chunks in memory and the Write-Ahead Log (WAL) directory before being removed from the cluster. As a `deployment`, the querier should not require any special procedure to scale up or down.

The HPA can be configured to scale up the number of pods when the average utilization of the pods' memory has reached 70%. In testing, the HPA actually scales the cluster up about 10% above the target value (80%), and scales down in a similar fashion (60%). Stabilization windows will be set for both behaviors to ensure that the HPA avoids flapping.
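As an illustration of that policy, a memory-based HPA with stabilization windows could look roughly like this. The deployment name, namespace, and window lengths are assumptions for the sketch; the 70% target follows the description above:

```yaml
# Sketch only: memory-utilization-based HPA with stabilization windows.
# Names and window lengths are assumptions; the 70% target follows the description above.
apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: lokistack-sample-querier
  namespace: lokistack-dev
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lokistack-sample-querier
  minReplicas: 2   # 1x.small replica count
  maxReplicas: 3   # 1x.medium replica count
  metrics:
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # assumed; smooths out short spikes
    scaleDown:
      stabilizationWindowSeconds: 300   # assumed
```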
**Suggested API Changes**
**Issues**

- Kubernetes raises an alert when the HPA has been running at the `maxReplica` count for over 15 minutes. Since there is only a small difference in the replica count between `1x.small` and `1x.medium`, this alert can trigger easily during high-load times. To avoid this, it might be better to let the `maxReplica` be the `replica` value from the `LokiComponentSpec` if it is larger than the default max.