grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

helm: autoscaling migration procedure #7367

Open dimitarvdimitrov opened 9 months ago

dimitarvdimitrov commented 9 months ago

https://github.com/grafana/mimir/pull/7282 added autoscaling to Helm as an experimental feature. This issue is about adding support in the helm chart for a smooth migration and adding documentation for the migration.

Why do we need a migration?

Migrating to a Mimir cluster with autoscaling requires a few intermediate steps to ensure that there are no disruptions to traffic. The major risk is that enabling autoscaling also removes the replicas field from Deployments. If KEDA/HPA hasn't started autoscaling the Deployment yet, Kubernetes interprets a missing replicas field as 1 replica, which can cause an outage.
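
A quick way to confirm this default from the command line is `kubectl explain`, which prints the API documentation for the field:

```bash
# The field documentation notes that replicas "Defaults to 1" when unspecified.
kubectl explain deployment.spec.replicas
```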

Migration in a nutshell

  1. Support a `distributor.kedaAutoscaling.preserveReplicas: true` field in the helm chart which doesn't delete the replicas field from the rendered manifests (https://github.com/grafana/mimir/pull/7431)
  2. Set `preserveReplicas: true` and deploy the chart (see the example commands after this list).
  3. Wait a couple of minutes for the HPA to start running (perhaps there's a concrete way to make sure everything is set up properly?)
  4. Remove `preserveReplicas` from values.yaml and deploy the chart.
  5. Done
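
As an illustration, the steps above map to roughly the following commands. This is a minimal sketch, assuming a release named `mimir` installed from the `grafana/mimir-distributed` chart into namespace `mimir`, with a `values.yaml` that enables `distributor.kedaAutoscaling`; adjust names and values to your setup.

```bash
# Step 2: deploy with preserveReplicas enabled so the rendered Deployment
# keeps its replicas field while KEDA/the HPA comes up.
helm upgrade mimir grafana/mimir-distributed -n mimir -f values.yaml \
  --set distributor.kedaAutoscaling.preserveReplicas=true

# Step 3: wait until the HPA created by KEDA reports metrics
# (the TARGETS column should show values instead of <unknown>).
kubectl get hpa -n mimir

# Step 4: once the HPA is active, deploy again without preserveReplicas so
# the replicas field is dropped from the rendered manifests.
helm upgrade mimir grafana/mimir-distributed -n mimir -f values.yaml
```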

Internal docs

I'm also pasting Grafana Labs-internal documentation that's specific to our deployment tooling with FluxCD. Perhaps it can be used by folks running FluxCD or as a starting point for proper docs:

Autoscaling existing deployments

Who should use this and when

This tooling and documentation describes how to transition from a manually set number of replicas in a Kubernetes Deployment to using a Horizontal Pod Autoscaler (HPA). It specifically highlights a pitfall of the migration and how to work around it.

Not covered in this documentation is how to implement the HPA definition. However, some examples can be found in existing jsonnet.

Problem

When autoscaling an existing Kubernetes Deployment, there is a conflict between the state fluxcd wants to reconcile and the replica count the HPA has set.

Because flux continuously reconciles the desired state, we want to leave replicas undefined in the spec. Otherwise flux resets the number of replicas back to the supplied value, in conflict with whatever the autoscaler has set. This causes a continuous loop: the HPA scales up or down, flux resets the count, the HPA scales again, and so on.

Removing the replicas from the spec is straightforward enough. However, when the updated spec is applied, Kubernetes notices the now-missing field and applies the default of 1 replica*. Unless the HPA has already scaled the deployment, the next flux reconciliation that removes `.spec.replicas` scales the deployment down to 1. Sometimes this goes unnoticed because the HPA then kicks in and scales back up, but in other cases it can cause disruptions.

*This is not actually a bug, despite the link; it is intended behaviour.

Solution

The goal is to be able to roll out autoscalers without the Deployment ever being reset to 1 replica (even briefly). We also don't want to maintain state that conflicts with reality (i.e. we still want our jsonnet to be truthful and thus remove the replicas from the spec).

What happens internally in Kubernetes comes down to server-side apply and field ownership. The TL;DR for us: tell Kubernetes that the HPA owns the replicas field, so that it takes no action when it sees replicas disappear from the spec applied by flux.

We have observed that if the HPA modifies the number of replicas, it takes ownership of the `.spec.replicas` field, after which it is safe to remove the replicas from our jsonnet definitions. However, if the HPA does not make any modifications to the replicas (for example, when the correct number of replicas already exists), or if you want to remove the jsonnet replicas definition sooner, we need to tell Kubernetes who is managing the field.

We do this by removing the replicas from flux's managedFields metadata on the Kubernetes object. This makes Kubernetes think that the replicas were never part of the spec (at least not from flux), so the next time flux applies the spec without replicas defined, it does not default back to 1 (because the HPA is now the only owner of the field).
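
For example, a way to check which manager currently owns `.spec.replicas` (a sketch assuming a Deployment named `distributor` in namespace `mimir`):

```bash
# List the field managers that claim ownership of .spec.replicas.
# Before the migration this is typically kustomize-controller (flux);
# once the HPA has taken over, flux should no longer appear here.
kubectl get deployment distributor -n mimir -o json --show-managed-fields=true \
  | jq -r '.metadata.managedFields[]
           | select(.fieldsV1."f:spec"."f:replicas" != null)
           | .manager'
```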

The helper script remove_managed_replicas.sh assists with editing this metadata.

Process for switching to autoscalers

The process for switching an existing Deployment to use an autoscaler is as follows:

  1. Add a flux ignore to the object. For example: ./scripts/flux/ignore.sh --context dev-us-east-0 --namespace cortex-dev-05 --kind Deployment --reason "Transitioning to HPA".
  2. Open and merge a PR with the change that installs the HPA (potentially via KEDA), as well as removes the .spec.replicas from the object.
  3. Remove the replicas from flux's managedFields metadata using remove_managed_replicas.sh (see below). If the HPA has already scaled the object (in either direction), the field will already be correctly updated. Either way, it is still safe to run the script which will confirm this for you as a no-op.
  4. Tidy up by removing the flux ignore on the object.
  5. When flux reconciles afterwards it should be a no-op, as `.spec.replicas` is not defined and the spec is otherwise unchanged from the object.
  6. Verify that flux has run and has not defaulted the Deployment back to 1 replica.
  7. Verify that the HPA is deployed and working as expected (see the verification sketch after this list).
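
A hedged sketch of the verification in steps 6 and 7, reusing the `dev-eu-west-2` context, `mimir-dev-10` namespace and `Deployment/distributor` from the example further below:

```bash
# Step 6: confirm the Deployment was not reset to 1 replica.
kubectl --context dev-eu-west-2 -n mimir-dev-10 get deployment distributor \
  -o jsonpath='{.spec.replicas}{"\n"}'

# Step 7: confirm the HPA exists and is reporting metrics
# (the TARGETS column should not show <unknown>).
kubectl --context dev-eu-west-2 -n mimir-dev-10 get hpa
```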

How to use remove_managed_replicas.sh

```
Usage: ./remove_managed_replicas.sh [ -c | --context ]
                                    [ -n | --namespace ]
                                    [ -o | --object ]
                                    [ -d | --dry-run ]
                                    [ -h | --help  ]
Outputs a diff of changes made to the object.
```

Example:

```bash
./remove_managed_replicas.sh -c dev-eu-west-2 -n mimir-dev-10 -o Deployment/distributor --dry-run
```

The script will output a diff of the changes made to the object. (If --dry-run is set, no actual changes are applied).

remove_managed_replicas.sh

```bash
#!/usr/bin/env bash

set -euo pipefail

help() {
  echo "Usage: ./remove_managed_replicas.sh [ -c | --context ]
                                    [ -n | --namespace ]
                                    [ -o | --object ]
                                    [ -d | --dry-run ]
                                    [ -h | --help  ]
Outputs a diff of changes made to the object.
"
  exit 2
}

VALID_ARGUMENTS=$#
if [ "$VALID_ARGUMENTS" -eq 0 ]; then
  help
fi

CONTEXT=""
NAMESPACE=""
OBJECT=""
DRY_RUN=false
DRY_RUN_ARG=""

while [ "$#" -gt 0 ]
do
  case "$1" in
    -c | --context )
      CONTEXT="--context=${2}"
      shift 2
      ;;
    -n | --namespace )
      NAMESPACE="$2"
      shift 2
      ;;
    -o | --object )
      OBJECT="$2"
      shift 2
      ;;
    -d | --dry-run )
      DRY_RUN=true
      DRY_RUN_ARG="--dry-run=server"
      shift 1
      ;;
    -h | --help)
      help
      ;;
    --)
      shift
      break
      ;;
    *)
      echo "Unexpected option: ${1}"
      help
      ;;
  esac
done

if [ -z "${NAMESPACE}" ]
then
  echo "Must supply a namespace."
  exit 1
fi

if [ -z "${OBJECT}" ]
then
  echo "Must supply a kubernetes object (such as \`-o Deployment/example\`)."
  exit 1
fi

KC="kubectl ${CONTEXT} --namespace=${NAMESPACE}"

BEFORE=$(${KC} get "${OBJECT}" -o yaml --show-managed-fields=true)
BEFORE_JSON=$(${KC} get "${OBJECT}" -o json --show-managed-fields=true)

INDEX=$(echo "${BEFORE_JSON}" | jq '.metadata.managedFields | map(.manager == "kustomize-controller") | index(true)')

# Check we can find the position of flux's entry in managedFields:
if ! [[ $INDEX =~ ^[0-9]+$ ]]
then
  echo "Unable to find \`kustomize-controller\` (flux) in the managedFields metadata for ${OBJECT}."
  echo "Has flux not run on this object before?"
  echo "This may happen if you have deployed the object manually."
  echo "It should be safe to continue removing the flux-ignore as the object was never managed by flux."
  exit 1
fi

# Check that `.spec.replicas` is set in the managedFields entry
CHECK=$(echo "${BEFORE_JSON}" | jq ".metadata.managedFields[${INDEX}].fieldsV1.\"f:spec\".\"f:replicas\"")
if [ "${CHECK}" = "null" ]
then
  echo "Unable to find \`.spec.replicas\` set in the managedFields metadata for \`kustomize-controller\`."
  echo "Has the field already been unset?"
  echo "This may happen if the HPA has already scaled the object."
  echo "It is safe to continue removing the flux-ignore on the object."
  exit 1
fi

AFTER=$(${KC} patch "${OBJECT}" -o yaml --show-managed-fields=true ${DRY_RUN_ARG} --type='json' -p "[{'op': 'remove', 'path': '/metadata/managedFields/${INDEX}/fieldsV1/f:spec/f:replicas'}]")

diff -u <(echo "${BEFORE}") <(echo "${AFTER}") || :

if [ "${DRY_RUN}" = true ]
then
  echo ""
  echo "Dry run only. No changes have been applied."
fi

exit 0
```

dimitarvdimitrov commented 9 months ago

@beatkind tagging you because I don't think I can assign you without you participating in the issue. Can you assign yourself by any chance? From the GitHub docs:

> You can assign multiple people to each issue or pull request, including yourself, anyone who has commented on the issue or pull request, anyone with write permissions to the repository, and organization members with read permissions to the repository. For more information, see "Access permissions on GitHub."

beatkind commented 9 months ago

@dimitarvdimitrov And I need to actively write here :) to be participating - nope, I am not able to assign myself because I do not have any permissions inside the repo

dimitarvdimitrov commented 8 months ago

@beatkind I saw you marked this as fixed by https://github.com/grafana/mimir/pull/7431. I'd like to keep this issue open until we also document the migration procedure. Migration is now technically possible, but it's not a very user-friendly process because users need to figure out the steps themselves.

beatkind commented 8 months ago

@dimitarvdimitrov thanks for reopening, this was simply a mistake :) - I will add some documentation with my next PR

KyriosGN0 commented 4 months ago

hey @beatkind @dimitarvdimitrov, I just tried to follow the "Migration in a nutshell" guide and found 2 issues:

  1. after I removed preserveReplicas, my distributor, querier and query-frontend (we don't run rulers) all got scaled to 1
  2. after a couple of minutes they scaled back up to normal, but the distributor was messed up: out of 18 replicas only one got traffic. The k8s services did list all of the pods correctly and I didn't see errors in any of the pods; my guess is that the sudden scale-up messed up the ring/memberlist stuff

dimitarvdimitrov commented 4 months ago

> 1. after I removed preserveReplicas, my distributor, querier and query-frontend (we don't run rulers) all got scaled to 1

how long after deploying the HPA did this happen? Was the HPA already working?

> 2. after a couple of minutes they scaled back up to normal, but the distributor was messed up: out of 18 replicas only one got traffic. The k8s services did list all of the pods correctly and I didn't see errors in any of the pods; my guess is that the sudden scale-up messed up the ring/memberlist stuff

this looks like a problem with load balancing. The ring doesn't determine distributor load. Perhaps the reverse proxy in front of the distributors didn't get the update quickly enough (or DNS was cached, etc.). What reverse proxy are you using - is it the nginx from the chart? Did this resolve itself eventually?

KyriosGN0 commented 4 months ago
  1. almost immediately (up to maybe a minute later), and yes the HPA was working
  2. I used a Service of type LoadBalancer (running GKE), so yeah, I guess the GKE LB didn't route this correctly; I solved this manually by the only pod that received traffic
dimitarvdimitrov commented 4 months ago
> 1. almost immediately (up to maybe a minute later), and yes the HPA was working

I'd expect that removing preserveReplicas after KEDA's pollingInterval of 10s has passed is enough for the HPA to take over, but I can't be sure. We should retest this procedure before publishing the docs.
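
For example, a rough way to confirm the HPA has taken over before removing `preserveReplicas` (a sketch, assuming everything runs in namespace `mimir`):

```bash
# The HPA created by KEDA should exist and report real metric values
# (not <unknown>) in the TARGETS column before preserveReplicas is removed.
kubectl get hpa -n mimir

# The KEDA ScaledObject should report READY as True.
kubectl get scaledobject -n mimir
```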

beatkind commented 4 months ago
> > 1. almost immediately (up to maybe a minute later), and yes the HPA was working
>
> I'd expect that removing preserveReplicas after KEDA's pollingInterval of 10s has passed is enough for the HPA to take over, but I can't be sure. We should retest this procedure before publishing the docs.

Good point, will validate it again with a test env

skinlayers commented 3 weeks ago

Just wanted to leave a note that we also encountered our replicas scaling down to 1 when we set preserveReplicas to false for the Mimir querier in our test environment. KEDA had been deployed several days before, so it seems like step 2 of the migration isn't adequate (unless I'm missing something). We can't afford to have our queriers drop to 1 in production, so we would like some guidance on how to safely proceed. Or is KEDA autoscaling still too experimental and should it be avoided in production?

```yaml
querier:
  replicas: 3
  kedaAutoscaling:
    enabled: true
    preserveReplicas: false
    minReplicaCount: 3
    maxReplicaCount: 40
```