terraform-google-modules / terraform-google-kubernetes-engine

Configures opinionated GKE clusters
https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google
Apache License 2.0

Recreating Node Pools causes temporary outage #794

Open ghost opened 3 years ago

ghost commented 3 years ago

Even when using a node pool with create_before_destroy, there is no mechanism to taint/drain the old nodes before deleting the old node pool. This results in applications that reside entirely in one node pool becoming unavailable.

This is made worse when using admission controllers that rely on services running on those nodes (e.g., Istio).

Perhaps a destroy provisioner that taints and drains the node pool in Kubernetes would improve this behaviour?
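
Something like the following rough sketch, which is not something this module provides today; the standalone google_container_node_pool.pool resource and the assumption that kubectl on the machine running Terraform is already authenticated against the cluster are only for illustration:

# Hypothetical sketch: drain the old pool's nodes before Terraform deletes the pool.
# Destroy-time provisioners may only reference self, so the pool name is captured in triggers.
resource "null_resource" "drain_before_destroy" {
  triggers = {
    node_pool_name = google_container_node_pool.pool.name
  }

  provisioner "local-exec" {
    when    = destroy
    command = <<-EOT
      for node in $(kubectl get nodes -l cloud.google.com/gke-nodepool=${self.triggers.node_pool_name} -o name); do
        kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=10m
      done
    EOT
  }
}

Because the null_resource references the node pool, Terraform would destroy it first, so the drain runs before the pool itself is removed.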

nstogner commented 3 years ago

Having this module drain nodes would be a killer feature that would improve devops'ers lives everywhere.

morgante commented 3 years ago

Agreed, this would be a very nice (optional) addition. I probably won't have time to work on it soon, but would be happy to review a PR.

skakauridze-clgx commented 3 years ago

Hello @morgante, is there any update on this? Can we upvote to raise its priority, maybe through a Google engagement?

morgante commented 3 years ago

I've done a bit of investigation, and my understanding is that GKE actually automatically drains pods on deleted node pools before deleting the nodes. Is anyone observing different behavior, particularly when using the create_before_destroy update variants?

If we can get by with native functionality, I do want to avoid overcomplicating this module where possible.

obenami-clgx commented 3 years ago

@morgante I see conflicting documentation on whether GKE automatically drains pods on deleted node pools (assuming the CLI isn't somehow special; see here vs here). Can you clarify?

lauraseidler commented 2 years ago

We have the same issue - even when using the update variants. The new node pool does get created first, but as soon as it's up, the old one is deleted without being drained, so all existing pods are terminated immediately, without respecting disruption budgets etc.

@morgante if this is still something that could be a good fit, I (or someone from my team) might give this a shot, as it's quite disruptive for us.

florianmutter commented 1 year ago

We "solved" this by creating a conftest policy that prevents deletion of node pools with more than 0 nodes. Our clusters have 2 node pools so we need to manually drain the node pool that will be deleted. This prevents us from forgetting to drain it but it does not do anything automatically. You can run this agains the terraform plan in json format:

package main

# Prevent node pools from being deleted if there are still nodes in them. When deleting node pools GKE does not respect
# Pod Disruption Budgets. Therefore we need to make sure we drained the nodes by hand before deleting a node pool.
# See https://cloud.google.com/kubernetes-engine/docs/concepts/node-pools#drain
deny[msg] {
    # Bind the node pool change once so address, reason and node count all refer to the same pool
    pool := changing_node_pools_to_cluster[_][_]
    pool.change.actions[_] == "delete"
    node_count := pool.change.before.node_count
    node_count != 0
    msg := sprintf("Pool %v will be deleted (reason: %v) but has %v nodes. Please drain and scale down to 0 first.", [pool.address, pool.action_reason, node_count])
}

# Creates an object for all clusters and associated node pools with the cluster address as key and a list of node pools as value.
# Contains only the node pools where the change action is not one of "no-op", "read" or "create". See https://developer.hashicorp.com/terraform/internals/json-format#change-representation for possible values.
# Example:
# { "module.gke.google_container_cluster.primary":
#   [
#       {
#           "action_reason": "replace_because_cannot_update",
#           "address": "module.gke.google_container_node_pool.pools[\"pool1\"]",
#           "change": {
#               "actions": [
#                   "delete",
#                   "create"
#               ],
#               "after": ...
#           }
#       },
#       { ... }
#   ]
# }
changing_node_pools_to_cluster[cluster_address] := node_pools {
    some c

    # Find all planned resources of type "google_container_cluster"
    planned_resources[c].type == "google_container_cluster"
    # Use the address as key in the resulting object
    cluster_address := c.address

    node_pools := [node_pool |
        some n, action
        input.resource_changes[n].type == "google_container_node_pool"
        # Evaluates both variants of change_belongs_to_cluster and if one matches this evaluates to true
        change_belongs_to_cluster(input.resource_changes[n].change, c.values.name)
        input.resource_changes[n].change.actions[action] != "no-op"
        input.resource_changes[n].change.actions[action] != "read"
        input.resource_changes[n].change.actions[action] != "create"
        node_pool := input.resource_changes[n]
    ]
}

# Variant to match cluster name in after block of change
change_belongs_to_cluster(change, cluster_name) {
    change.after.cluster == cluster_name
}

# Variant to match cluster name in before block of change
change_belongs_to_cluster(change, cluster_name) {
    change.before.cluster == cluster_name
}

# Creates an object for all clusters and associated node pools with the cluster address as key and a list of node pool addresses as value.
# Contains all clusters that are defined in the terraform configuration.
# Example: { "module.gke.google_container_cluster.primary": ["module.gke.google_container_node_pool.pools[\"pool1\"]", "module.gke.google_container_node_pool.pools[\"pool2\"]"] }
planned_node_pools_to_cluster[cluster_address] := node_pools {
    some c

    planned_resources[c].type == "google_container_cluster"
    cluster_address := c.address

    node_pools := [node_pool |
        some n
        planned_resources[n].type == "google_container_node_pool"
        planned_resources[n].values.cluster == c.values.name
        node_pool := planned_resources[n].address
    ]
}

changed_resources := input.resource_changes

# Set of all planned resources, regardless of whether they are root resources or child resources of a module.
planned_resources := {r |
    some path, value

    # Walk over the JSON tree and check if the node we are
    # currently on is a module (either root or child) planned resources
    # value.
    walk(input.planned_values, [path, value])

    # Look for planned resources in the current value based on path
    rs := module_resources(path, value)

    # Aggregate them into `resources`
    r := rs[_]
}

# Variant to match root_module resources
module_resources(path, value) := rs {
    # Expect something like:
    #
    #     {
    #         "root_module": {
    #             "resources": [...],
    #             ...
    #         }
    #         ...
    #     }
    #
    # Where the path is [..., "root_module", "resources"]

    reverse_index(path, 1) == "resources"
    reverse_index(path, 2) == "root_module"
    rs := value
}

# Variant to match child_modules resources
module_resources(path, value) := rs {
    # Expect something like:
    #
    #     {
    #         ...
    #         "child_modules": [
    #             {
    #                 "resources": [...],
    #                 ...
    #             },
    #             ...
    #         ]
    #         ...
    #     }
    #
    # Where the path is [..., "child_modules", 0, "resources"]
    # Note that there will always be an index int between `child_modules`
    # and `resources`. We know that walk will only visit each one once,
    # so we shouldn't need to keep track of what the index is.

    reverse_index(path, 1) == "resources"
    reverse_index(path, 3) == "child_modules"
    rs := value
}

reverse_index(path, idx) := value {
    value := path[count(path) - idx]
}
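
For reference, a typical way to evaluate the policy is to export the plan as JSON and run conftest against it, roughly: terraform plan -out=tfplan, then terraform show -json tfplan > tfplan.json, and finally conftest test tfplan.json with this policy file on conftest's policy path.
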
nemethloci commented 1 year ago

The situation is even worse if you have cluster autoscaling. Even if create_before_destroy is set: if the node pool has a low minimum node count, the replacement pool is created with only that minimum number of nodes. Once it is up, the old node pool is destroyed without being properly drained (at least PodDisruptionBudgets do not appear to be respected). Assuming the old node pool had scaled out well beyond that minimum, most pods are terminated and sit in Pending until the autoscaler incrementally grows the new pool enough to schedule them.
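
A partial mitigation (hypothetical, not something this module automates; the pool name, machine type and sizes below are made up, and it assumes the module's node_pools variable with min_count/max_count keys): temporarily raise the minimum node count in the pool definition so the replacement pool comes up with roughly the old pool's current capacity before the old one is destroyed, for example:

node_pools = [
  {
    name         = "app-pool"        # hypothetical pool name
    machine_type = "e2-standard-4"
    min_count    = 10                # temporarily match the old pool's current size instead of the usual low minimum
    max_count    = 30
  },
]

Then lower min_count again once the replacement pool has absorbed the workload. This only softens the capacity gap; it still does not make GKE respect PodDisruptionBudgets while the old pool is deleted.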

savanipoojan78 commented 11 months ago

In reference to the documentation linked above, note that Pod Disruption Budgets (PDBs) might not be honored when deleting a node pool in Google Kubernetes Engine (GKE). This makes it difficult to replace or scale node pools without causing downtime for workloads.

The manual workaround (create a new node pool, cordon the old pool's nodes one by one, delete the old nodes, then import the new node pool into the state file) is cumbersome, particularly when dealing with multiple node pools.

Is there a more efficient way to handle node pool scaling while maintaining high availability for workloads?