Open nemethloci opened 7 months ago
Just to make clear: this is a request to improve documentation. Solving the issue behind this goes beyond the scope of this ticket and probably requires the cooperation of the GCP terraform provider developers and GCP itself.
Hi @nemethloci you can found an alternative documented here for this scenario: https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool
So you can approach features like strategy and blue_green_settings for upgrades adjusting it according to your requirements.
Hi @ggtisc the feature you referenced is valid only with regard to upgrades and unfortunately does not apply when changing other aspects of the nodepool, which my request is about (ie: disk type, node type...)
@nemethloci could you please provide the code of the resource google_container_cluster.primary
?
Here you go:
resource "google_container_cluster" "primary" {
provider = google-beta
project = var.project_id
location = var.region
name = var.cluster_name
network = var.cluster_network
subnetwork = var.cluster_subnetwork
logging_service = var.logging_service
monitoring_service = var.monitoring_service
min_master_version = local.kubernetes_version
release_channel {
channel = "UNSPECIFIED"
}
enable_shielded_nodes = var.enable_shielded_nodes
dynamic "master_authorized_networks_config" {
for_each = var.master_authorized_networks_config
content {
dynamic "cidr_blocks" {
for_each = lookup(master_authorized_networks_config.value, "cidr_blocks", [])
content {
cidr_block = cidr_blocks.value.cidr_block
display_name = lookup(cidr_blocks.value, "display_name", null)
}
}
}
}
ip_allocation_policy {
cluster_secondary_range_name = var.cluster_ip_range_pods
services_secondary_range_name = var.cluster_ip_range_services
}
# We can't create a cluster with no node pool defined, but we want to only use
# separately managed node pools. So we create the smallest possible default
# node pool and immediately delete it.
node_pool {
name = "default-pool"
initial_node_count = 1
}
remove_default_node_pool = true
# We can optionally control access to the cluster
# See https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters
private_cluster_config {
enable_private_endpoint = var.disable_public_endpoint
enable_private_nodes = var.enable_private_nodes
master_ipv4_cidr_block = var.master_ipv4_cidr_block
}
master_auth {
client_certificate_config {
issue_client_certificate = false
}
}
dynamic "authenticator_groups_config" {
for_each = var.enable_authenticator_groups_config ? [1] : []
content {
security_group = var.authenticator_groups_config_security_group
}
}
binary_authorization {
evaluation_mode = var.enable_binary_authorization
}
pod_security_policy_config {
enabled = var.enable_pod_security_policy_config
}
// Configure the workload identity "identity namespace". Requires additional
// configuration on the node pool for workload identity to function.
// See PATT-24
workload_identity_config {
workload_pool = "${var.project_id}.svc.id.goog"
}
addons_config {
http_load_balancing {
disabled = !var.http_load_balancing
}
horizontal_pod_autoscaling {
disabled = !var.horizontal_pod_autoscaling
}
network_policy_config {
disabled = !var.enable_network_policy
}
istio_config {
disabled = !var.enable_managed_istio
auth = var.managed_istio_auth
}
config_connector_config {
enabled = var.enable_config_connector
}
}
network_policy {
# Based on: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/issues/656#issuecomment-910108440
# Enabling NetworkPolicy for clusters with DatapathProvider=ADVANCED_DATAPATH is not allowed (yields error)
enabled = var.enable_dataplane_v2 ? false : (var.enable_network_policy ? true : false)
# Tigera (Calico Felix) is the only provider
# CALICO provider overrides datapath_provider setting, leaving Dataplane v2 disabled
provider = var.enable_dataplane_v2 ? "PROVIDER_UNSPECIFIED" : (var.enable_network_policy ? "CALICO" : "PROVIDER_UNSPECIFIED")
}
# This is where Dataplane V2 is enabled.
datapath_provider = var.enable_dataplane_v2 ? "ADVANCED_DATAPATH" : "DATAPATH_PROVIDER_UNSPECIFIED"
dynamic "resource_usage_export_config" {
for_each = var.enable_metering ? [1] : []
content {
enable_network_egress_metering = true
enable_resource_consumption_metering = true
bigquery_destination {
dataset_id = google_bigquery_dataset.cluster_resource_usage[0].dataset_id
}
}
}
lifecycle {
ignore_changes = [node_pool]
}
maintenance_policy {
daily_maintenance_window {
start_time = var.maintenance_start_time
}
}
timeouts {
create = "30m"
update = "30m"
delete = "30m"
}
}
One more update: google support has confirmed, that it is expected, that both upon nodepool deletion and manual draining of all nodes of a nodepool, unless the autoscaler is disabled first, new nodes would be spinned of without being cordoned by the autoscaler. As such the 'workaround' process to be used is:
Note from triage: this doesn't look like a duplicate - that issue was raised by the same author.
@nemethloci is this the same issue of the 17668?
Yes and no:
Why I have raised two tickets: when I originally reported the documentation related (this) ticket, I thought (based on some early feedback from google engineers), that google will admit, that this is a bug and they will fix it themselves, hence there would be no need to alter the terraform provider itself.
FYI: google have created a feature request (without any commitment with regard to timeline): https://issuetracker.google.com/issues/331403526
I think https://github.com/GoogleCloudPlatform/magic-modules/pull/12014 will at least partially help, by allowing in-place updates (without recreating the node pool) of at least some of the fields in google_container_cluster.node_config
I didn't mark this as being "fixed by" that, because I'm not 100% sure it solves all cases and if it's the same underlying issue, but feel free to update that and / or comment here if you think that should fix it.
Agreed that in-place updates help, but there are still some ForceNew fields within node_config
, so I think we should keep this open as a request to mention the dangers (like abrupt node termination) of changing those fields on this resource.
Community Note
Terraform Version
v1.7.4
Affected Resource(s)
google_container_node_pool
Terraform Configuration
Debug Output
No response
Expected Behavior
In case of a nodepool change, that requires changing the underlying template (which is immutable), the provider would ideally:
This is unfortunately not the current behavior probably because the above process would require the 'interaction' of multiple providers (google provider to manage the nodepools and the kubernetes provider to drain the nodes. Not sure if the nodepool resize operation on the GCP API could be abused for draining...?). Considering the nodepool upgrades work as expected following the above approach I think it would worth to emphasize in the documentation what would happen in case of non-version related changes to the node pool resource.
This would be especially important as not being aware of the above can contain service outage (see actual behavior for details)
Actual Behavior
Without create_before_destroy flag the nodepool is deleted and then recreated resulting in complete outage with regard to the services bound to the nodepool. In case it is set, there will still be outage of all services as:
Steps to reproduce
terraform apply
Important Factoids
No response
References
https://github.com/hashicorp/terraform-provider-google/issues/10895
b/331218437