hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

Support for `temporary_name_for_rotation` for other nodepools #22265

Open saitakturk opened 1 year ago

saitakturk commented 1 year ago

Description

We would like `temporary_name_for_rotation` to be supported on additional node pools. When an update forces a node pool to be replaced, the pods could then be migrated to a temporary node pool before the original pool is destroyed, which would enable rolling updates for node pools.

New or Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster_node_pool

Potential Terraform Configuration

resource "azurerm_kubernetes_cluster_node_pool" "ondemand_gpu" {
  name                        = "ondemandgpu"
  kubernetes_cluster_id       = azurerm_kubernetes_cluster.k8s.id
  orchestrator_version        = var.orchestrator_version
  temporary_name_for_rotation = "ondemandtemp"

  vm_size   = var.ondemand_gpu_vm_size
  min_count = var.ondemand_gpu_min_count
  max_count = var.ondemand_gpu_max_count
  zones     = local.availability_zones

  enable_auto_scaling = true
}

References

No response

PeterBennink commented 1 year ago

This would make switching to AzureLinux with the recently added support a lot easier!

danijam commented 1 year ago

Would love to be able to change the vm_size of the node pool and have Terraform manage that change so that the desired state is reached without disrupting the workloads in the changed pool (create pool, move workloads, destroy pool, and so on).

I'd be interested to know how people generally work around this with AKS. How do you manage pool vm_size changes? On the one hand, Microsoft's Well-Architected reference material (https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/containers/aks/baseline-aks#use-infrastructure-as-code-iac) suggests using IaC. Great!

On the other hand, the page about changing the pool vm_size (https://learn.microsoft.com/en-us/azure/aks/resize-node-pool?tabs=azure-cli) only describes procedural script commands to achieve the outcome.

I can't see how to "nicely" reconcile these two worlds into something usable.

pmuszynski-gl commented 11 months ago

This feature would be nice. Changes to some node pool options force the node pool to be replaced. The pods then have nowhere to be evacuated to if they are pinned to the node pool (using taints and tolerations). The workaround is to create the temporary node pool manually.
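
A minimal sketch of that manual workaround, assuming the pool being replaced carries a taint and a label that the pinned pods depend on. The resource name, pool name, taint, and label values here are hypothetical. `var.ondemand_gpu_vm_size` and `azurerm_kubernetes_cluster.k8s` are borrowed from the configuration in the issue description.

```hcl
# Hypothetical temporary pool, added before applying the change that
# forces the original pool to be replaced, and removed afterwards.
resource "azurerm_kubernetes_cluster_node_pool" "ondemand_gpu_temp" {
  name                  = "ondemandtemp"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.k8s.id
  vm_size               = var.ondemand_gpu_vm_size
  node_count            = 1

  # Assumed values: these must mirror the taints and labels on the pool
  # being replaced, otherwise the pinned pods cannot be scheduled here.
  node_taints = ["sku=gpu:NoSchedule"]
  node_labels = {
    workload = "gpu"
  }
}
```

The temporary pool has to exist before the original pool is destroyed, so this typically means a separate apply before and after the actual change. That extra manual step is what the requested argument would remove.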

raswinraaj commented 6 months ago

Would like to know if this feature will be taken up in the near future?

rjones-projects commented 2 months ago

Currently we are using

lifecycle {
  create_before_destroy = true
  ignore_changes        = [node_count, name]
}

with a random suffix on the node pool name, implemented as: name = substr("${var.environment_name}nix${substr(md5(uuid()), 0, 4)}", 0, 11) # 1-12 characters

This works after a fashion, but it still leads to brief outages when workloads take time to start. The Go code written by stephybun for handling system node pools should work in the kubernetes_cluster_node_pool_resource in a similar way. Is there any plan to add this to the provider, with a similar temporary_name_for_rotation argument to trigger the cycling? It would be very helpful/essential in production workloads.
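
For comparison, here is a sketch of how the existing argument is set on the system (default) node pool of `azurerm_kubernetes_cluster`, which is the behaviour being requested for `azurerm_kubernetes_cluster_node_pool`. The cluster name, DNS prefix, VM size, pool names, and the `azurerm_resource_group.example` reference are illustrative assumptions.

```hcl
resource "azurerm_kubernetes_cluster" "k8s" {
  name                = "example-aks"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "exampleaks"

  default_node_pool {
    name       = "system"
    vm_size    = "Standard_D2s_v3"
    node_count = 1

    # Name of the temporary pool the provider uses while it rotates the
    # default pool for a change that would otherwise force replacement.
    temporary_name_for_rotation = "systemtemp"
  }

  identity {
    type = "SystemAssigned"
  }
}
```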

mustaFAB53 commented 1 month ago

Hi team, any updates on this much-needed feature?

aminmiri commented 1 month ago

Looking forward to seeing this feature added!