hashicorp / terraform-provider-azurerm

Terraform provider for Azure Resource Manager
https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs
Mozilla Public License 2.0

Every plan run resets upgrade_settings.max_surge for default_node_pool #24020

Open mloskot opened 11 months ago

mloskot commented 11 months ago

Terraform Version

1.6.4

AzureRM Provider Version

3.82.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

// https://registry.terraform.io/providers/hashicorp/azurerm/3.82.0/docs/resources/kubernetes_cluster#example-usage
resource "azurerm_kubernetes_cluster" "example" {
  name                = "example-aks1"
  location            = azurerm_resource_group.example.location
  resource_group_name = azurerm_resource_group.example.name
  dns_prefix          = "exampleaks1"

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D2_v2"
  }

  //...
}

Debug Output/Panic Output

~ default_node_pool {
            name                         = "default"
            ...
            # (23 unchanged attributes hidden)

          - upgrade_settings {
              - max_surge = "10%" -> null
            }
        }

        # (7 unchanged blocks hidden)
    }

Plan: 0 to add, 1 to change, 0 to destroy.

Expected Behaviour

Subsequent runs of terraform plan should not report any changes to the upgrade_settings.max_surge property.

Actual Behaviour

I use azurerm_kubernetes_cluster to create a new cluster without specifying custom upgrade_settings in default_node_pool, and Azure calculates a default "Maximum surge" for me:

(screenshot: the default "Maximum surge" value calculated by Azure for the node pool)

Then, every time I run terraform plan, still without customised upgrade_settings in my `.tf` files, the provider always tries to modify my cluster with:

- upgrade_settings {
      - max_surge = "10%" -> null
  }

This is not expected behaviour, is it?

Steps to Reproduce

  1. terraform plan
  2. terraform apply
  3. terraform plan
  4. terraform plan
  5. ...

Important Factoids

No response

References

No response

paulgmiller commented 10 months ago

AKS changed the default max surge in the October release: if you are on Kubernetes > 1.28, max surge now defaults to 10% (previously it was left blank, which implied 1 under the covers).

Release 2023-10-01 · Azure/AKS (github.com)

Is there a way we could have done this better for terraform?

aa2811 commented 10 months ago

Just to add, we've worked around this by explicitly setting max_surge = "10%" in our terraform files, to avoid the unnecessary changes on plan/apply cycles.
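
For illustration, a minimal sketch of that workaround applied to the example cluster from this issue (a sketch, not necessarily the exact code aa2811 used):

resource "azurerm_kubernetes_cluster" "example" {
  // ... name, location, resource_group_name, dns_prefix as in the issue example

  default_node_pool {
    name       = "default"
    node_count = 1
    vm_size    = "Standard_D2_v2"

    // Pin max_surge to the value AKS now applies by default, so plan stays clean.
    upgrade_settings {
      max_surge = "10%"
    }
  }
}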

jayctran commented 9 months ago

EDIT: Ignore me - rookie error. I was working off the wrong duplicated file; this also resolved my issue once I modified the correct file.

Just to add, we've worked around this by explicitly setting max_surge = "10%" in our terraform files, to avoid the unnecessary changes on plan/apply cycles.

hi @aa2811, do you mind sharing how you did this? I've added this without any luck:

default_node_pool {
  upgrade_settings {
    max_surge = "10%"
  }
}

brunogomes1 commented 9 months ago

I have been temporarily avoiding this with the following.

In the resource "azurerm_kubernetes_cluster" I add a line in the lifecycle block (assuming you, like me, do not change the default node pool), e.g.:

lifecycle {
  ignore_changes = [
    default_node_pool[0],
  ]
}

And for the resource "azurerm_kubernetes_cluster_node_pool" I do the same, e.g.:

lifecycle {
  prevent_destroy = false
  ignore_changes = [
    upgrade_settings,
  ]
}
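
For context, a sketch of how that block can sit inside the node pool resource (resource names and sizes here are illustrative, not taken from this thread):

resource "azurerm_kubernetes_cluster_node_pool" "example" {
  name                  = "internal"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.example.id
  vm_size               = "Standard_D2_v2"
  node_count            = 1

  lifecycle {
    prevent_destroy = false
    // Ignore the server-calculated upgrade_settings so plan stays clean.
    ignore_changes = [
      upgrade_settings,
    ]
  }
}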

finkinfridom commented 5 months ago

lifecycle {
  prevent_destroy = false
  ignore_changes = [
    upgrade_settings,
  ]
}

This is indeed a workaround for the issue. But what if I want to change this value? Do I have to remove the lifecycle.ignore_changes metadata, apply the change, and then re-add ignore_changes?

Our team decided to always specify upgrade_settings.max_surge (which avoids the useless plan and apply operations), but unfortunately this cannot be applied to Spot instance node pools, where max_surge cannot be specified: an empty upgrade_settings block will trigger the initial scenario (where an update is seen every time), and providing a null max_surge value will trigger an error because the provider requires max_surge.

Any suggestion?

Pionerd commented 5 months ago

What if I want to change this value? I should remove the lifecycle.ignore_changes metadata, apply the change and then re-apply the ignore_changes?

Yes, but if you want to change it, there is no longer a need for the ignore_changes anyway, since from that moment onward you do define a value for it (instead of the missing 10% that others are suffering from).

Any suggestion?

This issue does not apply to Spot instances since, as you already mentioned, max_surge cannot be specified for them. So you are probably referring to tf code that tries to create both spot and on-demand node pools within the same logic, for which the solution would be to split it up into separate code for spot and on-demand.
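
For illustration only, a rough sketch of that split with two hypothetical module calls (module paths and variables are assumptions, not real code from this thread):

// On-demand node pools: pin max_surge explicitly to avoid the plan drift.
module "node_pools_ondemand" {
  source    = "./modules/node-pool-ondemand" // hypothetical module path
  max_surge = "10%"
  // ...
}

// Spot node pools: handle upgrade_settings separately, since max_surge
// cannot be specified for Spot pools (per the discussion above).
module "node_pools_spot" {
  source = "./modules/node-pool-spot" // hypothetical module path
  // ...
}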

finkinfridom commented 5 months ago

Yes, but if you want to change it, there is no longer a need for the ignore_changes anyway, since from that moment onward you do define a value for it (instead of the missing 10% that others are suffering from).

Yeah, you're right. Sorry, I was in a rush and didn't think properly about the question I asked. My bad.

This issue does not apply to Spot instances since, as you already mentioned, max_surge cannot be specified for them. So you are probably referring to tf code that tries to create both spot and on-demand node pools within the same logic, for which the solution would be to split it up into separate code for spot and on-demand.

Well, kind of. I mean, we have a dedicated module for AKS node pool management, but indeed we're executing it twice, so this leads to 2 different and (already) separate resources being created. Am I wrong?

I was thinking of adding a dynamic ignore_changes block for when priority is set to Spot. This should work, right?

[EDIT] I just found an open discussion and a closed PR (https://github.com/hashicorp/terraform/pull/32608) about my idea of a dynamic lifecycle block, but it is not implemented (it throws: Blocks of type "lifecycle" are not expected here). I think your suggested solution is the only feasible one: we'll create 2 separate modules for Regular node pools and Spot node pools.

FrancoisPoinsot commented 4 months ago

So I faced the same problem. I am also in the situation of attempting to use the same module for spot and non-spot node pools.

If anyone reads this, here is a workaround. This config is valid for spot node pools:

upgrade_settings {
  max_surge = ""
}

So you can always specify upgrade_settings, even for spot node pools, to work around the issue in this thread. And you can set the max_surge value dynamically to work around that other issue: https://github.com/hashicorp/terraform-provider-azurerm/issues/19355
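
A minimal sketch of setting max_surge dynamically in a shared module, assuming a hypothetical var.priority input that mirrors the node pool's priority argument:

upgrade_settings {
  // Spot pools accept an empty max_surge (per the comment above);
  // regular pools get an explicit 10% to match the current AKS default.
  max_surge = var.priority == "Spot" ? "" : "10%"
}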