vultr / terraform-provider-vultr

Terraform Vultr provider
https://www.terraform.io/docs/providers/vultr/
Mozilla Public License 2.0

[BUG] - Kubernetes node pool `node_quantity` state drift when auto-scaler enabled #472

Open AdamJacobMuller opened 3 months ago

AdamJacobMuller commented 3 months ago

Hi,

Describe the bug

If I create a VKE cluster like:

resource "vultr_kubernetes" "k8" {
    region  = "ewr"
    label   = "vke-test"
    version = "v1.28.2+1"

    node_pools {
        node_quantity = 1
        plan          = "vc2-1c-2gb"
        label         = "vke-nodepool"
        auto_scaler   = true
        min_nodes     = 1
        max_nodes     = 2
    }
} 

Every time terraform runs, if my cluster has scaled up from 1 node to 2, terraform sees the change and "fixes" node_quantity, scaling the cluster back down to 1 node. The autoscaler then sees that 1 node is not enough and scales the cluster back up to 2 nodes.

This is very disruptive for workflows and workloads that depend on the autoscaler.

To Reproduce

Steps to reproduce the behavior:

  1. create a cluster with terraform with the autoscaler enabled
  2. deploy enough workload to require the cluster to scale up beyond node_quantity nodes
  3. run terraform plan/apply again
  4. watch the cluster scale down to node_quantity and then back up to max_nodes (or whatever satisfies your workload)

Expected behavior

if auto_scaler == True:
    set min_nodes and max_nodes only
else:
    set node_quantity

Additional context

Thank you kindly.

optik-aper commented 2 months ago

My feeling is that removing the value updates would create a workflow expectation which is too opinionated for a provider. Not only would you have to silence/ignore the quantity updates, but you'd have to ignore the new node_pools[...].nodes elements as well. That goes against the spirit of the provider.

Have you tried using lifecycle rules to ignore_changes automatically? Here are examples for the two forms that a node pool resource takes in our provider:

resource "vultr_kubernetes" "k8" {
    region  = "ewr"
    label   = "vke-test"
    version = "v1.29.2+1"

    node_pools {
        node_quantity = 3
        plan          = "vc2-1c-2gb"
        label         = "vke-nodepool"
        auto_scaler   = true
        min_nodes     = 1
        max_nodes     = 3
    }

    lifecycle {
      ignore_changes = [node_pools]
    }
} 

resource "vultr_kubernetes_node_pools" "k8-np" {
  cluster_id = vultr_kubernetes.k8.id
  node_quantity = 3
  plan          = "vc2-1c-2gb"
  label         = "vke-nodepool-2"
  # auto_scaler   = true
  # min_nodes     = 1
  # max_nodes     = 3

  lifecycle {
    ignore_changes = [node_quantity]  
  }
}

With these settings, any updates that come to node_pools in the vultr_kubernetes resource are still recorded in the terraform state on refresh, but they no longer produce a diff, so terraform won't try to revert them. The same goes for node_quantity on the vultr_kubernetes_node_pools resource.

AdamJacobMuller commented 2 months ago

Hi @optik-aper,

Thanks for the lifecycle tip, I didn't know you could do that.

Specifically doing ignore_changes=[node_pools[0].node_quantity] is great and solves my immediate issue.
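
For reference, the narrower form looks roughly like this (the cluster config from my original report, with only the first pool's quantity ignored, so other node_pools changes still get applied):

resource "vultr_kubernetes" "k8" {
    region  = "ewr"
    label   = "vke-test"
    version = "v1.28.2+1"

    node_pools {
        node_quantity = 1
        plan          = "vc2-1c-2gb"
        label         = "vke-nodepool"
        auto_scaler   = true
        min_nodes     = 1
        max_nodes     = 2
    }

    lifecycle {
        # only the autoscaler-managed count is ignored; edits to plan, label, min_nodes,
        # or max_nodes in this config still produce diffs as usual
        ignore_changes = [node_pools[0].node_quantity]
    }
}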

With regards to the original issue, I still think the way this provider handles things is wrong, though (and if you look at other providers for kubernetes clusters, they seem to agree).

In my mind there are two modes for things:

A) you're using auto_scaler=true, in which case you should specify min_nodes and max_nodes (and it should refuse to accept node_quantity)
B) you're using auto_scaler=false, in which case you should specify node_quantity (and it should refuse to accept min_nodes and max_nodes)
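
As a rough sketch of what that could look like in HCL (hypothetical; these blocks illustrate the proposal and are not accepted by the provider today):

# Mode A (hypothetical): autoscaled pool, sized only by min_nodes/max_nodes
resource "vultr_kubernetes" "autoscaled" {
    region  = "ewr"
    label   = "vke-autoscaled"
    version = "v1.28.2+1"

    node_pools {
        plan        = "vc2-1c-2gb"
        label       = "vke-nodepool"
        auto_scaler = true
        min_nodes   = 1
        max_nodes   = 2
        # node_quantity omitted; under the proposal the provider would reject it here
    }
}

# Mode B (hypothetical): fixed-size pool, sized only by node_quantity
resource "vultr_kubernetes" "static" {
    region  = "ewr"
    label   = "vke-static"
    version = "v1.28.2+1"

    node_pools {
        node_quantity = 2
        plan          = "vc2-1c-2gb"
        label         = "vke-nodepool"
        # min_nodes/max_nodes omitted; rejected when auto_scaler is false
    }
}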

This behaviour mirrors how GCP (just the provider I'm most familiar with) works, for example.

Also, keep in mind that this is exactly how your web UI works right now: if I pick autoscale, I specify min/max; if I pick manual, I specify node quantity.