container_cluster.node_config.kubelet_config.cpu_cfs_quota value is set in cluster request even when no value is set in config

hashicorp / terraform-provider-google

Terraform Provider for Google Cloud Platform

https://registry.terraform.io/providers/hashicorp/google/latest/docs

Mozilla Public License 2.0

2.3k stars 1.73k forks source link

container_cluster.node_config.kubelet_config.cpu_cfs_quota value is set in cluster request even when no value is set in config #15767

Open hoskeri opened 1 year ago

hoskeri commented 1 year ago

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
If you are interested in working on this issue or have submitted a pull request, please leave a comment.
If an issue is assigned to the modular-magician user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to hashibot, a community member has claimed the issue already.

Terraform Version

cloudtop:~$ terraform version Terraform v1.5.6 on linux_amd64

Affected Resource(s)

google_container_node_pool google_container_cluster

Terraform Configuration Files

resource "google_container_cluster" "primary" {
  name     = "some-cluster"
  location = "us-central1-a"
  initial_node_count       = 1
  node_config {
    kubelet_config {
      cpu_manager_policy = "none"
      pod_pids_limit = 8192
    }
  }
}

Debug Output

023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5: ---[ REQUEST ]---------------------------------------
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5: POST /v1/projects/hoskeri-gke-dev/locations/us-central1-a/clusters?alt=json&prettyPrint=false HTTP/1.1
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5: Host: container.googleapis.com
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5: User-Agent: google-api-go-client/0.5 Terraform/1.5.6 (+https://www.terraform.io) Terraform-Plugin-SDK/2.10.1 terraform-provider-google/4.81.0

...
023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5:   "networkConfig": {},
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5:   "networkPolicy": {},
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5:   "nodeConfig": {
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5:    "kubeletConfig": {
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5:     "cpuCfsQuota": false,
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5:     "cpuManagerPolicy": "none",
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5:     "podPidsLimit": "8192"
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5:    },
2023-09-07T20:43:49.384Z [DEBUG] provider.terraform-provider-google_v4.81.0_x5:    "loggingConfig": {

Panic Output

Expected Behavior

I would generally expect that any unset fields in configuration would not have a value in the cluster create request.

In this case, (refer to hcl above) cluster.node_config.kubelet_config.cpu_cfs_quota should not have resulted in that value not being set in the cluster create request, and resulting cluster should have had that value not set at all (set to null)

Actual Behavior

However, the value of cpu_cfs_quota is set by the provider to false in the cluster create request. The only way to avoid this outcome is to not set kubelet_config at all (or force cpu_cfs_quota to true.

Steps to Reproduce

terraform apply

Important Factoids

b/300067081

hoskeri commented 1 year ago

Also, the debug log had the following warnings...

2023-09-07T20:47:04.246Z [WARN]  Provider "provider[\"registry.terraform.io/hashicorp/google\"]" produced an unexpected new value for google_container_cluster.primary, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .node_config[0].boot_disk_kms_key: was null, but now cty.StringVal("")
      - .node_config[0].node_group: was null, but now cty.StringVal("")
      - .node_config[0].kubelet_config[0].cpu_cfs_quota: was null, but now cty.False
      - .node_config[0].kubelet_config[0].cpu_cfs_quota_period: was null, but now cty.StringVal("")

bobbypage commented 1 year ago

/cc

wyardley commented 1 month ago

Could this be somehow related to this line and / or this PR?

wyardley commented 1 month ago

Wasn't immediately obvious to me since cpu_manager_policy is required if node_config.kubelet_config is set, and the docs only mention two available options, but updating the value doesn't actually make any changes on the default node pool, and I actually get a permadiff with it set to either "none" or "static". Setting the value to the undocumented (but allowed by the validation) available empty string ("").

node_config {
    kubelet_config {
      cpu_manager_policy = "none"

wyardley commented 4 weeks ago

Yeah, came across this again when working on https://github.com/GoogleCloudPlatform/magic-modules/pull/11572

Basically, I think the issue is that the API default is "true", and doesn't send a response when the value is the default value: https://pkg.go.dev/google.golang.org/api/container/v1#NodeKubeletConfig

So, to really fix this, unless they want to do a breaking change and switch the attribute to string values (unlikely with 6.0 just having gone out), I'm guessing probably the provider needs to either disable force sending the field (and suppress a permadiff possibly?), and / or infer the default being true when unset? I'm not sure what the side effects of disabling force send would be.

melinath commented 4 weeks ago

Is this showing up as a proposed change in the plan?

melinath commented 4 weeks ago

Also, the debug log had the following warnings...

2023-09-07T20:47:04.246Z [WARN]  Provider "provider[\"registry.terraform.io/hashicorp/google\"]" produced an unexpected new value for google_container_cluster.primary, but we are tolerating it because it is using the legacy plugin SDK.
    The following problems may be the cause of any confusing errors from downstream operations:
      - .node_config[0].boot_disk_kms_key: was null, but now cty.StringVal("")
      - .node_config[0].node_group: was null, but now cty.StringVal("")
      - .node_config[0].kubelet_config[0].cpu_cfs_quota: was null, but now cty.False
      - .node_config[0].kubelet_config[0].cpu_cfs_quota_period: was null, but now cty.StringVal("")

This kind of warning doesn't usually indicate a real problem and is safe to ignore.

wyardley commented 4 weeks ago

Is this showing up as a proposed change in the plan?

@melinath so I think one consequence of this is that if you set any (other) element in node_config.kubelet_config, I believe this will force the value to false vs. the API default of true. Because in-place updates don't currently work (at least for the nested value), if the cluster is created without a kubelet_config and then one is added, there will be a permadiff where it will try to change the value. If we fix updates as has been discussed, this would make the situation worse.

Creating a cluster with the value explicitly set to true or false seems to work in my testing.

I haven't tested with a separate nodepool resource, but I would guess there's a similar (and maybe more risky) issue there?

Also, I think the provider is infering a value of false when the API doesn't return the value (because it's optional and default) when it's also true, but could be wrong there.

Edit: I'll try to post a practical example below that I think will help illustrate it.

wyardley commented 4 weeks ago

Initial config:

resource "google_container_cluster" "test_cfs_quota" {
  name               = "test-cfs-quota"
  location           = "us-central1-f"
  initial_node_count = 1

  node_config {
    kubelet_config {
      # Other values not set here
      cpu_manager_policy = "static"
    }
  }

  deletion_protection = false
  network             = "default"
  subnetwork          = "default"
}

after applying:

% gcloud container clusters describe --location us-central1-f test-cfs-quota | yq .nodeConfig.kubeletConfig                     
cpuCfsQuota: false
cpuManagerPolicy: static
% gcloud container node-pools describe default-pool --cluster=test-cfs-quota --location=us-central1-f --flatten=config --format="value(kubeletConfig)"
cpuCfsQuota=False;cpuManagerPolicy=static

Note that cpuCfsQuota=False despite it not being set in the config, and API default being True. IMO, this is maybe the most problematic / unexpected behavior here. However, since we created the cluster this way, it will apply clean.

Now we have the permadiff from the update stuff not actually functioning

      ~ node_config {
            tags                        = []
            # (17 unchanged attributes hidden)

          ~ kubelet_config {
              ~ cpu_cfs_quota      = false -> true
                # (2 unchanged attributes hidden)
            }

But now, let's destroy and recreate the cluster without anything in node_config.kubelet_config so we get the default setting of unset (which is true / True) for cpu_cfs_quota:

resource "google_container_cluster" "test_cfs_quota" {
  name               = "test-cfs-quota"
  location           = "us-central1-f"
  initial_node_count = 1

  deletion_protection = false
  network             = "default"
  subnetwork          = "default"
}

Now we get:

% gcloud container clusters describe --location us-central1-f test-cfs-quota | yq .nodeConfig.kubeletConfig
null
% gcloud container node-pools describe default-pool --cluster=test-cfs-quota --location=us-central1-f --flatten=config --format="value(kubeletConfig)"                     
[no output]

I can't reproduce any other major issues there if other parameters are set and cpu_cfs_quota is not, but if I add an explicit cpu_cfs_quota = true:

  node_config {
    kubelet_config {
      cpu_manager_policy = ""
      cpu_cfs_quota      = true
    }
  }

We'll end up with this permadiff, since the provider doesn't realize that unset is true:

      ~ node_config {
            tags                        = []
            # (17 unchanged attributes hidden)

          + kubelet_config {
              + cpu_cfs_quota = true
            }

            # (1 unchanged block hidden)

It's fair that if #19225 were fixed, this situation would be somewhat better; I'm not sure if we'd end up with an implicit or explicit true value then.

melinath commented 4 weeks ago

Relevant contribution docs: https://googlecloudplatform.github.io/magic-modules/develop/permadiff/

wyardley commented 4 weeks ago

Thanks, yes, have reviewed that semi-recently, but didn't think about it in relation to this

Are you suggesting to set the client-side default in this case as described here?:

"field": {
     Default: "DEFAULT_VALUE",
}

to the field? And would this be breaking then?

or just to update the flattener like the example here

wyardley commented 4 weeks ago

Maybe something like this (though I'm still seeing it false with this change) [edit: probably need to handle v as well as c here]:

diff --git a/mmv1/third_party/terraform/services/container/node_config.go.erb b/mmv1/third_party/terraform/services/container/node_config.go.erb
index 14ec2f469..0fe38be05 100644
--- a/mmv1/third_party/terraform/services/container/node_config.go.erb
+++ b/mmv1/third_party/terraform/services/container/node_config.go.erb
@@ -2,6 +2,7 @@
 package container

 import (
+   "reflect"
 <% unless version == "ga" -%>
    "strings"
 <% end -%>
@@ -602,6 +603,7 @@ func schemaNodeConfig() *schema.Schema {
                            "cpu_cfs_quota": {
                                Type:     schema.TypeBool,
                                Optional: true,
+                               Computed: true,
                                Description: `Enable CPU CFS quota enforcement for containers that specify CPU limits.`,
                            },
                            "cpu_cfs_quota_period": {
@@ -1585,6 +1587,13 @@ func flattenSecondaryBootDisks(c []*container.SecondaryBootDisk) []map[string]in
    return result
 }

+func flattenCpuCfsQuota(c *container.NodeKubeletConfig) bool {
+   if c != nil && tpgresource.IsEmptyValue(reflect.ValueOf(c.CpuCfsQuota)) {
+       return true
+   }
+   return c.CpuCfsQuota
+}
+
 func flattenInsecureKubeletReadonlyPortEnabled(c *container.NodeKubeletConfig) string {
    // Convert bool from the API to the enum values used internally
    if c != nil && c.InsecureKubeletReadonlyPortEnabled {
@@ -1742,7 +1751,7 @@ func flattenKubeletConfig(c *container.NodeKubeletConfig) []map[string]interface
    result := []map[string]interface{}{}
    if c != nil {
        result = append(result, map[string]interface{}{
-           "cpu_cfs_quota":                          c.CpuCfsQuota,
+           "cpu_cfs_quota":                          flattenCpuCfsQuota(c),
            "cpu_cfs_quota_period":                   c.CpuCfsQuotaPeriod,
            "cpu_manager_policy":                     c.CpuManagerPolicy,
            "insecure_kubelet_readonly_port_enabled": flattenInsecureKubeletReadonlyPortEnabled(c),

melinath commented 4 weeks ago

Are you suggesting to set the client-side default to the field? And would this be breaking then?

I'm not making a specific recommendation, just noting that this seems to have a permadiff component. A change to the client-side default would be a breaking change that requires a major release. A change to Optional+Computed would not be breaking and might be appropriate in this case, though I haven't reviewed it closely.

If you're going with O+C, I think the custom flattener is unnecessary.

wyardley commented 4 weeks ago

Ok, sounds good. I can try just the simplest way and see if it somehow magically Just Works.

wyardley commented 4 weeks ago

diff --git a/mmv1/third_party/terraform/services/container/node_config.go.erb b/mmv1/third_party/terraform/services/container/node_config.go.erb
index 14ec2f469..cf9631eec 100644
--- a/mmv1/third_party/terraform/services/container/node_config.go.erb
+++ b/mmv1/third_party/terraform/services/container/node_config.go.erb
@@ -602,6 +602,7 @@ func schemaNodeConfig() *schema.Schema {
                            "cpu_cfs_quota": {
                                Type:     schema.TypeBool,
                                Optional: true,
+                               Computed: true,
                                Description: `Enable CPU CFS quota enforcement for containers that specify CPU limits.`,
                            },
                            "cpu_cfs_quota_period": {
@@ -1175,7 +1176,6 @@ func expandKubeletConfig(v interface{}) *container.NodeKubeletConfig {
    }
    if cpuCfsQuota, ok := cfg["cpu_cfs_quota"]; ok {
        kConfig.CpuCfsQuota = cpuCfsQuota.(bool)
-       kConfig.ForceSendFields = append(kConfig.ForceSendFields, "CpuCfsQuota")
    }
    if cpuCfsQuotaPeriod, ok := cfg["cpu_cfs_quota_period"]; ok {
        kConfig.CpuCfsQuotaPeriod = cpuCfsQuotaPeriod.(string)

With ForceSendFields it still sets it to the false value, but with it removed, it seems to behave more closely to how I'd expect.

I'm not sure if there are cases where that change would cause an issue.

edit: It creates as I'd expect, but then it gets the values mixed up, as you also might expect.

For example, with the code above only, when the cluster was created with the default (true) value implicitly, it shows up as a noop on subsequent plans if the value is set to false in the config, and as a permadiff if it's set to true.