terraform-google-modules / terraform-google-kubernetes-engine

Configures opinionated GKE clusters
https://registry.terraform.io/modules/terraform-google-modules/kubernetes-engine/google
Apache License 2.0

fix: send provider enum values for `insecureKubeletReadonlyPortEnabled` #2145

Closed (wyardley closed this 1 month ago)

wyardley commented 1 month ago

Add insecureKubeletReadonlyPortEnabled to node_config.kubelet_config for the default node pool and for additional pools. It may also be necessary to define the top-level node_config more broadly for the case where remove_default_node_pool is set to false, which should probably be handled separately.

Also, the upstream provider (intentionally) uses an enum of "TRUE" / "FALSE" rather than a boolean. Update the code to follow this, and add a test case (also suggested by @apeabody) that covers the cluster-level setting vs. the node-pool one.
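
For illustration, a minimal sketch of the mapping this implies (the variable and local names are assumptions, not the module's actual ones): keep a tri-state boolean input and translate it into the enum string the provider expects.

```hcl
variable "insecure_kubelet_readonly_port_enabled" {
  description = "Whether the kubelet read-only port is enabled; null leaves the provider default in place."
  type        = bool
  default     = null
}

locals {
  # Map the boolean onto the "TRUE"/"FALSE" enum the provider expects,
  # passing null through so the provider default still applies.
  kubelet_readonly_port_enabled = var.insecure_kubelet_readonly_port_enabled != null ? upper(tostring(var.insecure_kubelet_readonly_port_enabled)) : null
}
```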

Fixes #2013

apeabody commented 1 month ago

/gcbrun

apeabody commented 1 month ago

Hmm, looks like we are getting lower-case in module.example.module.gke.google_container_node_pool.pools["pool-03"]:

Step #70 - "converge node-pool-local":           + kubelet_config {
Step #70 - "converge node-pool-local":               + cpu_cfs_quota                          = true
Step #70 - "converge node-pool-local":               + cpu_manager_policy                     = "static"
Step #70 - "converge node-pool-local":               + insecure_kubelet_readonly_port_enabled = "false"
Step #70 - "converge node-pool-local":               + pod_pids_limit                         = 4096
Step #70 - "converge node-pool-local":             }
Step #70 - "converge node-pool-local":        Error: expected node_config.0.kubelet_config.0.insecure_kubelet_readonly_port_enabled to be one of ["FALSE" "TRUE"], got false
Step #70 - "converge node-pool-local":        
Step #70 - "converge node-pool-local":          with module.example.module.gke.google_container_node_pool.pools["pool-03"],
Step #70 - "converge node-pool-local":          on ../../../modules/beta-public-cluster/cluster.tf line 656, in resource "google_container_node_pool" "pools":
Step #70 - "converge node-pool-local":         656:   node_config {

wyardley commented 1 month ago

Hmm, looks like we are getting lower-case in

I'll take a look. I thought I checked for any spots I'd missed, but apparently not.

apeabody commented 1 month ago

Thanks @wyardley!

Probably needs a make build

wyardley commented 1 month ago

Probably needs a make build

I got a chance to take a look. It doesn't look like it will work exactly as-is -- I'll make the adjustment I think it needs, then build and push the results.

edit: See if what's there now looks good to you

apeabody commented 1 month ago

/gcbrun

apeabody commented 1 month ago

Probably needs a make build

I got a chance to take a look. It doesn't look like it will work exactly as-is -- I'll make the adjustment I think it needs, then build and push the results.

edit: See if what's there now looks good to you

Thanks @wyardley - Yes, this looks good, and less complicated than what I suggested.

apeabody commented 1 month ago

I was expecting we might need to update the test asset; however, this is unexpected.

The node_pool_defaults plan now looks good:

TestPrivateZonalWithNetworking 2024-10-17T20:54:28Z command.go:185:       + node_pool_defaults {
TestPrivateZonalWithNetworking 2024-10-17T20:54:28Z command.go:185:           + node_config_defaults {
TestPrivateZonalWithNetworking 2024-10-17T20:54:28Z command.go:185:               + insecure_kubelet_readonly_port_enabled = "FALSE"
TestPrivateZonalWithNetworking 2024-10-17T20:54:28Z command.go:185:               + logging_variant                        = (known after apply)
TestPrivateZonalWithNetworking 2024-10-17T20:54:28Z command.go:185: 
TestPrivateZonalWithNetworking 2024-10-17T20:54:28Z command.go:185:               + gcfs_config (known after apply)
TestPrivateZonalWithNetworking 2024-10-17T20:54:28Z command.go:185:             }
TestPrivateZonalWithNetworking 2024-10-17T20:54:28Z command.go:185:         }

However, the default-node-pool is showing true?

    private_zonal_with_networking_test.go:81: 
            Error Trace:    /workspace/test/integration/private_zonal_with_networking/private_zonal_with_networking_test.go:81
                                        /builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/blueprint-test@v0.16.2/pkg/tft/terraform.go:638
                                        /builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/blueprint-test@v0.16.2/pkg/tft/terraform.go:670
                                        /builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/blueprint-test@v0.16.2/pkg/utils/stages.go:31
                                        /builder/home/go/pkg/mod/github.com/!google!cloud!platform/cloud-foundation-toolkit/infra/blueprint-test@v0.16.2/pkg/tft/terraform.go:670
            Error:          Not equal: 
                            expected: map[string]interface {}{"diskSizeGb":100, "diskType":"pd-standard", "effectiveCgroupMode":"EFFECTIVE_CGROUP_MODE_V2", "imageType":"COS_CONTAINERD", "labels":map[string]interface {}{"cluster_name":"CLUSTER_NAME", "node_pool":"default-node-pool"}, "loggingConfig":map[string]interface {}{"variantConfig":map[string]interface {}{"variant":"DEFAULT"}}, "machineType":"e2-medium", "metadata":map[string]interface {}{"cluster_name":"CLUSTER_NAME", "disable-legacy-endpoints":"true", "node_pool":"default-node-pool"}, "oauthScopes":[]interface {}{"https://www.googleapis.com/auth/cloud-platform"}, "serviceAccount":"SERVICE_ACCOUNT", "shieldedInstanceConfig":map[string]interface {}{"enableIntegrityMonitoring":true}, "tags":[]interface {}{"gke-CLUSTER_NAME", "gke-CLUSTER_NAME-default-node-pool"}, "windowsNodeConfig":map[string]interface {}{}, "workloadMetadataConfig":map[string]interface {}{"mode":"GKE_METADATA"}}
                            actual  : map[string]interface {}{"diskSizeGb":100, "diskType":"pd-standard", "effectiveCgroupMode":"EFFECTIVE_CGROUP_MODE_V2", "imageType":"COS_CONTAINERD", "kubeletConfig":map[string]interface {}{"insecureKubeletReadonlyPortEnabled":true}, "labels":map[string]interface {}{"cluster_name":"CLUSTER_NAME", "node_pool":"default-node-pool"}, "loggingConfig":map[string]interface {}{"variantConfig":map[string]interface {}{"variant":"DEFAULT"}}, "machineType":"e2-medium", "metadata":map[string]interface {}{"cluster_name":"CLUSTER_NAME", "disable-legacy-endpoints":"true", "node_pool":"default-node-pool"}, "oauthScopes":[]interface {}{"https://www.googleapis.com/auth/cloud-platform"}, "serviceAccount":"SERVICE_ACCOUNT", "shieldedInstanceConfig":map[string]interface {}{"enableIntegrityMonitoring":true}, "tags":[]interface {}{"gke-CLUSTER_NAME", "gke-CLUSTER_NAME-default-node-pool"}, "windowsNodeConfig":map[string]interface {}{}, "workloadMetadataConfig":map[string]interface {}{"mode":"GKE_METADATA"}}

                            Diff:
                            --- Expected
                            +++ Actual
                            @@ -1,2 +1,2 @@
                            -(map[string]interface {}) (len=14) {
                            +(map[string]interface {}) (len=15) {
                              (string) (len=10) "diskSizeGb": (float64) 100,
                            @@ -5,2 +5,5 @@
                              (string) (len=9) "imageType": (string) (len=14) "COS_CONTAINERD",
                            + (string) (len=13) "kubeletConfig": (map[string]interface {}) (len=1) {
                            +  (string) (len=34) "insecureKubeletReadonlyPortEnabled": (bool) true
                            + },
                              (string) (len=6) "labels": (map[string]interface {}) (len=2) {
            Test:           TestPrivateZonalWithNetworking

wyardley commented 1 month ago

It's because, at the API level, it's a bool, and this is the API response vs. the Terraform state

so I think this is expected, if unintuitive. I can update that value shortly.
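
To illustrate the point (an abridged aside, not output from this run): the same field looks like this in the Terraform plan/state versus the API response the test inspects.

```hcl
# Terraform plan/state holds the provider-side enum string:
#   insecure_kubelet_readonly_port_enabled = "FALSE"
# while the GKE API reports the same field as a JSON boolean:
#   "kubeletConfig": { "insecureKubeletReadonlyPortEnabled": false }
```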

apeabody commented 1 month ago

It's because, at the API level, it's a bool, and this is the API response vs. the Terraform state

so I think this is expected, if unintuitive. I can update that value shortly.

Hey @wyardley! My concern isn't the string vs. bool types; it's that we set the cluster default to false, yet the default node pool is showing true. Perhaps there is something special about the default node pool.

wyardley commented 1 month ago

Let me look. If we only set node_pool_defaults, I think we'll probably also need to set it in node_config for the default pool.

apeabody commented 1 month ago

Let me look. If we only set node_pool_defaults, I think we'll probably also need to set it in node_config for the default pool.

Yeah, that is my guess as well.
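
A sketch of that guess (resource shape only; the name, location, and values are illustrative, not the module's actual code): node_pool_defaults supplies a cluster-wide default, but the default node pool reads its settings from the cluster's own node_config block, so the value would have to appear in both places.

```hcl
resource "google_container_cluster" "example" {
  name     = "example-cluster"
  location = "us-central1"

  # Cluster-wide default for node pools.
  node_pool_defaults {
    node_config_defaults {
      insecure_kubelet_readonly_port_enabled = "FALSE"
    }
  }

  # The default node pool takes its configuration from node_config,
  # so the setting has to be repeated here for it to apply there.
  node_config {
    kubelet_config {
      insecure_kubelet_readonly_port_enabled = "FALSE"
      cpu_manager_policy                     = "none" # required by the TPG 5.x schema
    }
  }
}
```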

wyardley commented 1 month ago

So a couple questions:

  1. For the default pool, should we allow setting this explicitly in var.node_pools[0] as well as taking the default cluster setting if not?

  2. In the case where remove_default_node_pool is set to false, do we currently have google_container_cluster.node_config.kubelet_config defined in the config at all? For sure we'll need to add it here, I think, but if we need to support the case where we have remove_default_node_pool set to false, we may also need to be able to define it in the top-level node_config? I may be missing something, but I couldn't find that the module is using anything in that section now.

Also, should I try to squeeze this in the same PR?

apeabody commented 1 month ago

Thanks for digging into this @wyardley!

So a couple questions:

  1. For the default pool, should we allow setting this explicitly in var.node_pools[0] as well as taking the default cluster setting if not?

I think that would be the ideal situation for the default node pool: do an explicit lookup; failing that, use the cluster default; failing that, set null (see the sketch at the end of this comment).

  2. In the case where remove_default_node_pool is set to false, do we currently have google_container_cluster.node_config.kubelet_config defined in the config at all? For sure we'll need to add it here, I think, but if we need to support the case where we have remove_default_node_pool set to false, we may also need to be able to define it in the top-level node_config? I may be missing something, but I couldn't find that the module is using anything in that section now.

Yeah, the best-practice recommendation is to remove the default node pool, but it's not required. Given that it's typically removed, I suspect it was an accidental oversight that these weren't added, so they should be updated to match.

Also, should I try to squeeze this in the same PR?

I'm comfortable if everything is in a single PR; however, I can certainly open a separate PR to add kubelet_config to the default node pool to keep the PRs more focused. Let me know your preference.
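
A sketch of that precedence chain (hedged; the local name is illustrative, and var.node_pools / the cluster-level variable are assumed inputs):

```hcl
locals {
  # Explicit per-pool value wins, then the cluster-level default, then
  # null so the provider's own default applies. try() absorbs the error
  # coalesce() raises when every argument is null.
  default_pool_kubelet_ro_port = try(coalesce(
    lookup(var.node_pools[0], "insecure_kubelet_readonly_port_enabled", null),
    var.insecure_kubelet_readonly_port_enabled,
  ), null)
}
```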

wyardley commented 1 month ago

I pushed up an update of what I have so far (only updated one of the tests).

I'm going to do what I probably should have done sooner, and actually run this locally, so I can try a few things a bit faster.

apeabody commented 1 month ago

/gcbrun

wyardley commented 1 month ago

Fixing one minor issue with that, and will push again.

however, I can certainly open a separate PR to add kubelet_config to the default node pool to keep the PRs more focused

Yes, I think that would be a good idea, especially since it may need to include some additional items, and you probably have a better idea of where exactly it should go.

apeabody commented 1 month ago

/gcbrun

wyardley commented 1 month ago

The argument "cpu_manager_policy" is required, but no definition was found.

~~@apeabody I can add a dumb fix for this in the short term, but this can probably be removed when support is dropped for TPG 5.x.~~
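
For the record, one shape such a short-term shim could take (hypothetical, and moot once #2147 landed): always supply cpu_manager_policy, which the TPG 5.x schema marks as required whenever kubelet_config is set.

```hcl
kubelet_config {
  # TPG 5.x requires cpu_manager_policy whenever kubelet_config is set;
  # an empty-string default leaves the GKE-side behavior unchanged.
  cpu_manager_policy                     = lookup(var.node_pools[0], "cpu_manager_policy", "")
  insecure_kubelet_readonly_port_enabled = local.kubelet_readonly_port_enabled # from the earlier sketch
}
```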

apeabody commented 1 month ago

Thanks @wyardley - I have https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/2147 to add the kubelet_config to the default-node-pool. That should resolve the missing cpu_manager_policy.

wyardley commented 1 month ago

I have a PR to add the kubelet_config to the default-node-pool. That should resolve the missing cpu_manager_policy.

Yeah, actually, that's probably the needed / correct fix here. Do you want to send that through first and I'll rebase off of it?

apeabody commented 1 month ago

I have a PR to add the kubelet_config to the default-node-pool. That should resolve the missing cpu_manager_policy.

Yeah, actually, that's probably the needed / correct fix here. Do you want to send that through first and I'll rebase off of it?

Yes, that is the plan; hopefully I'll get it merged in by tomorrow.

apeabody commented 1 month ago

I have a PR to add the kubelet_config to the default-node-pool. That should resolve the missing cpu_manager_policy.

Yeah, actually, that's probably the needed / correct fix here. Do you want to send that through first and I'll rebase off of it?

Yes, that is the plan; hopefully I'll get it merged in by tomorrow.

Hi @wyardley - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/2147 is merged, can you rebase this as needed? Thanks!

wyardley commented 1 month ago

Hi @wyardley - https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/pull/2147 is merged, can you rebase this as needed? Thanks!

Yup, updated now. Let me know if the logic looks correct, and if the adjusted title makes sense (I think you've already added it to node_config in #2147?).

Guessing we might still need to adjust one or more of the tests?

apeabody commented 1 month ago

/gcbrun

apeabody commented 1 month ago

FYI: Looks like the Windows node pool test has recently started timing out (tests are failing at around 59-61 minutes); we may need to adjust that test.