rancher / terraform-provider-rancher2

Terraform Rancher2 provider
https://www.terraform.io/docs/providers/rancher2/
Mozilla Public License 2.0

[BUG] kubelet-args are not supported via TF #1074

Closed boris-stojnev closed 1 year ago

boris-stojnev commented 1 year ago

Describe the bug

Using kubelet-arg via the Terraform rancher2 provider results in unexpected behavior: the cluster does not start. This bug is described in https://github.com/rancher/rancher/issues/38112. I think this is specific to the Terraform provider, not Rancher itself.

kubelet-arg, which is part of config under machine_selector_config, should be a list type, not a string. In other words, config should be able to accept list values.
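For illustration, a configuration of roughly this shape (values are hypothetical) is what fails today: config only accepts flat string values, so the kubelet arguments cannot be expressed as the list RKE2 expects.

machine_selector_config {
  config = {
    # only a single string can be supplied here today,
    # not the list of kubelet arguments that RKE2 expects
    kubelet-arg = "protect-kernel-defaults=true"
  }
}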

a-blender commented 1 year ago

@boris-stojnev I see the associated issue https://github.com/rancher/rancher/issues/38112 has been closed with the workaround of setting kubelet_arg under machine_global_config in your Terraform config, which then causes the args to be pushed down to the nodes successfully. For v2 prov clusters, machine_global_config is a string that must be provided in YAML format - here's an example.

machine_global_config = <<EOF
  config:
    cloud-provider-name: "external"
    kubelet-arg: "key1=value1,key2=value2"
EOF

Does this work for you?

a-blender commented 1 year ago

@boris-stojnev Any update on this issue?

boris-stojnev commented 1 year ago

@a-blender Sorry for the late reply.

My workaround was placing the kubelet-arg in config.yaml via Ansible, along with the other additional configuration I manage there.

As stated in the issue https://github.com/rancher/rancher/issues/38112, in that case the changes will not be present in the Rancher UI. There is a separate section for kubelet args in the Rancher UI cluster configuration.

What you suggest will definitely not work. Maybe this one will, but I didn't test it and don't know if there are side effects:

machine_global_config = <<EOF
cloud-provider-name: "external"
kubelet-arg: 
- key1=value1
- key2=value2
EOF

As I said, it should be fixed as I described, so that the TF provider follows the Rancher UI configuration options. It's misleading to check the UI and see that kubelet-args are not defined when they actually are.

After a proper fix, we should be able to use kubelet-arg in machine_selector_config as in the example below, and it will be shown in the Rancher UI. This would then be the proper config:

machine_selector_config {
  config = {
    cloud-provider-name = "external"
    kubelet-arg = {
      key1 = "value1"
      key2 = "value2"
    }
  }
}

a-blender commented 1 year ago

@boris-stojnev Thank you for the additional details. After investigating more: [7/26 correction] machine_global_config is for rendering rke2 arguments into /etc/rancher/rke2/config.yaml.d/50-rancher.yaml on all nodes; machine_selector_config renders arguments into the same file, but only for machines matching a machine label selector. If an arg is set via machine selector config without specifying a label, it will be applied to all nodes. Most customers at this point use machine global config for that use case, though.

image
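For concreteness, a minimal sketch of the two mechanisms (the label key and value are hypothetical, and the YAML follows the top-level form used later in this thread): machine_global_config is rendered for every node, while a machine_selector_config carrying a machine_label_selector only reaches matching machines.

machine_global_config = <<EOF
cloud-provider-name: "external"
EOF

machine_selector_config {
  config = {
    cloud-provider-name = "external"
  }
  machine_label_selector {
    # hypothetical label; only machines carrying it receive this config
    match_labels = {
      node-group = "worker"
    }
  }
}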

It makes sense that kubelet-arg under machine_global_config is not exposed in the UI, but you are right that it should be.

In TF, machine_selector_config is a complex type and its subfield config is a Type.Map https://github.com/rancher/terraform-provider-rancher2/blob/5a952a87c3cf479694a8ca4cefa0d510d8d2b4d2/rancher2/schema_cluster_v2_rke_config_system_config.go#L65-L69, which makes kubelet-arg a string, so I don't see how the example I provided will not work. It fits the schema. Did you try with machine_selector_config and check whether the args got passed down to the nodes and were present in the UI? Let me try it on my end.

boris-stojnev commented 1 year ago

@a-blender Regarding your example: first of all, it's in YAML format, which is not supported by machine_selector_config, and second, if I remember correctly the values of kubelet-arg will be misinterpreted, as in the issue we linked earlier.

An additional note regarding what you showed in the picture: it can't be accomplished via TF. cloud-provider-name=external from my example will not be shown in the UI, because it's not under kubelet-arg, so it will be treated as top-level config. More info can be found here: https://ranchermanager.docs.rancher.com/reference-guides/cluster-configuration/rancher-server-configuration/rke2-cluster-configuration#machineselectorconfig

I’m using machine_selector_config as suggested here https://ranchermanager.docs.rancher.com/v2.6/reference-guides/rancher-security/rancher-v2.6-hardening-guides/rke2-hardening-guide-with-cis-v1.6-benchmark for setting other configs without labels, and that works.

On the other hand, the Rancher UI only shows the machine selector for kubelet args, and that is what's causing the confusion here. And additional kubelet args for any machine can't be set via TF in a way that shows up in the UI.

To sum up, you can’t have kubelet arg in machine_selector_config.

a-blender commented 1 year ago

@boris-stojnev Sorry, I was asking about machine_global_config earlier, which supports YAML; I made a spelling error in the comment. Since investigating, I don't think using machine_global_config is an ideal workaround because it deviates from the options exposed in the Rancher UI.

I tried with the following config

machine_selector_config {
  config = {
    kubelet-arg = "cloud-provider=external"
  }
}

with TF 1.25 and reproduced https://github.com/rancher/rancher/issues/38112

Setting kubelet-arg = "--protect-kernel-defaults" causes Rancher to treat every character as an array element, which makes it look like this in Rancher:

machineSelectorConfig:
  - config:
      kubelet-arg: ['-', '-', 'p', 'r', 'o', 't', 'e', 'c', 't', '-', 'k', 'e', 'r', 'n', 'e', 'l', '-', 'd', 'e', 'f', 'a', 'u', 'l', 't', 's']

causing the UI to look like this

image

TF needs to have parity with Rancher, so it needs to support passing multiple kubelet-arg values to machine_selector_config, but from your comment https://github.com/rancher/terraform-provider-rancher2/issues/1074#issuecomment-1648412146 and what I see, this doesn't appear possible without a fix. This is a confirmed bug. I think updating machine_selector_config.Config to a Type.List should solve this, but I will have to test it. We may also need to add state migration logic, or users with clusters already provisioned with an earlier version of TF may see them break.

boris-stojnev commented 1 year ago

@a-blender Now, while we speak, something crossed my mind that I didn't test, and maybe it's worth confirming: defining labels. The example would be something like the one below, with every argument in a separate machine_selector_config. Maybe that way the value of kubelet-arg wouldn't be treated as an array of characters.

machine_selector_config {
  config = {
    kubelet-arg = "cloud-provider-name=external"
  }
  machine_label_selector {
    match_expressions {
      key      = "example-key"
      operator = "In"
      values   = ["example-value1", "example-value2"]
    }
    match_labels = {
      key1 = "value1"
      key2 = "value2"
    }
  }
}
machine_selector_config {
  config = {
    kubelet-arg = "config-key=config-value"
  }
  machine_label_selector {
    match_expressions {
      key      = "example-key"
      operator = "In"
      values   = ["example-value1", "example-value2"]
    }
    match_labels = {
      key1 = "value1"
      key2 = "value2"
    }
  }
}

But wait, as I can see from the UI, you can have multiple args under the same label selector. Ah, then it probably wouldn't do the trick.

a-blender commented 1 year ago

@boris-stojnev Yeah, a single arg set under kubelet-arg inside one machine_selector_config block doesn't translate properly, so several of the same won't either.

a-blender commented 1 year ago

QA Test Template

Problem

Machine Selector Config kubelet-arg values could not be passed via Terraform to downstream machines and would not show up in the rancher UI.

Solution

My solution is to convert machine_selector_config.config from a Type.Map (string map) to a Type.String that supports YAML, like Machine Global Config, plus state migration logic to handle the schema update. This allows users to input both strings and lists, so the subfield kubelet-arg can be passed as a list, which is what Rancher expects. This feature now works with the following configuration:

machine_selector_config {
  config = <<EOF
    kubelet-arg:
      - key1=value1
      - key2=value2
  EOF
}

Example

machine_selector_config {
  config = <<EOF
    kubelet-arg:
      - protect-kernel-defaults=true
      - cloud-provider=external
EOF
}

See more details here.

Testing

Terraform RC: 3.2.0-rc4

Engineering Testing

Manual Testing

Test plan

Automated Testing

Run go test -v ./rancher2 to make sure all automated tests pass.

QA Testing Considerations

Regressions Considerations

N/A - fix was designed to avoid Machine Selector Config regression. That being said, users can only configure one Machine Selector config via TF whereas in the rancher backend you can configure multiple of the same field. No customers are asking for this, but just to note.

a-blender commented 1 year ago

Update to Regressions Considerations: if a user wants to configure multiple Machine Selector Configs to assign kubelet args to specific cluster nodes based on node labels, as is supported in Rancher, they can define that in a TF config by repeating the same pattern in separate blocks:

machine_selector_config {
  config = <<EOF
    kubelet-arg:
      - key1=value1
      - key2=value2
  EOF
}
machine_selector_config {
  config = <<EOF
    kubelet-arg:
      - key1=value1
      - key2=value2
  EOF
}

This will show up in Rancher as

image image

Add machine selector labels to each config as needed.
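For example, one such block with a selector attached might look like the following sketch (label key and value are hypothetical), combining the config pattern above with the machine_label_selector syntax discussed earlier in this thread:

machine_selector_config {
  config = <<EOF
    kubelet-arg:
      - key1=value1
      - key2=value2
  EOF
  machine_label_selector {
    # hypothetical label; only machines carrying it receive these kubelet args
    match_labels = {
      node-group = "worker"
    }
  }
}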

a-blender commented 1 year ago

Please wait until 3.2.0-rc3 to test, thank you.

slickwarren commented 1 year ago

todo: create TFP automation for kubelet-args

slickwarren commented 1 year ago

moving back to waiting for RC based on this comment: https://github.com/rancher/terraform-provider-rancher2/issues/1074#issuecomment-1712155225 as only rc2 is available.

a-blender commented 1 year ago

@slickwarren Jacob already cut rc3 https://github.com/rancher/terraform-provider-rancher2/releases/tag/v3.2.0-rc3 but assets are not finished generating. Check back shortly!

Josh-Diamond commented 1 year ago

Ticket rancher/dashboard#1074 - Test Results - ❌ - RE-OPENED

Verified on Rancher v2.8-0ff5fe88aa87c0383b7487b975ee8929df674185-head:

Scenario / Test Case (Result):
1. Provision a downstream rke2 cluster with Machine Selector Config and 2 kubelet args set
2. Update: Add/remove a kubelet-arg via tf (pending)
3. Provision a downstream rke2 cluster with tf 3.1.0 => add machine selector config with 2 kubelet args via the rancher ui => Upgrade tf to v3.2.0-rc3 (pending)

Scenario 1 -

  1. Fresh install of Rancher v2.8-head
  2. Using tfp-rancher2 v3.2.0-rc3, provision a downstream RKE2 AWS Node driver cluster, using a machine_selector_config block and defining 2 kubelet arguments [I used the main.tf shown below]
    
terraform {
  required_providers {
    rancher2 = {
      source  = "terraform.local/local/rancher2"
      version = "3.2.0-rc3"
    }
  }
}

provider "rancher2" { api_url = "" token_key = "" insecure = true }

resource "rancher2_cloud_credential" "rancher2_cloud_credential" { name = "tf-creds-rke2" amazonec2_credential_config { access_key = "" secret_key = "" } }

resource "rancher2_machine_config_v2" "rancher2_machine_config_v2" { generate_name = "tf-rke2" amazonec2_config { ami = "" region = "" security_group = [""] subnet_id = "" vpc_id = "" zone = "" } }

resource "rancher2_cluster_v2" "rancher2_cluster_v2" { name = "jkeslarrr3" kubernetes_version = "v1.27.6+rke2r1" enable_network_policy = false default_cluster_role_for_project_members = "user" rke_config { machine_pools { name = "pool1" cloud_credential_secret_name = rancher2_cloud_credential.rancher2_cloud_credential.id control_plane_role = false etcd_role = true worker_role = false quantity = 1 machine_config { kind = rancher2_machine_config_v2.rancher2_machine_config_v2.kind name = rancher2_machine_config_v2.rancher2_machine_config_v2.name } } machine_pools { name = "pool2" cloud_credential_secret_name = rancher2_cloud_credential.rancher2_cloud_credential.id control_plane_role = true etcd_role = false worker_role = false quantity = 1 machine_config { kind = rancher2_machine_config_v2.rancher2_machine_config_v2.kind name = rancher2_machine_config_v2.rancher2_machine_config_v2.name } } machine_pools { name = "pool3" cloud_credential_secret_name = rancher2_cloud_credential.rancher2_cloud_credential.id control_plane_role = false etcd_role = false worker_role = true quantity = 1 machine_config { kind = rancher2_machine_config_v2.rancher2_machine_config_v2.kind name = rancher2_machine_config_v2.rancher2_machine_config_v2.name } } machine_selector_config { config = <<EOF kubelet-arg:

Additional Context:

When removing the machine_selector_config block, which defined 2 kubelet arguments, from the main.tf shown above, tfp-rancher2 v3.2.0-rc3 was successful in spinning up the downstream cluster.

a-blender commented 1 year ago

@Josh-Diamond There's some confusion about how/which arguments to pass via TF to the kubelet for a working v2 cluster. Here's a working example. I updated the Test Template.

// Working example

machine_selector_config {
  config = <<EOF
    kubelet-arg:
      - protect-kernel-defaults=true
      - cloud-provider=external
EOF
}

I also missed a backport to release/v3. After I get that in and cut a new RC, please re-test this on v3.2.0-rc4.

Josh-Diamond commented 1 year ago

Ticket rancher/dashboard#1074 - Test Results - ✅

Verified on Rancher v2.7.8-rc1:

Scenario / Test Case (Result):
1. Provision a downstream rke2 cluster with Machine Selector Config and 2 kubelet args set
2. Update: Add/remove a kubelet-arg via tf
3. Provision a downstream rke2 cluster with tf 3.1.0 => Upgrade tf to v3.2.0-rc3 and add machine selector config with 2 kubelet args => update/modify kubelet args once more and verify they are successfully accepted + functional (pending/blocked)

Scenario 1 - ✅

  1. Fresh install of Rancher v2.7.8-rc1
  2. Using tfp-rancher2 v3.2.0-rc4, provision a downstream RKE2 AWS Node driver cluster, using a machine_selector_config block and defining 2 kubelet arguments [I used the main.tf shown below]
    
terraform {
  required_providers {
    rancher2 = {
      source  = "terraform.local/local/rancher2"
      version = "3.2.0-rc3"
    }
  }
}

provider "rancher2" { api_url = "" token_key = "" insecure = true }

resource "rancher2_cloud_credential" "rancher2_cloud_credential" { name = "tf-creds-rke2" amazonec2_credential_config { access_key = "" secret_key = "" } }

resource "rancher2_machine_config_v2" "rancher2_machine_config_v2" { generate_name = "tf-rke2" amazonec2_config { ami = "" region = "" security_group = [""] subnet_id = "" vpc_id = "" zone = "" } }

resource "rancher2_cluster_v2" "rancher2_cluster_v2" { name = "jkeslar" kubernetes_version = "v1.26.8+rke2r1" enable_network_policy = false default_cluster_role_for_project_members = "user" rke_config { machine_selector_config { config = <<EOF kubelet-arg:


Scenario 2 - ✅

  1. Resuming where Scenario 1 left off, update max-pod limit to 255, using tfp-rancher2 v3.2.0-rc4 and re-run terraform apply
  2. Verified - max-pod kubelet arg successfully updated; as expected
  3. Using tfp-rancher2 v3.2.0-rc4, remove and delete kubelet args
  4. Verified - kubelet args successfully removed; as expected
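For reference, the update in step 1 amounts to editing the heredoc and re-running terraform apply; a sketch, assuming max-pods is the kubelet flag being exercised alongside the args from the working example above:

machine_selector_config {
  config = <<EOF
    kubelet-arg:
      - protect-kernel-defaults=true
      - cloud-provider=external
      - max-pods=255
  EOF
}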

Scenario 3 - ✅

  1. Fresh install of Rancher v2.7.8
  2. Using tfp-rancher2 v3.1.0, provision a downstream RKE2 AWS Node driver cluster
  3. Once active, update tfp-rancher2 to v3.2.0-rc5
  4. Using tfp-rancher2 v3.2.0-rc5, define a machine_selector_config block and set multiple kubelet-args under config
  5. Verified - cluster successfully and accurately updates w/ kubelet args; verified via cluster.yml

Josh-Diamond commented 1 year ago

Blocked by https://github.com/rancher/terraform-provider-rancher2/issues/1243#issuecomment-1753486076

Josh-Diamond commented 1 year ago

no longer blocked by https://github.com/rancher/terraform-provider-rancher2/issues/1243

https://github.com/rancher/terraform-provider-rancher2/issues/1243 has been identified as Rancher UI specific, and is not caused by or related to tfp-rancher2. Although this issue was encountered in my testing, it is purely a UI symptom, unrelated to tfp-rancher2.

Resuming testing now...

Josh-Diamond commented 1 year ago

The above test results have been updated and completed. Closing out this issue now.