[BUG] Occasionally RKE2 cluster gets destroyed after cluster configuration is changed using terraform provider

riuvshyn commented 1 year ago

Important: Please see https://github.com/rancher/terraform-provider-rancher2/issues/993#issuecomment-1611922983 on the status of this issue following completed investigations.

Rancher Server Setup

Rancher version: 2.6.8
Installation option (Docker install/Helm Chart):
- installed as helm chart
- running on k3s 1.24.4
Proxy/Cert Details: N/A

Information about the Cluster

Kubernetes version: 1.23.9
Cluster Type Downstream:
- Custom RKE2 v1.23.9+rke2r1 running on AWS
- Provisioned with Terraform provider rancher/rancher2 version 1.24.1

cluster configuration:

``` { "kubernetesVersion": "v1.23.9+rke2r1", "rkeConfig": { "upgradeStrategy": { "controlPlaneConcurrency": "1", "controlPlaneDrainOptions": { "enabled": false, "force": false, "ignoreDaemonSets": true, "IgnoreErrors": false, "deleteEmptyDirData": true, "disableEviction": false, "gracePeriod": 0, "timeout": 10800, "skipWaitForDeleteTimeoutSeconds": 600, "preDrainHooks": null, "postDrainHooks": null }, "workerConcurrency": "10%", "workerDrainOptions": { "enabled": false, "force": false, "ignoreDaemonSets": true, "IgnoreErrors": false, "deleteEmptyDirData": true, "disableEviction": false, "gracePeriod": 0, "timeout": 10800, "skipWaitForDeleteTimeoutSeconds": 600, "preDrainHooks": null, "postDrainHooks": null } }, "chartValues": null, "machineGlobalConfig": { "cloud-provider-name": "aws", "cluster-cidr": "100.64.0.0/13", "cluster-dns": "100.64.0.10", "cluster-domain": "cluster.local", "cni": "none", "disable": [ "rke2-ingress-nginx", "rke2-metrics-server", "rke2-canal" ], "disable-cloud-controller": false, "kube-apiserver-arg": [ "allow-privileged=true", "anonymous-auth=false", "feature-gates=CustomCPUCFSQuotaPeriod=true", "api-audiences=https://-oidc.s3.eu-central-1.amazonaws.com,https://kubernetes.default.svc.cluster.local,rke2", "audit-log-maxage=90", "audit-log-maxbackup=10", "audit-log-maxsize=500", "audit-log-path=/var/log/k8s-audit/audit.log", "audit-policy-file=/etc/kubernetes-/audit-policy.yaml", "authorization-mode=Node,RBAC", "bind-address=0.0.0.0", "enable-admission-plugins=PodSecurityPolicy,NodeRestriction", "event-ttl=1h", "kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP", "profiling=false", "request-timeout=60s", "runtime-config=api/all=true", "service-account-key-file=/etc/kubernetes-wise/service-account.pub", "service-account-lookup=true", "service-account-issuer=https://-oidc.s3.eu-central-1.amazonaws.com", "service-account-signing-key-file=/etc/kubernetes-wise/service-account.key", "service-node-port-range=30000-32767", 
"shutdown-delay-duration=60s", "tls-min-version=VersionTLS12", "tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "tls-min-version=VersionTLS12", "v=2" ], "kube-apiserver-extra-mount": [ "/etc/kubernetes-wise:/etc/kubernetes-wise:ro", "/var/log/k8s-audit:/var/log/k8s-audit:rw" ], "kube-controller-manager-arg": [ "allocate-node-cidrs=true", "attach-detach-reconcile-sync-period=1m0s", "bind-address=0.0.0.0", "configure-cloud-routes=false", "feature-gates=CustomCPUCFSQuotaPeriod=true", "leader-elect=true", "node-monitor-grace-period=2m", "pod-eviction-timeout=220s", "profiling=false", "service-account-private-key-file=/etc/kubernetes-wise/service-account.key", "use-service-account-credentials=true", "terminated-pod-gc-threshold=12500", "tls-min-version=VersionTLS12", "tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "tls-min-version=VersionTLS12" ], "kube-controller-manager-extra-mount": [ "/etc/kubernetes-wise:/etc/kubernetes-wise:ro" ], "kube-proxy-arg": [ "conntrack-max-per-core=131072", "conntrack-tcp-timeout-close-wait=0s", "metrics-bind-address=0.0.0.0", "proxy-mode=iptables" ], "kube-scheduler-arg": [ "bind-address=0.0.0.0", "port=0", "secure-port=10259", "profiling=false", "leader-elect=true", "tls-min-version=VersionTLS12", "v=2", "tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "tls-min-version=VersionTLS12" ], "kubelet-arg": [ "network-plugin=cni", 
"cni-bin-dir=/opt/cni/bin/", "cni-conf-dir=/etc/cni/net.d/", "feature-gates=CustomCPUCFSQuotaPeriod=true", "config=/etc/kubernetes-wise/kubelet.yaml", "exit-on-lock-contention=true", "lock-file=/var/run/lock/kubelet.lock", "pod-infra-container-image=docker-k8s-gcr-io./pause:3.1", "register-node=true", "tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "tls-min-version=VersionTLS12", "v=4" ], "profile": "cis-1.6", "protect-kernel-defaults": true, "service-cidr": "100.64.0.0/13" }, "additionalManifest": "", "registries": { "mirrors": { "docker.io": { "endpoint": [ "" ] }, "gcr.io": { "endpoint": [ "" ] }, "k8s.gcr.io": { "endpoint": [ "" ] }, "quay.io": { "endpoint": [ "" ] } }, "configs": { "": {} } }, "etcd": { "snapshotScheduleCron": "0 */6 * * *", "snapshotRetention": 12, "s3": { "endpoint": "s3.eu-central-1.amazonaws.com", "bucket": "etcd-backups-", "region": "eu-central-1", "folder": "etcd" } } }, "localClusterAuthEndpoint": {}, "defaultClusterRoleForProjectMembers": "user", "enableNetworkPolicy": false } ```

Additional info:

I am using custom CNI: aws-vpc-cni installed via additional_manifest

Describe the bug

Occasionally, a simple cluster configuration change (for example, changing labels in manifests passed via additional_manifest) applied with the Terraform provider causes the managed RKE2 cluster to be destroyed.

terraform plan looks similar to this:

Terraform Plan output

```
Terraform will perform the following actions:

  # module.cluster.rancher2_cluster_v2.this will be updated in-place
  ~ resource "rancher2_cluster_v2" "this" {
        id   = "fleet-default/o11y-euc1-se-main01"
        name = "o11y-euc1-se-main01"
        # (10 unchanged attributes hidden)

      ~ rke_config {
          ~ additional_manifest = <<-EOT
                ---
                apiVersion: v1
                kind: Namespace
                metadata:
                  labels:
              -     test: test
              +     test1: test1
                  name: my-namespace
            EOT
        }
    }
```

Sometimes, once a change like this is applied, Rancher immediately tries to delete the managed cluster for some reason. In the UI it looks like this:

Rancher UI screenshot:

![image](https://user-images.githubusercontent.com/53786845/188906317-5724573f-f86d-4967-8f00-13adb5c51c3e.png)

Rancher logs:

``` 2022/09/08 08:58:30 [DEBUG] [planner] rkecluster fleet-default/: unlocking 810235e7-ecc0-4ba7-81c8-55d778594926 2022/09/08 08:58:30 [INFO] [planner] rkecluster fleet-default/: waiting: configuring bootstrap node(s) custom-7808e68fb38f: waiting for plan to be applied 2022/09/08 08:58:30 [DEBUG] [CAPI] Cannot retrieve CRD with metadata only client, falling back to slower listing 2022/09/08 08:58:30 [DEBUG] DesiredSet - Patch rbac.authorization.k8s.io/v1, Kind=Role fleet-default/crt--nodes-manage for auth-prov-v2-roletemplate- nodes-manage -- [PATCH:{"metadata":{"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRTY/bIBD9K9UcK5Oa4NjYUk899FCph9WqlyqHAYYNXQwW4LTSKv+9IrtVrK36cYMH8+Z9PMFMBQ0WhOkJMIRYsLgYcr1G9Y10yVR2ycWdxlI87Vx85wxMgGs5sSXFMzvvWYqeCs2Lx0JMG0ar5mypx5ZD80ei+D1QYg/nR5hgxoAPNFMomw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQS94vuDGMS+l0O3Jz0oGgY96rFFtKq1qA2146Gv214U61Rep8deudjqtJ6oMEMWV1+qw+rkjiwlCpoyTF+fABf3hVJ2McAEtS5Xzy48bFOuHT26UGv94NdcKMFN029trtf+ueJCKdsx3fKedQcpmLIG2SDwICxHYcjC5XhpIK3+JuZjiutSb6CfN+1+sEeZdy7CsYFEOa5J0+fq8vppzSXOjFNr0XDkUiA0v1CSqDk/jGPf9zdUKKV6MnLUvb2hfTsMrUBOYlQ3tBOi10PX7Tu7YRhkK6mXVgm5YRiF4aPppDooudV61TmjPrlAuT6cKakr+BaOl+PlZwAAAP//1WFOc2MDAAA"}},"rules":[{"apiGroups":["cluster.x-k8s.io"],"resourceNames":["custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6","custom-7808e68fb38f","custom-93d19d48b5b8"],"resources":["machines"],"verbs":["*"]}]}, 
ORIGINAL:{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4yRT4/bLBDGv8qrOb4KqQk2xpZ66qGHSj2sql6qHAYYNnRtsACnlaJ894rsVo626p8bPDDP/GaeC8xU0GJBGC+AIcSCxceQ6zXqr2RKprJPPu4NljLR3sc33sIIuJYTW1I8s/OBpThRoXmZsBAzltFqOFvqseGw+61R/BYoscfzE4wwY8BHmimUuw9nsfvvgw/27UOc6NNLg78aBpwJRgjRUmbPvv9Ukxc0tRCuO5hQ0/THLZwwn2AEqfmhE704SNW3BzK9pr43gxkaRKcbh8ZSM3Symr6AmVReL4m9gr3HcRNRYZYcrlOpg1TgB3KUKBjKMH65AC7+M6XsY4ARaiq+nn14vF9mjeLJh5reu2nNhRJsTL+Ett5i5poLrV3LTMMlazslmHYWWS+wE46jsOTgerzuIK3TBvM+xXWpNzDPnfbf2ZPKex/huINEOa7J0Mc65e3TmkucWa8aRVI5LZSD3U91EJYPtlW602pTOTUOLUeuBG4qKTScd8MgpdxUobWWZNVg5J2vbPq+EchJDHpTWyGk6dv20Dp5z3rjnNGcfKBcH86U9E38H47X4/VHAAAA///VWUNFSgMAAA","objectset.rio.cattle.io/id":"auth-prov-v2-roletemplate-","objectset.rio.cattle.io/owner-gvk":"management.cattle.io/v3, Kind=RoleTemplate","objectset.rio.cattle.io/owner-name":"nodes-manage","objectset.rio.cattle.io/owner-namespace":""},"labels":{"objectset.rio.cattle.io/hash":"6b125373268742ec7be77c9c90aafb0facde0956"},"name":"crt--nodes-manage","namespace":"fleet-default","ownerReferences":[{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","name":"","uid":"1b13bbf4-c016-4583-bfda-73a53f1a3def"}]},"rules":[{"apiGroups":["cluster.x-k8s.io"],"resourceNames":["custom-7808e68fb38f","custom-93d19d48b5b8","custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6"],"resources":["machines"],"verbs":["*"]}]}, 
MODIFIED:{"kind":"Role","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"crt--nodes-manage","namespace":"fleet-default","creationTimestamp":null,"labels":{"objectset.rio.cattle.io/hash":"6b125373268742ec7be77c9c90aafb0facde0956"},"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRTY/bIBD9K9UcK5Oa4NjYUk899FCph9WqlyqHAYYNXQwW4LTSKv+9IrtVrK36cYMH8+Z9PMFMBQ0WhOkJMIRYsLgYcr1G9Y10yVR2ycWdxlI87Vx85wxMgGs5sSXFMzvvWYqeCs2Lx0JMG0ar5mypx5ZD80ei+D1QYg/nR5hgxoAPNFMomw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQS94vuDGMS+l0O3Jz0oGgY96rFFtKq1qA2146Gv214U61Rep8deudjqtJ6oMEMWV1+qw+rkjiwlCpoyTF+fABf3hVJ2McAEtS5Xzy48bFOuHT26UGv94NdcKMFN029trtf+ueJCKdsx3fKedQcpmLIG2SDwICxHYcjC5XhpIK3+JuZjiutSb6CfN+1+sEeZdy7CsYFEOa5J0+fq8vppzSXOjFNr0XDkUiA0v1CSqDk/jGPf9zdUKKV6MnLUvb2hfTsMrUBOYlQ3tBOi10PX7Tu7YRhkK6mXVgm5YRiF4aPppDooudV61TmjPrlAuT6cKakr+BaOl+PlZwAAAP//1WFOc2MDAAA","objectset.rio.cattle.io/id":"auth-prov-v2-roletemplate-","objectset.rio.cattle.io/owner-gvk":"management.cattle.io/v3, Kind=RoleTemplate","objectset.rio.cattle.io/owner-name":"nodes-manage","objectset.rio.cattle.io/owner-namespace":""},"ownerReferences":[{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","name":"","uid":"1b13bbf4-c016-4583-bfda-73a53f1a3def"}]},"rules":[{"verbs":["*"],"apiGroups":["cluster.x-k8s.io"],"resources":["machines"],"resourceNames":["custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6","custom-7808e68fb38f","custom-93d19d48b5b8"]}]}, 
CURRENT:{"kind":"Role","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"crt--nodes-manage","namespace":"fleet-default","uid":"2d3678e7-1904-442f-bfa6-ef4ad97baa40","resourceVersion":"32202831","creationTimestamp":"2022-09-08T07:56:23Z","labels":{"objectset.rio.cattle.io/hash":"6b125373268742ec7be77c9c90aafb0facde0956"},"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRS48UIRD+K6aOphmboYd+JJ48eDDxsNl4MXMooNjBpaED9Giymf9umF3TkzU+bvBBffU9nmCmggYLwvQEGEIsWFwMuV6j+ka6ZCq75OJOYymedi6+cwYmwLWc2JLimZ33LEVPhebFYyGmDaNVc7bUY8uh+SNR/B4osYfzI0wwY8AHmimUmw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQRS8f1B9GIvh77bk+4V9b0e9dgiWtVa1Iba8SDrthfFOpXX6bFXLm51Wk9UmCGLqy/VYXVyR5YSBU0Zpq9PgIv7Qim7GGCCWperZxceblOuHT26UGv94NdcKMGm6bc212v/XHGhlO2Ybrlk3WEQTFmDrBd4EJajMGThcrw0kFa/ifmY4rrUG+jnTbsf7HHIOxfh2ECiHNek6XN1ef205hJn1g/tQHKwSgwWml/oKAwfTTeogxo2lFNr0XDkg8ANpQE154dxlFJuqFBKSTLDqOUNr2z7vhXISYxqQzshpO67bt9Zeav1qnNGfXKBcn04U1JX8C0cL8fLzwAAAP//skxNxGMDAAA","objectset.rio.cattle.io/id":"auth-prov-v2-roletemplate-","objectset.rio.cattle.io/owner-gvk":"management.cattle.io/v3, 
Kind=RoleTemplate","objectset.rio.cattle.io/owner-name":"nodes-manage","objectset.rio.cattle.io/owner-namespace":""},"ownerReferences":[{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","name":"","uid":"1b13bbf4-c016-4583-bfda-73a53f1a3def"}],"managedFields":[{"manager":"rancher","operation":"Update","apiVersion":"rbac.authorization.k8s.io/v1","time":"2022-09-08T07:58:00Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:objectset.rio.cattle.io/applied":{},"f:objectset.rio.cattle.io/id":{},"f:objectset.rio.cattle.io/owner-gvk":{},"f:objectset.rio.cattle.io/owner-name":{},"f:objectset.rio.cattle.io/owner-namespace":{}},"f:labels":{".":{},"f:objectset.rio.cattle.io/hash":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"1b13bbf4-c016-4583-bfda-73a53f1a3def\"}":{}}},"f:rules":{}}}]},"rules":[{"verbs":["*"],"apiGroups":["cluster.x-k8s.io"],"resources":["machines"],"resourceNames":["custom-7808e68fb38f","custom-93d19d48b5b8","custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6"]}]}] 2022/09/08 08:58:30 [DEBUG] DesiredSet - Updated rbac.authorization.k8s.io/v1, Kind=Role fleet-default/crt--nodes-manage for auth-prov-v2-roletemplate- nodes-manage -- application/strategic-merge-patch+json 
{"metadata":{"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRTY/bIBD9K9UcK5Oa4NjYUk899FCph9WqlyqHAYYNXQwW4LTSKv+9IrtVrK36cYMH8+Z9PMFMBQ0WhOkJMIRYsLgYcr1G9Y10yVR2ycWdxlI87Vx85wxMgGs5sSXFMzvvWYqeCs2Lx0JMG0ar5mypx5ZD80ei+D1QYg/nR5hgxoAPNFMomw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQS94vuDGMS+l0O3Jz0oGgY96rFFtKq1qA2146Gv214U61Rep8deudjqtJ6oMEMWV1+qw+rkjiwlCpoyTF+fABf3hVJ2McAEtS5Xzy48bFOuHT26UGv94NdcKMFN029trtf+ueJCKdsx3fKedQcpmLIG2SDwICxHYcjC5XhpIK3+JuZjiutSb6CfN+1+sEeZdy7CsYFEOa5J0+fq8vppzSXOjFNr0XDkUiA0v1CSqDk/jGPf9zdUKKV6MnLUvb2hfTsMrUBOYlQ3tBOi10PX7Tu7YRhkK6mXVgm5YRiF4aPppDooudV61TmjPrlAuT6cKakr+BaOl+PlZwAAAP//1WFOc2MDAAA"}},"rules":[{"apiGroups":["cluster.x-k8s.io"],"resourceNames":["custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6","custom-7808e68fb38f","custom-93d19d48b5b8"],"resources":["machines"],"verbs":["*"]}]} 2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) /v1, Kind=ServiceAccount fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f 2022/09/08 08:58:30 [DEBUG] [plansecret] reconciling secret fleet-default/custom-7808e68fb38f-machine-plan 2022/09/08 08:58:30 [DEBUG] [plansecret] fleet-default/custom-7808e68fb38f-machine-plan: rv: 32202835: Reconciling machine PlanApplied condition to nil 2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) /v1, Kind=Secret fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f 2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) rbac.authorization.k8s.io/v1, Kind=Role fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f 2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) rbac.authorization.k8s.io/v1, Kind=RoleBinding fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f 2022/09/08 08:58:30 [DEBUG] [CAPI] Reconciling 2022/09/08 
08:58:30 [DEBUG] [CAPI] Cluster still exists 2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete cluster.x-k8s.io/v1beta1, Kind=Cluster fleet-default/ for rke-cluster fleet-default/ 2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete rke.cattle.io/v1, Kind=RKEControlPlane fleet-default/ for rke-cluster fleet-default/ 2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete rke.cattle.io/v1, Kind=RKECluster fleet-default/ for rke-cluster fleet-default/ 2022/09/08 08:58:30 [DEBUG] [rkecontrolplane] (fleet-default/) Peforming removal of rkecontrolplane 2022/09/08 08:58:30 [DEBUG] [rkecontrolplane] (fleet-default/) listed 3 machines during removal 2022/09/08 08:58:30 [DEBUG] [UnmanagedMachine] Removing machine fleet-default/custom-607703a1e39b in cluster 2022/09/08 08:58:30 [DEBUG] [UnmanagedMachine] Safe removal for machine fleet-default/custom-607703a1e39b in cluster not necessary as it is not an etcd node ```
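The deletion is visible in the excerpt above as `DesiredSet - Delete` entries for the `Cluster`, `RKEControlPlane` and `RKECluster` kinds. As a rough aid for spotting this in a larger log dump, here is a minimal Python sketch that pulls out those entries. The function name and the set of kinds are my own choices based only on this excerpt, and it assumes Rancher is running with debug logging so that these lines are emitted at all.

```python
import re

# Kinds whose deletion accompanies the cluster teardown in the excerpt above.
# This set is an assumption drawn from that excerpt, not an exhaustive list.
CLUSTER_KINDS = {"Cluster", "RKEControlPlane", "RKECluster"}

# Matches entries such as:
#   DesiredSet - Delete rke.cattle.io/v1, Kind=RKECluster fleet-default/<name> for ...
DELETE_RE = re.compile(
    r"DesiredSet - Delete (?P<api>\S+), Kind=(?P<kind>\w+) (?P<target>\S+)"
)


def find_cluster_deletes(log_text):
    """Return (api_version, kind, target) for each cluster-level delete entry."""
    hits = []
    for match in DELETE_RE.finditer(log_text):
        if match.group("kind") in CLUSTER_KINDS:
            hits.append(
                (match.group("api"), match.group("kind"), match.group("target"))
            )
    return hits
```

Because it uses `finditer` over the whole text, it works whether the log entries are on separate lines or run together as in the paste above.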

On RKE2 bootstrap node in rke2-server logs we can see this:

rke2-server logs on bootstrap node

```
Sep 07 11:41:57 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:41:57Z" level=info msg="Removing name=ip-yyy-yy-yy-yyy.eu-central-1.compute.internal-ee7ac07c id=1846382134098187668 address=172.28.74.196 from etcd"
Sep 07 11:41:57 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:41:57Z" level=info msg="Removing name=ip-zzz-zz-zz-zzz.eu-central-1.compute.internal-bc3f1edb id=12710303601531451479 address=172.28.70.189 from etcd"
Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Stopped tunnel to zzz.zz.zz.zzz:9345"
Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Stopped tunnel to yyy.yy.yy.yyy:9345"
Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Proxy done" err="context canceled" url="wss://yyy.yy.yy.yyy:9345/v1-rke2/connect"
Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Proxy done" err="context canceled" url="wss://zzz.zz.zz.zzz:9345/v1-rke2/connect"
```

To Reproduce

Unfortunately I can't reproduce this reliably, but it happens very often. Steps I am using to reproduce this issue:

provision RKE2 cluster with terraform
modify additional_manifest for RKE2 cluster
apply change
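Since the failure is intermittent, the steps above are easiest to exercise in a loop. The following is only a sketch of how that could be automated, not the exact script I used. The helper names are hypothetical, `mutate` is a placeholder for whatever edits the additional_manifest input, and the check assumes `kubectl` is pointed at the Rancher management cluster, where the downstream cluster appears as a `clusters.provisioning.cattle.io` object in `fleet-default`.

```python
import json
import subprocess


def terraform_apply(workdir):
    """Run a non-interactive `terraform apply` in the given directory."""
    subprocess.run(["terraform", "apply", "-auto-approve"], cwd=workdir, check=True)


def cluster_is_going_away(name, namespace="fleet-default"):
    """True if the provisioning cluster object is missing or marked for deletion.

    Assumption: kubectl targets the Rancher management cluster. A non-zero
    exit is treated as "missing", which also covers connection errors.
    """
    proc = subprocess.run(
        ["kubectl", "get", "clusters.provisioning.cattle.io", name,
         "-n", namespace, "-o", "json"],
        capture_output=True, text=True,
    )
    if proc.returncode != 0:
        return True
    metadata = json.loads(proc.stdout).get("metadata", {})
    return "deletionTimestamp" in metadata


def attempts_until_deletion(mutate, apply, is_going_away, max_attempts=10):
    """Repeat mutate -> apply -> check; return the attempt that triggered
    deletion, or None if the cluster survived every attempt."""
    for attempt in range(1, max_attempts + 1):
        mutate(attempt)   # hypothetical: change a label in additional_manifest
        apply()           # e.g. lambda: terraform_apply("./cluster")
        if is_going_away():
            return attempt
    return None
```

Passing the three steps in as callables keeps the loop independent of how the manifest is actually edited, so the same loop can be pointed at a direct API call for comparison.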

Result

Occasionally the managed cluster gets deleted by Rancher.

Expected Result

The change is applied and the cluster is not deleted.

I ran some tests that make exactly the same change (modifying additional_manifest) while bypassing Terraform and calling the Rancher API directly, and that never caused cluster deletion in 2k+ iterations. With the Terraform provider, it sometimes takes up to 10 attempts to reproduce this issue.

I am happy to provide any other info to investigate this further. This is causing massive outages for my clusters as they are just getting destroyed.

snasovich commented 10 months ago

Per @jakefhyde, to fix this we will need to switch from the currently used Norman (v3) APIs to Steve (v1) APIs, or even to native k8s. Both are pretty big undertakings, so this may take a while to address, especially since the immediate issue is now addressed in rancher/rancher 2.7.5+.

a-blender commented 8 months ago

To clarify, TF removing finalizers is caused by using the Norman API. This cannot be fixed in the Norman API itself and that's why addressing the not-ideal cleanup will require an entire refactor to either Steve or native k8s. That work is being tracked here https://github.com/rancher/terraform-provider-rancher2/issues/1134 and completing this issue depends on it.
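For anyone who wants to check whether an apply dropped finalizers in their own environment, one option is to capture the cluster object as JSON before and after the apply and diff the `metadata.finalizers` lists. This is a minimal sketch under that assumption. The function name is made up for illustration, and how the two snapshots are captured is left to you.

```python
def dropped_finalizers(before, after):
    """Return finalizers present in `before` but absent from `after`.

    Both arguments are Kubernetes-style objects parsed from JSON, i.e. dicts
    with an optional metadata.finalizers list.
    """
    old = before.get("metadata", {}).get("finalizers") or []
    new = set(after.get("metadata", {}).get("finalizers") or [])
    return [name for name in old if name not in new]
```

An empty result means no finalizer was lost between the two snapshots. A non-empty result lists the ones that disappeared, in their original order.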

boldynnetwork commented 6 months ago

I have the same problem: a change to the smallest variable/argument of RKE2 tries to recreate the cluster, when it should instead reconfigure and update the cluster.

riuvshyn commented 6 months ago

@boldynnetwork which Rancher version are you using? This issue was actually fixed in 2.7.5.

weiyentan commented 3 months ago

@riuvshyn Happens to me on 2.8.1

weiyentan commented 3 months ago

Although after I upgraded to 4.10, it seemed to be behaving.
