[BUG] Occasionally RKE2 cluster gets destroyed after cluster configuration is changed using terraform provider

riuvshyn commented 1 year ago

Important: Please see https://github.com/rancher/terraform-provider-rancher2/issues/993#issuecomment-1611922983 on the status of this issue following completed investigations.

Rancher Server Setup

Rancher version: 2.6.8
Installation option (Docker install/Helm Chart):
- installed as helm chart
- running on k3s 1.24.4
Proxy/Cert Details: N/A

Information about the Cluster

Kubernetes version: 1.23.9
Cluster Type Downstream:
- Custom RKE2 v1.23.9+rke2r1 running on AWS
- Provisioned with Terraform provider rancher/rancher2 version 1.24.1

cluster configuration:

``` { "kubernetesVersion": "v1.23.9+rke2r1", "rkeConfig": { "upgradeStrategy": { "controlPlaneConcurrency": "1", "controlPlaneDrainOptions": { "enabled": false, "force": false, "ignoreDaemonSets": true, "IgnoreErrors": false, "deleteEmptyDirData": true, "disableEviction": false, "gracePeriod": 0, "timeout": 10800, "skipWaitForDeleteTimeoutSeconds": 600, "preDrainHooks": null, "postDrainHooks": null }, "workerConcurrency": "10%", "workerDrainOptions": { "enabled": false, "force": false, "ignoreDaemonSets": true, "IgnoreErrors": false, "deleteEmptyDirData": true, "disableEviction": false, "gracePeriod": 0, "timeout": 10800, "skipWaitForDeleteTimeoutSeconds": 600, "preDrainHooks": null, "postDrainHooks": null } }, "chartValues": null, "machineGlobalConfig": { "cloud-provider-name": "aws", "cluster-cidr": "100.64.0.0/13", "cluster-dns": "100.64.0.10", "cluster-domain": "cluster.local", "cni": "none", "disable": [ "rke2-ingress-nginx", "rke2-metrics-server", "rke2-canal" ], "disable-cloud-controller": false, "kube-apiserver-arg": [ "allow-privileged=true", "anonymous-auth=false", "feature-gates=CustomCPUCFSQuotaPeriod=true", "api-audiences=https://-oidc.s3.eu-central-1.amazonaws.com,https://kubernetes.default.svc.cluster.local,rke2", "audit-log-maxage=90", "audit-log-maxbackup=10", "audit-log-maxsize=500", "audit-log-path=/var/log/k8s-audit/audit.log", "audit-policy-file=/etc/kubernetes-/audit-policy.yaml", "authorization-mode=Node,RBAC", "bind-address=0.0.0.0", "enable-admission-plugins=PodSecurityPolicy,NodeRestriction", "event-ttl=1h", "kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP", "profiling=false", "request-timeout=60s", "runtime-config=api/all=true", "service-account-key-file=/etc/kubernetes-wise/service-account.pub", "service-account-lookup=true", "service-account-issuer=https://-oidc.s3.eu-central-1.amazonaws.com", "service-account-signing-key-file=/etc/kubernetes-wise/service-account.key", "service-node-port-range=30000-32767", "shutdown-delay-duration=60s", "tls-min-version=VersionTLS12", "tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "tls-min-version=VersionTLS12", "v=2" ], "kube-apiserver-extra-mount": [ "/etc/kubernetes-wise:/etc/kubernetes-wise:ro", "/var/log/k8s-audit:/var/log/k8s-audit:rw" ], "kube-controller-manager-arg": [ "allocate-node-cidrs=true", "attach-detach-reconcile-sync-period=1m0s", "bind-address=0.0.0.0", "configure-cloud-routes=false", "feature-gates=CustomCPUCFSQuotaPeriod=true", "leader-elect=true", "node-monitor-grace-period=2m", "pod-eviction-timeout=220s", "profiling=false", "service-account-private-key-file=/etc/kubernetes-wise/service-account.key", "use-service-account-credentials=true", "terminated-pod-gc-threshold=12500", "tls-min-version=VersionTLS12", "tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "tls-min-version=VersionTLS12" ], "kube-controller-manager-extra-mount": [ "/etc/kubernetes-wise:/etc/kubernetes-wise:ro" ], "kube-proxy-arg": [ "conntrack-max-per-core=131072", "conntrack-tcp-timeout-close-wait=0s", "metrics-bind-address=0.0.0.0", "proxy-mode=iptables" ], "kube-scheduler-arg": [ "bind-address=0.0.0.0", "port=0", "secure-port=10259", "profiling=false", "leader-elect=true", "tls-min-version=VersionTLS12", "v=2", "tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "tls-min-version=VersionTLS12" ], "kubelet-arg": [ "network-plugin=cni", "cni-bin-dir=/opt/cni/bin/", "cni-conf-dir=/etc/cni/net.d/", "feature-gates=CustomCPUCFSQuotaPeriod=true", "config=/etc/kubernetes-wise/kubelet.yaml", "exit-on-lock-contention=true", "lock-file=/var/run/lock/kubelet.lock", "pod-infra-container-image=docker-k8s-gcr-io./pause:3.1", "register-node=true", "tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384", "tls-min-version=VersionTLS12", "v=4" ], "profile": "cis-1.6", "protect-kernel-defaults": true, "service-cidr": "100.64.0.0/13" }, "additionalManifest": "", "registries": { "mirrors": { "docker.io": { "endpoint": [ "" ] }, "gcr.io": { "endpoint": [ "" ] }, "k8s.gcr.io": { "endpoint": [ "" ] }, "quay.io": { "endpoint": [ "" ] } }, "configs": { "": {} } }, "etcd": { "snapshotScheduleCron": "0 */6 * * *", "snapshotRetention": 12, "s3": { "endpoint": "s3.eu-central-1.amazonaws.com", "bucket": "etcd-backups-", "region": "eu-central-1", "folder": "etcd" } } }, "localClusterAuthEndpoint": {}, "defaultClusterRoleForProjectMembers": "user", "enableNetworkPolicy": false } ```

Additional info:

I am using custom CNI: aws-vpc-cni installed via additional_manifest

Describe the bug Occasionally simple cluster configuration change for example change lables in manifests passed via additional_manifest applied with terraform provider causing managed RKE2 cluster to destroy.

terraform plan looks similar to this:

Terraform Plan output

``` Terraform will perform the following actions: # module.cluster.rancher2_cluster_v2.this will be updated in-place ~ resource "rancher2_cluster_v2" "this" { id = "fleet-default/o11y-euc1-se-main01" name = "o11y-euc1-se-main01" # (10 unchanged attributes hidden) ~ rke_config { ~ additional_manifest = <<-EOT --- apiVersion: v1 kind: Namespace metadata: labels: - test: test + test1: test1 name: my-namespace EOT } } ```

Sometimes once change like this is applied rancher immediately trying to delete that managed cluster for some reason. On UI it looks like this:

Rancher UI screenshot:

![image](https://user-images.githubusercontent.com/53786845/188906317-5724573f-f86d-4967-8f00-13adb5c51c3e.png)

Rancher logs:

rancher logs:

``` 2022/09/08 08:58:30 [DEBUG] [planner] rkecluster fleet-default/: unlocking 810235e7-ecc0-4ba7-81c8-55d778594926 2022/09/08 08:58:30 [INFO] [planner] rkecluster fleet-default/: waiting: configuring bootstrap node(s) custom-7808e68fb38f: waiting for plan to be applied 2022/09/08 08:58:30 [DEBUG] [CAPI] Cannot retrieve CRD with metadata only client, falling back to slower listing 2022/09/08 08:58:30 [DEBUG] DesiredSet - Patch rbac.authorization.k8s.io/v1, Kind=Role fleet-default/crt--nodes-manage for auth-prov-v2-roletemplate- nodes-manage -- [PATCH:{"metadata":{"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRTY/bIBD9K9UcK5Oa4NjYUk899FCph9WqlyqHAYYNXQwW4LTSKv+9IrtVrK36cYMH8+Z9PMFMBQ0WhOkJMIRYsLgYcr1G9Y10yVR2ycWdxlI87Vx85wxMgGs5sSXFMzvvWYqeCs2Lx0JMG0ar5mypx5ZD80ei+D1QYg/nR5hgxoAPNFMomw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQS94vuDGMS+l0O3Jz0oGgY96rFFtKq1qA2146Gv214U61Rep8deudjqtJ6oMEMWV1+qw+rkjiwlCpoyTF+fABf3hVJ2McAEtS5Xzy48bFOuHT26UGv94NdcKMFN029trtf+ueJCKdsx3fKedQcpmLIG2SDwICxHYcjC5XhpIK3+JuZjiutSb6CfN+1+sEeZdy7CsYFEOa5J0+fq8vppzSXOjFNr0XDkUiA0v1CSqDk/jGPf9zdUKKV6MnLUvb2hfTsMrUBOYlQ3tBOi10PX7Tu7YRhkK6mXVgm5YRiF4aPppDooudV61TmjPrlAuT6cKakr+BaOl+PlZwAAAP//1WFOc2MDAAA"}},"rules":[{"apiGroups":["cluster.x-k8s.io"],"resourceNames":["custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6","custom-7808e68fb38f","custom-93d19d48b5b8"],"resources":["machines"],"verbs":["*"]}]}, ORIGINAL:{"apiVersion":"rbac.authorization.k8s.io/v1","kind":"Role","metadata":{"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4yRT4/bLBDGv8qrOb4KqQk2xpZ66qGHSj2sql6qHAYYNnRtsACnlaJ894rsVo626p8bPDDP/GaeC8xU0GJBGC+AIcSCxceQ6zXqr2RKprJPPu4NljLR3sc33sIIuJYTW1I8s/OBpThRoXmZsBAzltFqOFvqseGw+61R/BYoscfzE4wwY8BHmimUuw9nsfvvgw/27UOc6NNLg78aBpwJRgjRUmbPvv9Ukxc0tRCuO5hQ0/THLZwwn2AEqfmhE704SNW3BzK9pr43gxkaRKcbh8ZSM3Symr6AmVReL4m9gr3HcRNRYZYcrlOpg1TgB3KUKBjKMH65AC7+M6XsY4ARaiq+nn14vF9mjeLJh5reu2nNhRJsTL+Ett5i5poLrV3LTMMlazslmHYWWS+wE46jsOTgerzuIK3TBvM+xXWpNzDPnfbf2ZPKex/huINEOa7J0Mc65e3TmkucWa8aRVI5LZSD3U91EJYPtlW602pTOTUOLUeuBG4qKTScd8MgpdxUobWWZNVg5J2vbPq+EchJDHpTWyGk6dv20Dp5z3rjnNGcfKBcH86U9E38H47X4/VHAAAA///VWUNFSgMAAA","objectset.rio.cattle.io/id":"auth-prov-v2-roletemplate-","objectset.rio.cattle.io/owner-gvk":"management.cattle.io/v3, Kind=RoleTemplate","objectset.rio.cattle.io/owner-name":"nodes-manage","objectset.rio.cattle.io/owner-namespace":""},"labels":{"objectset.rio.cattle.io/hash":"6b125373268742ec7be77c9c90aafb0facde0956"},"name":"crt--nodes-manage","namespace":"fleet-default","ownerReferences":[{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","name":"","uid":"1b13bbf4-c016-4583-bfda-73a53f1a3def"}]},"rules":[{"apiGroups":["cluster.x-k8s.io"],"resourceNames":["custom-7808e68fb38f","custom-93d19d48b5b8","custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6"],"resources":["machines"],"verbs":["*"]}]}, MODIFIED:{"kind":"Role","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"crt--nodes-manage","namespace":"fleet-default","creationTimestamp":null,"labels":{"objectset.rio.cattle.io/hash":"6b125373268742ec7be77c9c90aafb0facde0956"},"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRTY/bIBD9K9UcK5Oa4NjYUk899FCph9WqlyqHAYYNXQwW4LTSKv+9IrtVrK36cYMH8+Z9PMFMBQ0WhOkJMIRYsLgYcr1G9Y10yVR2ycWdxlI87Vx85wxMgGs5sSXFMzvvWYqeCs2Lx0JMG0ar5mypx5ZD80ei+D1QYg/nR5hgxoAPNFMomw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQS94vuDGMS+l0O3Jz0oGgY96rFFtKq1qA2146Gv214U61Rep8deudjqtJ6oMEMWV1+qw+rkjiwlCpoyTF+fABf3hVJ2McAEtS5Xzy48bFOuHT26UGv94NdcKMFN029trtf+ueJCKdsx3fKedQcpmLIG2SDwICxHYcjC5XhpIK3+JuZjiutSb6CfN+1+sEeZdy7CsYFEOa5J0+fq8vppzSXOjFNr0XDkUiA0v1CSqDk/jGPf9zdUKKV6MnLUvb2hfTsMrUBOYlQ3tBOi10PX7Tu7YRhkK6mXVgm5YRiF4aPppDooudV61TmjPrlAuT6cKakr+BaOl+PlZwAAAP//1WFOc2MDAAA","objectset.rio.cattle.io/id":"auth-prov-v2-roletemplate-","objectset.rio.cattle.io/owner-gvk":"management.cattle.io/v3, Kind=RoleTemplate","objectset.rio.cattle.io/owner-name":"nodes-manage","objectset.rio.cattle.io/owner-namespace":""},"ownerReferences":[{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","name":"","uid":"1b13bbf4-c016-4583-bfda-73a53f1a3def"}]},"rules":[{"verbs":["*"],"apiGroups":["cluster.x-k8s.io"],"resources":["machines"],"resourceNames":["custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6","custom-7808e68fb38f","custom-93d19d48b5b8"]}]}, CURRENT:{"kind":"Role","apiVersion":"rbac.authorization.k8s.io/v1","metadata":{"name":"crt--nodes-manage","namespace":"fleet-default","uid":"2d3678e7-1904-442f-bfa6-ef4ad97baa40","resourceVersion":"32202831","creationTimestamp":"2022-09-08T07:56:23Z","labels":{"objectset.rio.cattle.io/hash":"6b125373268742ec7be77c9c90aafb0facde0956"},"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRS48UIRD+K6aOphmboYd+JJ48eDDxsNl4MXMooNjBpaED9Giymf9umF3TkzU+bvBBffU9nmCmggYLwvQEGEIsWFwMuV6j+ka6ZCq75OJOYymedi6+cwYmwLWc2JLimZ33LEVPhebFYyGmDaNVc7bUY8uh+SNR/B4osYfzI0wwY8AHmimUmw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQRS8f1B9GIvh77bk+4V9b0e9dgiWtVa1Iba8SDrthfFOpXX6bFXLm51Wk9UmCGLqy/VYXVyR5YSBU0Zpq9PgIv7Qim7GGCCWperZxceblOuHT26UGv94NdcKMGm6bc212v/XHGhlO2Ybrlk3WEQTFmDrBd4EJajMGThcrw0kFa/ifmY4rrUG+jnTbsf7HHIOxfh2ECiHNek6XN1ef205hJn1g/tQHKwSgwWml/oKAwfTTeogxo2lFNr0XDkg8ANpQE154dxlFJuqFBKSTLDqOUNr2z7vhXISYxqQzshpO67bt9Zeav1qnNGfXKBcn04U1JX8C0cL8fLzwAAAP//skxNxGMDAAA","objectset.rio.cattle.io/id":"auth-prov-v2-roletemplate-","objectset.rio.cattle.io/owner-gvk":"management.cattle.io/v3, Kind=RoleTemplate","objectset.rio.cattle.io/owner-name":"nodes-manage","objectset.rio.cattle.io/owner-namespace":""},"ownerReferences":[{"apiVersion":"provisioning.cattle.io/v1","kind":"Cluster","name":"","uid":"1b13bbf4-c016-4583-bfda-73a53f1a3def"}],"managedFields":[{"manager":"rancher","operation":"Update","apiVersion":"rbac.authorization.k8s.io/v1","time":"2022-09-08T07:58:00Z","fieldsType":"FieldsV1","fieldsV1":{"f:metadata":{"f:annotations":{".":{},"f:objectset.rio.cattle.io/applied":{},"f:objectset.rio.cattle.io/id":{},"f:objectset.rio.cattle.io/owner-gvk":{},"f:objectset.rio.cattle.io/owner-name":{},"f:objectset.rio.cattle.io/owner-namespace":{}},"f:labels":{".":{},"f:objectset.rio.cattle.io/hash":{}},"f:ownerReferences":{".":{},"k:{\"uid\":\"1b13bbf4-c016-4583-bfda-73a53f1a3def\"}":{}}},"f:rules":{}}}]},"rules":[{"verbs":["*"],"apiGroups":["cluster.x-k8s.io"],"resources":["machines"],"resourceNames":["custom-7808e68fb38f","custom-93d19d48b5b8","custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6"]}]}] 2022/09/08 08:58:30 [DEBUG] DesiredSet - Updated rbac.authorization.k8s.io/v1, Kind=Role fleet-default/crt--nodes-manage for auth-prov-v2-roletemplate- nodes-manage -- application/strategic-merge-patch+json {"metadata":{"annotations":{"objectset.rio.cattle.io/applied":"H4sIAAAAAAAA/4xRTY/bIBD9K9UcK5Oa4NjYUk899FCph9WqlyqHAYYNXQwW4LTSKv+9IrtVrK36cYMH8+Z9PMFMBQ0WhOkJMIRYsLgYcr1G9Y10yVR2ycWdxlI87Vx85wxMgGs5sSXFMzvvWYqeCs2Lx0JMG0ar5mypx5ZD80ei+D1QYg/nR5hgxoAPNFMomw9n0bz55IJ5fxc93b8s+CdhwJlgghANZfbM+18zeUFdB+HSgE50DeLezZQLzgtMYfW+AY+K/F/jOWE+wQS94vuDGMS+l0O3Jz0oGgY96rFFtKq1qA2146Gv214U61Rep8deudjqtJ6oMEMWV1+qw+rkjiwlCpoyTF+fABf3hVJ2McAEtS5Xzy48bFOuHT26UGv94NdcKMFN029trtf+ueJCKdsx3fKedQcpmLIG2SDwICxHYcjC5XhpIK3+JuZjiutSb6CfN+1+sEeZdy7CsYFEOa5J0+fq8vppzSXOjFNr0XDkUiA0v1CSqDk/jGPf9zdUKKV6MnLUvb2hfTsMrUBOYlQ3tBOi10PX7Tu7YRhkK6mXVgm5YRiF4aPppDooudV61TmjPrlAuT6cKakr+BaOl+PlZwAAAP//1WFOc2MDAAA"}},"rules":[{"apiGroups":["cluster.x-k8s.io"],"resourceNames":["custom-1e0fad1a183a","custom-e8ac11599666","custom-3bbb6ed89c6f","custom-607703a1e39b","custom-4336c74424f6","custom-7808e68fb38f","custom-93d19d48b5b8"],"resources":["machines"],"verbs":["*"]}]} 2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) /v1, Kind=ServiceAccount fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f 2022/09/08 08:58:30 [DEBUG] [plansecret] reconciling secret fleet-default/custom-7808e68fb38f-machine-plan 2022/09/08 08:58:30 [DEBUG] [plansecret] fleet-default/custom-7808e68fb38f-machine-plan: rv: 32202835: Reconciling machine PlanApplied condition to nil 2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) /v1, Kind=Secret fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f 2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) rbac.authorization.k8s.io/v1, Kind=Role fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f 2022/09/08 08:58:30 [DEBUG] DesiredSet - No change(2) rbac.authorization.k8s.io/v1, Kind=RoleBinding fleet-default/custom-7808e68fb38f-machine-plan for rke-machine fleet-default/custom-7808e68fb38f 2022/09/08 08:58:30 [DEBUG] [CAPI] Reconciling 2022/09/08 08:58:30 [DEBUG] [CAPI] Cluster still exists 2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete cluster.x-k8s.io/v1beta1, Kind=Cluster fleet-default/ for rke-cluster fleet-default/ 2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete rke.cattle.io/v1, Kind=RKEControlPlane fleet-default/ for rke-cluster fleet-default/ 2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete rke.cattle.io/v1, Kind=RKECluster fleet-default/ for rke-cluster fleet-default/ 2022/09/08 08:58:30 [DEBUG] [rkecontrolplane] (fleet-default/) Peforming removal of rkecontrolplane 2022/09/08 08:58:30 [DEBUG] [rkecontrolplane] (fleet-default/) listed 3 machines during removal 2022/09/08 08:58:30 [DEBUG] [UnmanagedMachine] Removing machine fleet-default/custom-607703a1e39b in cluster 2022/09/08 08:58:30 [DEBUG] [UnmanagedMachine] Safe removal for machine fleet-default/custom-607703a1e39b in cluster not necessary as it is not an etcd node ```

On RKE2 bootstrap node in rke2-server logs we can see this:

rke2-server logs on bootstrap node

``` Sep 07 11:41:57 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:41:57Z" level=info msg="Removing name=ip-yyy-yy-yy-yyy.eu-central-1.compute.internal-ee7ac07c id=1846382134098187668 address=172.28.74.196 from etcd" Sep 07 11:41:57 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:41:57Z" level=info msg="Removing name=ip-zzz-zz-zz-zzz.eu-central-1.compute.internal-bc3f1edb id=12710303601531451479 address=172.28.70.189 from etcd" Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Stopped tunnel to zzz.zz.zz.zzz:9345" Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Stopped tunnel to yyy.yy.yy.yyy:9345" Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Proxy done" err="context canceled" url="wss://yyy.yy.yy.yyy:9345/v1-rke2/connect" Sep 07 11:42:10 ip-xxx-xx-xx-xxx rke2[1080]: time="2022-09-07T11:42:10Z" level=info msg="Proxy done" err="context canceled" url="wss://zzz.zz.zz.zzz:9345/v1-rke2/connect" ```

To Reproduce Unfortunately I can't reproduce this reliably but this happens very often. Steps I am using to reproduce this issue:

provision RKE2 cluster with terraform
modify additional_manifest for RKE2 cluster
apply change

Result Occasionally managed cluster gets deleted by rancher.

Expected Result Change is actually applied and clusters is not deleted.

I did some tests that does exactly the same change (modify additional_manifest) bypassing terraform by calling rancher API directly and that never caused cluster deletion for 2k+ iterations. While using terraform provider some times it takes up to 10 attempts to reproduce this issue.

I am happy to provide any other info to investigate this further. This is causing massive outages for my clusters as they are just getting destroyed.

weiyentan commented 1 year ago

I am a bit confused. Are you building the local rancher through terraform and changing that destroys the downstream managed clusters? I always thought that if you have a managed rancher downstream cluster that you wouldn't need to manage by terraform and you would just manage through rancher directly....

riuvshyn commented 1 year ago

hello, I am using rancher2_cluster_v2 resource to provision and manage downstream rke2 clusters and sometimes when I change annotation or label on that terraform resource and apply it, it destroys my clusters I use terraform for gitOps so my cluster configuration is defined as code.

weiyentan commented 1 year ago

So is terraform destroying your cluster because from what I can see its rancher that is destroying them. Does the terraform rancher rke cluster import the cluster into rancher (is it imported?) because from what I can see based on your logs. That cluster is managed by rancher cluster management tool, hence it appears in the provisioning logs.

I thought that the terraform resource creates a separate cluster that is standalone and then you import it into rancher.

If it is created by rancher provisioning tool then rancher does do self healing on the nodes.

riuvshyn commented 1 year ago

a bit more info I've posted there: https://github.com/rancher/rancher/issues/38833

so I am provisioning custom cluster where with terraform I create rke2 cluster in rancher and then pass join command from it to my EC2 boxes provisioned with the same terraform code.

It is hard to say if this is terraform provider destroys cluster or rancher, but what I am doing is this:

update rancher2_cluster_v2 resource change labels for example
run terraform apply sometimes change is just applied, sometimes clusters gets deleted...

weiyentan commented 1 year ago

What do you mean by join command? The fact that you can see it being destroyed in the rancher ui means that it is not terraform destroying it. It's rancher. It sounds like you have joined the rke2 cluster (not quite sure how after terraform) and then when you change it through terraform, rancher fleet notices a change and then remediates the change by deleting and stands up the vm's again. Once rancher controls the vms in its cluster management (not importing) it's not just practice to manage the devices outside of it because rancher will try and self heal

riuvshyn commented 1 year ago

join command (node command): rancher2_cluster_v2.this.cluster_registration_token.node_command

The fact that you can see it being destroyed in the rancher ui means that it is not terraform destroying it. It's rancher. well that's correct it is rancher but what I mean is that it is not clear if terraform provider requested deletion or some bug in rancher caused this.

so I yeah i do not use machine_pools, I create rke2 clsuter with rancher2_cluster_v2 resouorce and then join EC2 nodes to the cluster separately. Everything works just fine, but sometimes when I change my cluster configuration and apply it cluster gets deleted.

weiyentan commented 1 year ago

What about trying to using terraform to import the cluster, not join it in terraform? Is there an option for that? It maybe in a race condition when using terraform. This way when you import terraform is the only one managing the cluster not rancher. The screenshots that you gave is rancher cluster management deleting the cluster not terraform.

If you import the terraform created cluster rancher won't manage the nodes and you remove that issue is rancher deleting the nodes.

riuvshyn commented 1 year ago

I think that resource rancher2_cluster_v2 is not reliable and it doesn't matter if it is custom cluster or not, since I am able to reproduce this issue with simle lable change on that resource when it applied cluster got deleted.

weiyentan commented 1 year ago

The thing is though the below screen shot is NOT terraform. It's rancher deleting the nodes. Terraform has nothing to do with it

riuvshyn commented 1 year ago

yeah, you are right, but the change is applied via terraform, I was running tests doing the same change to the same cluster (initially provisioned by terraofrm) but via rancher http API and also via k8s Cluster resource and that works just fine for ~2k+ iterations which makes me believe that cluster configuration is fine but when any change is applied via terraform on rancher2_cluster_v2 resource which then somehow cause this issue maybe terraform provider is using not reliable API to modify cluster which does something weird.

weiyentan commented 1 year ago

Well perhaps. Perhaps there is not something in the subsequent terraform activity that causes fleet in rancher to go "hey we need to rectify a potential issue with the cluster " whereas when you interact with it via kubectl and naturally through rancher it does not. Because in this instance your build joined it to the rancher cluster which means any subsequent changes should be managed by rancher.

If you want to manage through terraform I would suggest in your resource do not join the cluster to rancher. Instead build a standalone cluster and -import- that cluster into rancher, then that you can manage the rke2 cluster through terraform without rancher fleet getting in the way and trying to remediate the nodes.

riuvshyn commented 1 year ago

I should try to reproduce this with just dummy rancher2_cluster_v2 without even nodes I have feeling it will have the same issue.

weiyentan commented 1 year ago

Not following the logic there. What would you get if the if you have a cluster with no nodes?

riuvshyn commented 1 year ago

Well if this issue will be reproduced that would clearly highlight that there is a bug in provider or API it uses. obviously that is not going to be usable cluster but still is that would be expected if even dummy cluster will be deleted after label change?

weiyentan commented 1 year ago

Well if the cluster is managed by rancher in that fashion and you change it via terraform it will most likely cause that error because there is a discrepancy. When you do your test make sure you Import the downstream cluster instead.

That way rancher is managing the work loads of k8s and not the whole kubernetes infrastructure

riuvshyn commented 1 year ago

hmm, I am abit confused. what is the purpose then of rancher2_cluster_v2 resource?

weiyentan commented 1 year ago

Ok. Seems like we are getting into a bit of misunderstanding. This resource is to build a rancher app cluster or to build standalone rke2 cluster? If it's rancher are you defining the downstream clusters in the rancher terraform resource?

riuvshyn commented 1 year ago

I want to use rancher2_cluster_v2 resource to build rancher managed rke2 cluster. Once the cluster is created then I am getting node-command from that resource like that rancher2_cluster_v2.this.cluster_registration_token[0].node_command and passing it to my EC2s in user-data so when instances gets provisioned they join the cluster.

weiyentan commented 1 year ago

Ok. I am looking at the example in the resource at

https://registry.terraform.io/providers/rancher/rancher2/latest/docs/resources/cluster_v2

I can see that is adding as a cluster for rancher to manage not to import based on this block:


rke_config {
    machine_pools {
      name = "pool1"
      cloud_credential_secret_name = rancher2_cloud_credential.foo.id
      control_plane_role = true
      etcd_role = true
      worker_role = true
      quantity = 1
      machine_config {
        kind = rancher2_machine_config_v2.foo.kind
        name = rancher2_machine_config_v2.foo.name
      }
    }

That means you are adding it for terraform to do the initial add but once it is built by right because rancher fleet is managing it any changes outside rancher fleet is going to naturally change it back to what it was which makes sense. this use case would be to initially set up the new cluster. From then on at this stage I think any changes it is best managed by rancher.

I can see that this resource is in tech preview so this maybe something that could potentially be overlooked. Based on my experience with fleet and it's disposable nature what we are experiencing here is a race condition between terraform changes and rancher fleet trying to remediate the changes that terraform made.

Perhaps a bug after all and something that can be looked at.

Martin-Weiss commented 1 year ago

@weiyentan - where do you see that this resource is in tech preview? I thought this is GA since a while?

weiyentan commented 1 year ago

@weiyentan - where do you see that this resource is in tech preview? I thought this is GA since a while?

Provides a Rancher v2 Cluster v2 resource. This can be used to create RKE2 and K3S Clusters for Rancher v2 environments and retrieve their information. This resource is supported as tech preview from Rancher v2.6.0 and above.

In the url

Martin-Weiss commented 1 year ago

This seems to be outdated information. RKE2 provisioning is GA since a long time.. I believe with 2.6.3 it got GA. Just K3S seems to be tech preview, still.

weiyentan commented 1 year ago

That's good to know. The race condition still remains. Rancher cluster management is still recreating the cluster when terraform makes subsequent changes. My main point was that once the cluster is managed by rancher, only manage it by rancher.

I have clusters that are managed by ansible and terraform and in those cases I import the cluster into rancher. I don't get the issue of rancher recreating the cluster at all.

riuvshyn commented 1 year ago

@weiyentan

That means you are adding it for terraform to do the initial add but once it is built by right because rancher fleet is managing it any changes outside rancher fleet is going to naturally change it back to what it was which makes sense. this use case would be to initially set up the new cluster. From then on at this stage I think any changes it is best managed by rancher.

machine_pools is supposed to be optional as per documentation: machine_pools - (Optional/Computed) Cluster V2 machine pools (list)

So my expectation is that it must respect machines added to the cluster outside of terraform rancher2_cluster_v2 resource and that is why I have opened this ticket.

For now as an workaround I am creating Cluster object using kubernetes terraform provider like this:

resource "kubernetes_manifest" "rancher_cluster" {
  manifest = {
    "apiVersion" = "provisioning.cattle.io/v1"
    "kind"       = "Cluster"
    "metadata" = {...}
    "spec" = {...}
  }
}

I am still not setting machine pools and this works without any issues, I did similar tests and rancher successfully reconciled ~2k changes to the cluster configuration without any issues.

The reason why I don't want to use machine pools is that I really want to use aws ASGs which are not supported atm from what I can see.

jakefhyde commented 1 year ago

@riuvshyn Can you attach a portion of your terraform files, specifically how you are defining your rancher2_cluster_v2 resource, and the modifications you are making to it (both in and out of terraform)? I've modified the additional_manifest section of a custom cluster 5 times now, and I haven't been able to reproduce this so far. Additionally, custom clusters (and imported clusters) do not use the machine_pools resource, which is why it is optional. The machine_pools field defines infrastructure provisioned by rancher, so if you are mixing custom clusters with node provisioning you're going to quickly run into issues.

riuvshyn commented 1 year ago

@jakefhyde thanks for looking in to this. Sometimes it takes up to 20 iterations or more to reproduce.

here is my rancher2_cluster_v2:

I've dropped additional_manifest as I was able to reproduce this issue with just changing labels with something like this:

"test1" = formatdate("DD-MMM-YYYY-hh-mm-ss", timestamp())

so each iteration it gets new value.

I am not 100% sure it is actually wants to delete cluster object so not sure if this can be reproduces without machines.

NOTE: in additional_manifest I am passing AWS VPC CNI manifests, so that is why I have set cni: none but I don't think this is related to the issue.

resource "rancher2_cluster_v2" "this" {
  name = "test"
  enable_network_policy = false
  kubernetes_version = "v1.23.9+rke2r1"
  default_cluster_role_for_project_members = "user"
  labels = {
    "provider.cattle.io"             = "rke2"
    "test1" = formatdate("DD-MMM-YYYY-hh-mm-ss", timestamp())
    "tets2" = formatdate("DD-MMM-YYYY-hh-mm-ss", timestamp())
    "tets3" = formatdate("DD-MMM-YYYY-hh-mm-ss", timestamp())
  }

  annotations = {
    "ui.rancher/badge-color" : "#ffb619"
    "ui.rancher/badge-icon-text" : "T"
    "ui.rancher/badge-text" : "TEST"
  }

  rke_config {
    additional_manifest = "..."
    etcd {
      disable_snapshots      = false
      snapshot_schedule_cron = "0 */6 * * *" # every 6h
      snapshot_retention     = 12

      s3_config {
        bucket   = "my-backups"
        endpoint = "s3.eu-central-1.amazonaws.com"
        folder   = "etcd"
        region   = "eu-central-1"
      }
    }

    machine_global_config = <<EOF
    # CIS profile
    profile: cis-1.6
    protect-kernel-defaults: true
    cluster-cidr: 100.64.0.0/13
    cluster-dns: 100.64.0.10
    cluster-domain: cluster.local
    cni: none
    service-cidr: 100.64.0.0/13
    disable-cloud-controller: false
    cloud-provider-name: aws
    kube-apiserver-arg:
      - allow-privileged=true
      - anonymous-auth=false
      - feature-gates=${join(",", ["CustomCPUCFSQuotaPeriod=true"])}
      - api-audiences=https://${module.oidc.oidc_issuer_domain},https://kubernetes.default.svc.cluster.local,rke2
      - audit-log-maxage=90
      - audit-log-maxbackup=10
      - audit-log-maxsize=500
      - audit-log-path=/var/log/k8s-audit/audit.log
      - audit-policy-file=/etc/kubernetes-test/audit-policy.yaml
      - authorization-mode=Node,RBAC
      - bind-address=0.0.0.0
      - enable-admission-plugins=PodSecurityPolicy,NodeRestriction
      - event-ttl=1h
      - kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP
      - profiling=false
      - request-timeout=60s
      - runtime-config=api/all=true
      - service-account-key-file=/etc/kubernetes-test/service-account.pub
      - service-account-lookup=true
      - service-account-issuer=https://${module.oidc.oidc_issuer_domain}
      - service-account-signing-key-file=/etc/kubernetes-test/service-account.key
      - service-node-port-range=30000-32767
      - shutdown-delay-duration=60s
      - tls-min-version=VersionTLS12
      - tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - tls-min-version=VersionTLS12
      - v=2
    kube-apiserver-extra-mount:
      - "/etc/kubernetes-test:/etc/kubernetes-test:ro"
      - "/var/log/k8s-audit:/var/log/k8s-audit:rw"
    kubelet-arg:
      - network-plugin=cni
      - cni-bin-dir=/opt/cni/bin/
      - cni-conf-dir=/etc/cni/net.d/
      - feature-gates=${join(",", ["CustomCPUCFSQuotaPeriod=true"])}
      - config=/etc/kubernetes-test/kubelet.yaml
      - exit-on-lock-contention=true
      - lock-file=/var/run/lock/kubelet.lock
      - pod-infra-container-image=rancher/pause:3.1
      - register-node=true
      - tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - tls-min-version=VersionTLS12
      - v=4
    kube-controller-manager-arg:
      - allocate-node-cidrs=true
      - attach-detach-reconcile-sync-period=1m0s
      - bind-address=0.0.0.0
      - configure-cloud-routes=false
      - feature-gates=${join(",", ["CustomCPUCFSQuotaPeriod=true"])}
      - leader-elect=true
      - node-monitor-grace-period=2m
      - pod-eviction-timeout=220s
      - profiling=false
      - service-account-private-key-file=/etc/kubernetes-test/service-account.key
      - use-service-account-credentials=true
      - terminated-pod-gc-threshold=12500
      - tls-min-version=VersionTLS12
      - tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - tls-min-version=VersionTLS12
    kube-controller-manager-extra-mount:
      - "/etc/kubernetes-test:/etc/kubernetes-test:ro"
    kube-proxy-arg:
      - conntrack-max-per-core=131072
      - conntrack-tcp-timeout-close-wait=0s
      - metrics-bind-address=0.0.0.0
      - proxy-mode=iptables
    kube-scheduler-arg:
      - bind-address=0.0.0.0
      - port=0
      - secure-port=10259
      - profiling=false
      - leader-elect=true
      - tls-min-version=VersionTLS12
      - v=2
      - tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - tls-min-version=VersionTLS12
    disable:
      - rke2-ingress-nginx
      - rke2-metrics-server
      - rke2-canal
    EOF

    upgrade_strategy {
      control_plane_concurrency = "1"
      worker_concurrency = "10%"
      control_plane_drain_options {
        enabled                              = false
        force                                = false
        ignore_daemon_sets                   = true
        ignore_errors                        = false
        delete_empty_dir_data                = true
        disable_eviction                     = false
        grace_period                         = -1
        timeout                              = 10800
        skip_wait_for_delete_timeout_seconds = 600
      }
      worker_drain_options {
        enabled                              = false
        force                                = false
        ignore_daemon_sets                   = true
        ignore_errors                        = false
        delete_empty_dir_data                = true
        disable_eviction                     = false
        grace_period                         = -1
        timeout                              = 10800
        skip_wait_for_delete_timeout_seconds = 600
      }
    }
  }
}

jakefhyde commented 1 year ago

@riuvshyn Thanks for updating, are there any other modifications you're making/have made to the cluster within rancher that are not reflect/were not applied within terraform? I've created a cluster with the following labels:

  labels                                   = {
    "provider.cattle.io" = "rke2"
    "test1"              = formatdate("DD-MMM-YYYY-hh-mm-ss", timestamp())
    "test2"              = formatdate("DD-MMM-YYYY-hh-mm-ss", timestamp())
    "test3"              = formatdate("DD-MMM-YYYY-hh-mm-ss", timestamp())
  }

and set it up to continuously terraform apply, after ~200 attempts I haven't been able to reproduce the issue.

riuvshyn commented 1 year ago

@jakefhyde nope, that was enough for me to reproduce it. do you have machines in your cluster? It might be related to the fact that I am not using machine_pools and my nodes join cluster with nodeCommand

I am not sure if it is actually deleting cluster but it wanted to kill all machines, and then stuck on the bootstrap node which was in deleting state but I guess blocked by finalizer.

jakefhyde commented 1 year ago

@riuvshyn I was using a custom cluster as well. From what I can tell in the Rancher UI screenshot you shared, the cluster itself isn't being deleted since it still says active, but it does look like the machines were being destroyed. I think that there is probably more at play here, does this only happen with AWS ASGs?

riuvshyn commented 1 year ago

@jakefhyde yeah... I never got cluster deleted as it was never deleted the bootstrap machine for some reason it got stuck on it. however in rancher logs every time I saw this:

2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete cluster.x-k8s.io/v1beta1, Kind=Cluster fleet-default/<REDACTED> for rke-cluster fleet-default/<REDACTED>
2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete rke.cattle.io/v1, Kind=RKEControlPlane fleet-default/<REDACTED> for rke-cluster fleet-default/<REDACTED>
2022/09/08 08:58:30 [DEBUG] DesiredSet - Delete rke.cattle.io/v1, Kind=RKECluster fleet-default/<REDACTED> for rke-cluster fleet-default/<REDACTED>

I don't know for sure if that is clear indication that it is actually about to delete cluster or not...

re ASGs, yes that is the only configuration I have and I have never tried to use anything else. Just to make it clear ASGs are managed separately by terraform not rancher, so I just inject joinCommand into user-data so it can join the cluster when node started.

Mybe this is also somehow related to custom CNI (I am using AWS VPC CNI installed via additional_manifests)? or CIS enabled...

riuvshyn commented 1 year ago

UPD: sometimes it is really hard to reproduce, yesterday I wanted to reproduce this issue again and that didn't happen for ~1k iterations and today I have tried it again and it took ~50 iterations to reproduce... I don't know if that is related or not but I have added 100 ConfigMap objects in additional_manifest and each iteration they got new random names. This looks like some weird race condition

riuvshyn commented 1 year ago

@jakefhyde I think I finally figured out how to reproduce it... Recently on one of the rancher env setups I wasn't able to reproduce this issue at all, which was very confusing because few days before It was definitely happening there. So then I've noticed that this cluster that hosts rancher was re-provisioned and it is in kind of "fresh" state. I've also checked that on previous iteration of that cluster were executed some tests for backup/restore rancher with backup-restore-operator. So on that "fresh" rancher setup I have provisioned rke2 cluster the same way it is described in this ticket, performed backup and restrore of rancher and then started simple test (modify cluster label and label in manifests defined in additional_manifest) again and my on 3rd iteration my cluster got to that state:

all nodes are deleted except this one I guess it is stuck on some finalizer.

so steps to reproduce it:

provision rancher env
create custom rke2 cluster, no node-pool is defined nodes are joined using nodeCommand
perform rancher backup following official doc
perform rancher restore following official doc with prune: false
modify rancher2_cluster_v2 change cluster labels and anything in additional_manifest X times until issue is reproduced.

Additional notes: when that issue happens and I destroy broken cluster (it disappear from rancher UI) then re-provision the same cluster back everything looks normal but that issue is still can be reproduced with this cluster just by modifying cluster config via terraform provider.

I hope that will help you to reproduce this.

riuvshyn commented 1 year ago

@jakefhyde any luck with reproducing it? 🙏🏽

riuvshyn commented 1 year ago

@jakefhyde I have an update on this one: I believe that this is rancher backup/restore operator causing this. Sometimes when I perform rancher restore operation I am hitting errors like that:

ERRO[2022/10/25 00:26:59] Error restoring resource mcc-rancher-euc1-te-test02-managed-system-upgrade-controller of type fleet.cattle.io/v1alpha1, Resource=bundledeployments: restoreResource: err updating status resource Operation cannot be fulfilled on bundledeployments.fleet.cattle.io "mcc-rancher-euc1-te-test02-managed-system-upgrade-controller": the object has been modified; please apply your changes to the latest version and try again
ERRO[2022/10/25 00:27:05] Error restoring resource mcc-rancher-euc1-te-test03-managed-system-upgrade-controller of type fleet.cattle.io/v1alpha1, Resource=bundledeployments: restoreResource: err updating status resource Operation cannot be fulfilled on bundledeployments.fleet.cattle.io "mcc-rancher-euc1-te-test03-managed-system-upgrade-controller": the object has been modified; please apply your changes to the latest version and try again
ERRO[2022/10/25 00:27:22] Error restoring namespaced resources [error restoring mcc-rancher-euc1-te-test02-managed-system-upgrade-controller of type fleet.cattle.io/v1alpha1, Resource=bundledeployments: restoreResource: err updating status resource Operation cannot be fulfilled on bundledeployments.fleet.cattle.io "mcc-rancher-euc1-te-test02-managed-system-upgrade-controller": the object has been modified; please apply your changes to the latest version and try again error restoring mcc-rancher-euc1-te-test03-managed-system-upgrade-controller of type fleet.cattle.io/v1alpha1, Resource=bundledeployments: restoreResource: err updating status resource Operation cannot be fulfilled on bundledeployments.fleet.cattle.io "mcc-rancher-euc1-te-test03-managed-system-upgrade-controller": the object has been modified; please apply your changes to the latest version and try again]

And that is happening I believe because rancher backup / restore operator is supposed to scale down rancher before doing actual restore and it is doing that

INFO[2022/10/25 00:29:23] Scaling down controllerRef apps/v1/deployments/rancher to 0

but it doesn't wait for it to actually be fully stopped and starts restore right away and since termination is not instant and restore is already happening it corrupts the data somehow and after restore is complete with such errors then this bug can be reproduced changing just a label via terraform on a managed cluster cause cluster deletion.

So maybe there is nothing to do with terraform provider actually...

Josh-Diamond commented 1 year ago

self-assigned this issue, as I was tasked w/ forwardport #998

riuvshyn commented 1 year ago

bug is still actual with rancher 2.7.0 and tf provider 1.25.0 reproduce steps are the same:

provision managed rke2 clsuter with terraform
change managed rke2 cluster configuration with terraform (changing label is enough)
repeat 2. X times

I just reproduced this from 6 attempts:

@Josh-Diamond any luck with reproducing this on your side?

riuvshyn commented 1 year ago

Reproduced on 2.7.1 I also noticed that issue is much easier to reproduce after performing rancher backup/restore might be it somehow related...

cc @Josh-Diamond @jakefhyde

jakefhyde commented 1 year ago

@riuvshyn Thank you for updating the issue, I think we've narrowed down the reproduction steps with the issue to start investigating this again.

riuvshyn commented 1 year ago

@jakefhyde oh thanks for sharing this, I was afraid that this bothers only me :) please let me know if you need any details about my setup.

davidhrbac commented 1 year ago

Rancher 2.7.2 and TF provider 1.25 - same here. Updated RKE, K3s clusters. RKE2 clusters destroyed and recreated.

jakefhyde commented 1 year ago

@riuvshyn Just out of curiosity, (I'm not having issues reproducing), but can you share the restore spec?

riuvshyn commented 1 year ago

@jakefhyde sure,

apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: my-restore
spec:
  backupFilename: test-**************.tar.gz

This was happening even without backup/restore but after backup/restore it was much easier to reproduce so I am not sure if that is related.

I will try to reproduce this with latest rancher and latest tf provider.

riuvshyn commented 1 year ago

@jakefhyde

Ok, got this reproduced on fresh setup, even without backup/restore rancher version: 2.7.3 terraform provider: 3.0.0 rke2 version: v1.24.9+rke2r2 terraform: 1.0.8

It took 19 iterations to reproduce the only change was applying is label:

resource "rancher2_cluster_v2" "this" {
...
  labels = {
    "provider.cattle.io" = "rke2"
    "test"               = "test_${formatdate("hh_mm_ss", timestamp())}"
  }
...
}

here is example of rancher2_cluster_v2 resource maybe that will help to reproduce it.

resource "rancher2_cluster_v2" "this" {
  name = "test-cluster"
  enable_network_policy = false
  kubernetes_version = "v1.24.9+rke2r2"
  default_cluster_role_for_project_members = "user"

  labels = {
    "provider.cattle.io" = "rke2"
    "test"               = "test_${formatdate("hh_mm_ss", timestamp())}"
  }

  annotations = {
    "aws.wise.com/region" = "eu-cenral-1"
    "ui.rancher/badge-color" = "#ffb619"
    "ui.rancher/badge-icon-text" = "TEST"
    "ui.rancher/badge-text" = "TEST"
  }

  rke_config {
    # Note: additional_manifest expects string manifests so here we join content of multiple manifests files and passing it as string.
    additional_manifest = join("\n", [
      templatefile("${path.module}/manifests/cilium.yaml", {
        cluster             = "test-cluster"
      }),
      templatefile("${path.module}/manifests/aws-cloud-controller-chart.yaml", {
        cluster             = "test-cluster"
      }),
      templatefile("${path.module}/manifests/rke2-coredns-values.yaml", {
        cluster             = "test-cluster"
      }),
      templatefile("${path.module}/manifests/node-local-dns-chart.yaml", {
        cluster             = "test-cluster"
      }),
      templatefile("${path.module}/manifests/crds.yaml", {
        cluster             = "test-cluster"
      }),
    ])
    registries {
      configs {
        hostname = "registry.com"
      }
      dynamic "mirrors" {
        for_each = local.container_registry_mirrors
        content {
          hostname  = mirrors.value["hostname"]
          endpoints = mirrors.value["endpoints"]
        }
      }
    }

    etcd {
      disable_snapshots      = false
      snapshot_schedule_cron = "0 */4 * * *" # every 6h
      snapshot_retention     = 6

      s3_config {
        bucket   = "backups"
        endpoint = "s3.${module.networking.vpc_region}.amazonaws.com"
        folder   = "etcd"
        region   = "eu-central-1"
      }
    }

    machine_global_config = <<EOF
    # CIS profile
    profile: cis-1.6
    protect-kernel-defaults: true
    cluster-cidr: 100.64.128.0/17
    cluster-dns: 100.64.0.2
    cluster-domain: cluster.local
    cni: cilium
    service-cidr: 100.64.0.0/17 
    # Note: This is disabled because we are using out-of-tree CCM
    disable-cloud-controller: true
    # Note: Kubelet is using the out-of-tree aws-cloud-controller-manager.
    #       See 'manifests/aws-cloud-controller.yaml' for the relevant manifests.
    cloud-provider-name: external
    etcd-expose-metrics: true
    kube-apiserver-arg:
      - allow-privileged=true
      - anonymous-auth=false
      - feature-gates=${join(",", ["CustomCPUCFSQuotaPeriod=true"])}
      - api-audiences=https://${module.oidc.oidc_issuer_domain},https://kubernetes.default.svc.cluster.local,rke2
      - audit-log-maxage=90
      - audit-log-maxbackup=10
      - audit-log-maxsize=500
      - audit-log-path=/var/log/k8s-audit/audit.log
      - audit-policy-file=/etc/kubernetes/audit-policy.yaml
      - authorization-mode=Node,RBAC
      - bind-address=0.0.0.0
      - enable-admission-plugins=PodSecurityPolicy,NodeRestriction
      - event-ttl=1h
      - kubelet-preferred-address-types=InternalIP,Hostname,ExternalIP
      - profiling=false
      - request-timeout=60s
      - runtime-config=api/all=true
      - service-account-key-file=/etc/kubernetes/service-account.pub
      - service-account-lookup=true
      - service-account-issuer=https://${module.oidc.oidc_issuer_domain}
      - service-account-signing-key-file=/etc/kubernetes/service-account.key
      - service-node-port-range=30000-32767
      - shutdown-delay-duration=60s
      - tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - tls-min-version=VersionTLS12
      - v=2
    kube-apiserver-extra-mount:
      - "/etc/kubernetes:/etc/kubernetes:ro"
      - "/var/log/k8s-audit:/var/log/k8s-audit:rw"
    kubelet-arg:
      - feature-gates=${join(",", ["CustomCPUCFSQuotaPeriod=true"])}
      - config=/etc/kubernetes/kubelet.yaml
      - exit-on-lock-contention=true
      - lock-file=/var/run/lock/kubelet.lock
      - pod-infra-container-image=registry.k8s.io/pause:3.1
      - register-node=true
      - tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - tls-min-version=VersionTLS12
      - v=4
    kube-controller-manager-arg:
      # Note: This will allow each node to have 256 assignable overlay addresses
      - node-cidr-mask-size=24
      - allocate-node-cidrs=true
      - attach-detach-reconcile-sync-period=1m0s
      # Note: Bind to all interfaces so that we can scrape the metrics.
      - bind-address=0.0.0.0
      - configure-cloud-routes=false
      # Note: Set custom feature gates that we have set in production
      - feature-gates=${join(",", ["CustomCPUCFSQuotaPeriod=true"])}
      - leader-elect=true
      - node-monitor-grace-period=2m
      - pod-eviction-timeout=220s
      - profiling=false
      - service-account-private-key-file=/etc/kubernetes/service-account.key
      - use-service-account-credentials=true
      - terminated-pod-gc-threshold=12500
      # Note: Set CIS 1.6 hardened recommended TLS cipher suites
      - tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - tls-min-version=VersionTLS12
    kube-controller-manager-extra-mount:
      - "/etc/kubernetes:/etc/kubernetes:ro"
    kube-proxy-arg:
      - conntrack-max-per-core=131072
      - conntrack-tcp-timeout-close-wait=0s
      - metrics-bind-address=0.0.0.0
      - proxy-mode=iptables
    kube-scheduler-arg:
      - bind-address=0.0.0.0
      - secure-port=10259
      - profiling=false
      - leader-elect=true
      - v=2
      # Note: Set CIS 1.6 hardened recommended TLS cipher suites
      - tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384
      - tls-min-version=VersionTLS12
    disable:
      - rke2-ingress-nginx
      - rke2-metrics-server
      - rke2-canal
    EOF

    upgrade_strategy {
      control_plane_concurrency = "1"

      worker_concurrency = "10%"

      control_plane_drain_options {
        enabled                              = false
        force                                = false
        ignore_daemon_sets                   = true
        ignore_errors                        = false
        delete_empty_dir_data                = true
        disable_eviction                     = false
        grace_period                         = 0
        timeout                              = 10800
        skip_wait_for_delete_timeout_seconds = 600
      }
      worker_drain_options {
        enabled                              = false
        force                                = false
        ignore_daemon_sets                   = true
        ignore_errors                        = false
        delete_empty_dir_data                = true
        disable_eviction                     = false
        grace_period                         = 0
        timeout                              = 10800
        skip_wait_for_delete_timeout_seconds = 600
      }
    }
  }
}

Oats87 commented 1 year ago

Did some digging into this and it looks like what is happening is terraform is clearing the finalizers on the provisioning.cattle.io object. Unfortunately, our generating controller will tell run an empty apply (deletion) if this is the case: https://github.com/rancher/rancher/blob/a05de31fccb10059447c169f28dcc2068982a6f0/pkg/controllers/provisioningv2/provisioningcluster/controller.go#L289-L292

This is a bug caused by problems in multiple components and while we can resolve it in the codebase for Rancher, I have not deduced a good workaround for this issue at this point. This is likely going to cause problems in other parts of the provider as well, for example, during deletion I would expect that wiping finalizers can lead to orphaned objects.

a-blender commented 1 year ago

@Oats87 Thank you for looking into this. I will investigate this issue

riuvshyn commented 1 year ago

As a workaround we are creating cluster object with kubectl provider applying it as k8s object that was working pretty stable so far but it is hard to maintain it this way so looking forward to be able to use terraform native cluster resource 🙏🏽

Sahota1225 commented 1 year ago

Moving the issue back to Q3

snasovich commented 1 year ago

Just to expand on the reasoning for the move to Q3, the issue should no longer be reproducible in about-to-be-released Rancher 2.7.5 due to https://github.com/rancher/rancher/issues/41887 fixed there.

This issue is now specifically to fix TFP side to no longer clear finalizers as it's still not ideal it does that. The priority is much lower though given this "data loss" issue should no longer be happening on Rancher 2.7.5+.

riuvshyn commented 12 months ago

I can confirm that I can not reproduce it anymore on 2.7.5! 🥳 🥳 🥳 🥳 🥳 cc @snasovich @jakefhyde @Oats87 Thank you very much!

snasovich commented 12 months ago

@riuvshyn , thank you for confirming this. We will however keep this issue open as we want to address TF provider removing finalizers as well.

rancher / terraform-provider-rancher2

[BUG] Occasionally RKE2 cluster gets destroyed after cluster configuration is changed using terraform provider #993