k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

`etcdsnapshotfile` resource from deleted node stuck on finalizer when deleted node was running `managed-etcd-snapshots-controller` #8678

Open brandond opened 1 year ago

brandond commented 1 year ago

from @aganesh-suse

Steps to reproduce:

  1. Start a 3-server etcd cluster
  2. Take 1 snapshot on each node
  3. Stop k3s on one of the servers, and delete it from the cluster using kubectl.
  4. Remove the database files from the stopped server
  5. Restart the stopped node with --node-name overridden to a new unique value
  6. Take 1 snapshot on the new node
  7. Run kubectl get etcdsnapshotfile

Note that the snapshot taken before the node-name was changed is not updated to reflect the fact that it is on the new node. It still appears to be on the deleted node.

brandond commented 1 year ago

I am unable to reproduce this. I followed the above procedure and I see that the local snapshots show the new node.

root@k3s-server-3:/# cat /etc/rancher/k3s/config.yaml
node-name: k3s-server-4

root@k3s-server-3:/# kubectl get node -o wide
NAME           STATUS   ROLES                       AGE     VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
k3s-server-1   Ready    control-plane,etcd,master   9m42s   v1.28.2+k3s-3db1d332   172.17.0.3    <none>        Ubuntu 22.04.3 LTS   5.19.0-1019-aws   containerd://1.7.7-k3s1
k3s-server-2   Ready    control-plane,etcd,master   9m26s   v1.28.2+k3s-3db1d332   172.17.0.4    <none>        Ubuntu 22.04.3 LTS   5.19.0-1019-aws   containerd://1.7.7-k3s1
k3s-server-4   Ready    control-plane,etcd,master   4m56s   v1.28.2+k3s-3db1d332   172.17.0.5    <none>        Ubuntu 22.04.3 LTS   5.19.0-1019-aws   containerd://1.7.7-k3s1

root@k3s-server-3:/# kubectl get etcdsnapshotfile
NAME                                             SNAPSHOTNAME                        NODE           LOCATION                                                                            SIZE      CREATIONTIME
local-on-demand-k3s-server-3-1697657808-80c501   on-demand-k3s-server-3-1697657808   k3s-server-4   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697657808   2355232   2023-10-18T19:36:48Z
local-on-demand-k3s-server-3-1697657865-732f1f   on-demand-k3s-server-3-1697657865   k3s-server-4   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697657865   2699296   2023-10-18T19:37:45Z
local-on-demand-k3s-server-3-1697658041-5cd74f   on-demand-k3s-server-3-1697658041   k3s-server-4   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697658041   3354656   2023-10-18T19:40:41Z
local-on-demand-k3s-server-3-1697658179-86ca69   on-demand-k3s-server-3-1697658179   k3s-server-4   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697658179   3727392   2023-10-18T19:42:59Z
local-on-demand-k3s-server-4-1697658218-0b3ff6   on-demand-k3s-server-4-1697658218   k3s-server-4   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-4-1697658218   3854368   2023-10-18T19:43:38Z

brandond commented 1 year ago

I rejoined the node a second time and see everything getting updated properly then as well:

root@k3s-server-3:/# cat /etc/rancher/k3s/config.yaml
node-name: k3s-server-5

root@k3s-server-3:/# kubectl get node -o wide
NAME           STATUS   ROLES                       AGE   VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
k3s-server-1   Ready    control-plane,etcd,master   16m   v1.28.2+k3s-3db1d332   172.17.0.3    <none>        Ubuntu 22.04.3 LTS   5.19.0-1019-aws   containerd://1.7.7-k3s1
k3s-server-2   Ready    control-plane,etcd,master   16m   v1.28.2+k3s-3db1d332   172.17.0.4    <none>        Ubuntu 22.04.3 LTS   5.19.0-1019-aws   containerd://1.7.7-k3s1
k3s-server-5   Ready    control-plane,etcd,master   61s   v1.28.2+k3s-3db1d332   172.17.0.5    <none>        Ubuntu 22.04.3 LTS   5.19.0-1019-aws   containerd://1.7.7-k3s1

root@k3s-server-3:/# k3s etcd-snapshot list
Name                              Location                                                                          Size    Created
on-demand-k3s-server-3-1697657808 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697657808 2355232 2023-10-18T19:36:48Z
on-demand-k3s-server-3-1697657865 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697657865 2699296 2023-10-18T19:37:45Z
on-demand-k3s-server-3-1697658041 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697658041 3354656 2023-10-18T19:40:41Z
on-demand-k3s-server-3-1697658179 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697658179 3727392 2023-10-18T19:42:59Z
on-demand-k3s-server-4-1697658218 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-4-1697658218 3854368 2023-10-18T19:43:38Z
on-demand-k3s-server-5-1697658713 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-5-1697658713 2588704 2023-10-18T19:51:53Z

root@k3s-server-3:/# kubectl get etcdsnapshotfile
NAME                                             SNAPSHOTNAME                        NODE           LOCATION                                                                            SIZE      CREATIONTIME
local-on-demand-k3s-server-3-1697657808-aed2c6   on-demand-k3s-server-3-1697657808   k3s-server-5   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697657808   2355232   2023-10-18T19:36:48Z
local-on-demand-k3s-server-3-1697657865-d882e4   on-demand-k3s-server-3-1697657865   k3s-server-5   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697657865   2699296   2023-10-18T19:37:45Z
local-on-demand-k3s-server-3-1697658041-19ec7a   on-demand-k3s-server-3-1697658041   k3s-server-5   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697658041   3354656   2023-10-18T19:40:41Z
local-on-demand-k3s-server-3-1697658179-4ed454   on-demand-k3s-server-3-1697658179   k3s-server-5   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-3-1697658179   3727392   2023-10-18T19:42:59Z
local-on-demand-k3s-server-4-1697658218-99c6ff   on-demand-k3s-server-4-1697658218   k3s-server-5   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-4-1697658218   3854368   2023-10-18T19:43:38Z
local-on-demand-k3s-server-5-1697658713-7e1f33   on-demand-k3s-server-5-1697658713   k3s-server-5   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-k3s-server-5-1697658713   2588704   2023-10-18T19:51:53Z

aganesh-suse commented 1 year ago

Environment: Ubuntu 22.04, HA setup with 3 servers and 1 agent. Recording the test results I shared offline with Brad. P.S.: I have 2 setups; one works fine and the other shows this issue.

1) Config file:

 $ cat /etc/rancher/k3s/config.yaml 
token: secret
node-name: "server1"
etcd-snapshot-retention: 2
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: xxxx
etcd-s3-secret-key: xxxx
etcd-s3-bucket: bucket
etcd-s3-folder: folder
etcd-s3-region: us-east-2

cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

2) Install k3s:

curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='3db1d33282765b8fad8ff0a5ec763a4d2487ee9f' sh -s - server

3) The cron takes a snapshot every minute. Sleep for 2 to 3 minutes, then update the node name with a suffix on all 4 nodes and restart the services. Ex: server1-8581

4) Repeat step 3, adding another suffix to the node name. Ex: server1-17695-8581

5) Save 5 on-demand etcd snapshots, prune with a retention of 3, and delete 1 snapshot:

sudo k3s etcd-snapshot save
sudo k3s etcd-snapshot prune --snapshot-retention 3
sudo k3s etcd-snapshot delete <on demand snapshot>

Outputs:

$ sudo k3s etcd-snapshot save --debug
WARN[0000] Unknown flag --token found in config.yaml, skipping 
WARN[0000] Unknown flag --etcd-snapshot-retention found in config.yaml, skipping 
WARN[0000] Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping 
WARN[0000] Unknown flag --cluster-init found in config.yaml, skipping 
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping 
WARN[0000] Unknown flag --node-external-ip found in config.yaml, skipping 
WARN[0000] Unknown flag --node-label found in config.yaml, skipping 
WARN[0000] Unknown flag --server found in config.yaml, skipping 
DEBU[0000] Attempting to retrieve extra metadata from k3s-etcd-snapshot-extra-metadata ConfigMap 
DEBU[0000] Error encountered attempting to retrieve extra metadata from k3s-etcd-snapshot-extra-metadata ConfigMap, error: configmaps "k3s-etcd-snapshot-extra-metadata" not found 
INFO[0000] Saving etcd snapshot to /var/lib/rancher/k3s/server/db/snapshots/on-demand-server1-17695-8581-1697670509 
{"level":"info","ts":"2023-10-18T23:08:29.188729Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-server1-17695-8581-1697670509.part"}
{"level":"info","ts":"2023-10-18T23:08:29.191305Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2023-10-18T23:08:29.191368Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
{"level":"info","ts":"2023-10-18T23:08:29.33019Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2023-10-18T23:08:29.367338Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"12 MB","took":"now"}
{"level":"info","ts":"2023-10-18T23:08:29.367585Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/k3s/server/db/snapshots/on-demand-server1-17695-8581-1697670509"}
INFO[0000] Checking if S3 bucket sonobuoy-results exists 
INFO[0000] S3 bucket sonobuoy-results exists            
INFO[0000] Saving etcd snapshot on-demand-server1-17695-8581-1697670509 to S3 
INFO[0000] Uploading snapshot to s3://sonobuoy-results//var/lib/rancher/k3s/server/db/snapshots/on-demand-server1-17695-8581-1697670509 
INFO[0000] Uploaded snapshot metadata s3://sonobuoy-results//var/lib/rancher/k3s/server/db/.metadata/on-demand-server1-17695-8581-1697670509 
INFO[0000] S3 upload complete for on-demand-server1-17695-8581-1697670509 
INFO[0000] Reconciling ETCDSnapshotFile resources       
DEBU[0000] Found snapshotFile for etcd-snapshot-server1-17695-8581-1697670485 with key local-etcd-snapshot-server1-17695-8581-1697670485 
DEBU[0000] Found snapshotFile for on-demand-server1-17695-8581-1697667654 with key local-on-demand-server1-17695-8581-1697667654 
DEBU[0000] Found snapshotFile for on-demand-server1-17695-8581-1697667659 with key local-on-demand-server1-17695-8581-1697667659 
DEBU[0000] Found snapshotFile for on-demand-server1-17695-8581-1697670509 with key local-on-demand-server1-17695-8581-1697670509 
DEBU[0000] Found snapshotFile for etcd-snapshot-server1-17695-8581-1697670423 with key s3-etcd-snapshot-server1-17695-8581-1697670423 
DEBU[0000] Found snapshotFile for etcd-snapshot-server1-17695-8581-1697670423 with key local-etcd-snapshot-server1-17695-8581-1697670423 
DEBU[0000] Found snapshotFile for etcd-snapshot-server1-17695-8581-1697670485 with key s3-etcd-snapshot-server1-17695-8581-1697670485 
DEBU[0000] Found snapshotFile for on-demand-server1-17695-8581-1697667654 with key s3-on-demand-server1-17695-8581-1697667654 
DEBU[0000] Found snapshotFile for on-demand-server1-17695-8581-1697667659 with key s3-on-demand-server1-17695-8581-1697667659 
DEBU[0000] Found snapshotFile for on-demand-server1-17695-8581-1697670509 with key s3-on-demand-server1-17695-8581-1697670509 
DEBU[0000] Found ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697667303 with key local-etcd-snapshot-server1-17695-8581-1697667303 
DEBU[0000] Key local-etcd-snapshot-server1-17695-8581-1697667303 not found in snapshotFile list 
INFO[0000] Deleting ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697667303 
DEBU[0000] Found ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697667362 with key local-etcd-snapshot-server1-17695-8581-1697667362 
DEBU[0000] Key local-etcd-snapshot-server1-17695-8581-1697667362 not found in snapshotFile list 
INFO[0000] Deleting ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697667362 
DEBU[0000] Found ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697670423 with key local-etcd-snapshot-server1-17695-8581-1697670423 
DEBU[0000] Found ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697670485 with key local-etcd-snapshot-server1-17695-8581-1697670485 
DEBU[0000] Found ETCDSnapshotFile for on-demand-server1-17695-8581-1697667654 with key local-on-demand-server1-17695-8581-1697667654 
DEBU[0000] Found ETCDSnapshotFile for on-demand-server1-17695-8581-1697667659 with key local-on-demand-server1-17695-8581-1697667659 
DEBU[0000] Found ETCDSnapshotFile for on-demand-server1-17695-8581-1697670509 with key local-on-demand-server1-17695-8581-1697670509 
DEBU[0000] Found ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697667303 with key s3-etcd-snapshot-server1-17695-8581-1697667303 
DEBU[0000] Key s3-etcd-snapshot-server1-17695-8581-1697667303 not found in snapshotFile list 
INFO[0000] Deleting ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697667303 
DEBU[0000] Found ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697667362 with key s3-etcd-snapshot-server1-17695-8581-1697667362 
DEBU[0000] Key s3-etcd-snapshot-server1-17695-8581-1697667362 not found in snapshotFile list 
INFO[0000] Deleting ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697667362 
DEBU[0000] Found ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697670423 with key s3-etcd-snapshot-server1-17695-8581-1697670423 
DEBU[0000] Found ETCDSnapshotFile for etcd-snapshot-server1-17695-8581-1697670485 with key s3-etcd-snapshot-server1-17695-8581-1697670485 
DEBU[0000] Found ETCDSnapshotFile for on-demand-server1-17695-8581-1697667654 with key s3-on-demand-server1-17695-8581-1697667654 
DEBU[0000] Found ETCDSnapshotFile for on-demand-server1-17695-8581-1697667659 with key s3-on-demand-server1-17695-8581-1697667659 
DEBU[0000] Found ETCDSnapshotFile for on-demand-server1-17695-8581-1697670509 with key s3-on-demand-server1-17695-8581-1697670509 
INFO[0000] Reconciliation of ETCDSnapshotFile resources complete

Note the node name "server1-8581" (the previous node name) recorded in the output below:

$ kubectl get etcdsnapshotfile
NAME                                                       SNAPSHOTNAME                                  NODE                 LOCATION                                                                                                         SIZE       CREATIONTIME
local-etcd-snapshot-server1-17695-8581-1697667303-96f01c   etcd-snapshot-server1-17695-8581-1697667303   server1-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-server1-17695-8581-1697667303                      7155744    2023-10-18T22:15:03Z
local-etcd-snapshot-server1-17695-8581-1697667362-2244bc   etcd-snapshot-server1-17695-8581-1697667362   server1-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-server1-17695-8581-1697667362                      7557152    2023-10-18T22:16:02Z
local-etcd-snapshot-server1-17695-8581-1697670423-aa7833   etcd-snapshot-server1-17695-8581-1697670423   server1-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-server1-17695-8581-1697670423                      12505120   2023-10-18T23:07:03Z
local-etcd-snapshot-server1-17695-8581-1697670485-2a4faf   etcd-snapshot-server1-17695-8581-1697670485   server1-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-server1-17695-8581-1697670485                      12505120   2023-10-18T23:08:05Z
local-etcd-snapshot-server1-8581-1697667062-1f9348         etcd-snapshot-server1-8581-1697667062         server1-8581         file:///var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-server1-8581-1697667062                            5410848    2023-10-18T22:11:02Z
local-etcd-snapshot-server1-8581-1697667243-888329         etcd-snapshot-server1-8581-1697667243         server1-8581         file:///var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-server1-8581-1697667243                            6762528    2023-10-18T22:14:03Z
local-on-demand-server1-17695-8581-1697667654-951f0b       on-demand-server1-17695-8581-1697667654       server1-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server1-17695-8581-1697667654                          10063904   2023-10-18T22:20:54Z
local-on-demand-server1-17695-8581-1697667659-36d75d       on-demand-server1-17695-8581-1697667659       server1-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server1-17695-8581-1697667659                          10113056   2023-10-18T22:20:59Z
local-on-demand-server1-17695-8581-1697670509-f1c5c5       on-demand-server1-17695-8581-1697670509       server1-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server1-17695-8581-1697670509                          12505120   2023-10-18T23:08:29Z
local-on-demand-server2-17695-8581-1697667662-7ca7eb       on-demand-server2-17695-8581-1697667662       server2-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server2-17695-8581-1697667662                          10231840   2023-10-18T22:21:02Z
local-on-demand-server2-17695-8581-1697669375-678e95       on-demand-server2-17695-8581-1697669375       server2-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server2-17695-8581-1697669375                          12689440   2023-10-18T22:49:35Z
local-on-demand-server3-17695-8581-1697667640-fd4d00       on-demand-server3-17695-8581-1697667640       server3-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server3-17695-8581-1697667640                          9834528    2023-10-18T22:20:40Z
local-on-demand-server3-17695-8581-1697667646-82845d       on-demand-server3-17695-8581-1697667646       server3-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server3-17695-8581-1697667646                          10002464   2023-10-18T22:20:46Z
local-on-demand-server3-17695-8581-1697667652-a77e6d       on-demand-server3-17695-8581-1697667652       server3-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server3-17695-8581-1697667652                          10068k     2023-10-18T22:20:52Z
local-on-demand-server3-17695-8581-1697667658-2cf5ad       on-demand-server3-17695-8581-1697667658       server3-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server3-17695-8581-1697667658                          10186784   2023-10-18T22:20:58Z
local-on-demand-server3-17695-8581-1697667663-0a4f57       on-demand-server3-17695-8581-1697667663       server3-17695-8581   file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-server3-17695-8581-1697667663                          10235936   2023-10-18T22:21:03Z
s3-etcd-snapshot-server1-17695-8581-1697667303-c76d87      etcd-snapshot-server1-17695-8581-1697667303   server1-17695-8581   s3://sonobuoy-results/arch-k3ssnap/commit-setup/server1-17695-8581/etcd-snapshot-server1-17695-8581-1697667303   7155744    2023-10-18T22:15:03Z
s3-etcd-snapshot-server1-17695-8581-1697667362-9303ba      etcd-snapshot-server1-17695-8581-1697667362   server1-17695-8581   s3://sonobuoy-results/arch-k3ssnap/commit-setup/server1-17695-8581/etcd-snapshot-server1-17695-8581-1697667362   7557152    2023-10-18T22:16:02Z
s3-etcd-snapshot-server1-17695-8581-1697670423-70d6fd      etcd-snapshot-server1-17695-8581-1697670423   server1-17695-8581   s3://sonobuoy-results/arch-k3ssnap/commit-setup/server1-17695-8581/etcd-snapshot-server1-17695-8581-1697670423   12505120   2023-10-18T23:07:03Z
s3-etcd-snapshot-server1-17695-8581-1697670485-db03ba      etcd-snapshot-server1-17695-8581-1697670485   server1-17695-8581   s3://sonobuoy-results/arch-k3ssnap/commit-setup/server1-17695-8581/etcd-snapshot-server1-17695-8581-1697670485   12505120   2023-10-18T23:08:05Z
s3-on-demand-server1-17695-8581-1697667654-098090          on-demand-server1-17695-8581-1697667654       server1-17695-8581   s3://sonobuoy-results/arch-k3ssnap/commit-setup/server1-17695-8581/on-demand-server1-17695-8581-1697667654       10063904   2023-10-18T22:20:54Z
s3-on-demand-server1-17695-8581-1697667659-737d88          on-demand-server1-17695-8581-1697667659       server1-17695-8581   s3://sonobuoy-results/arch-k3ssnap/commit-setup/server1-17695-8581/on-demand-server1-17695-8581-1697667659       10113056   2023-10-18T22:20:59Z
s3-on-demand-server1-17695-8581-1697670509-8e79a8          on-demand-server1-17695-8581-1697670509       server1-17695-8581   s3://sonobuoy-results/arch-k3ssnap/commit-setup/server1-17695-8581/on-demand-server1-17695-8581-1697670509       12505120   2023-10-18T23:08:29Z

Current node names:

$ kubectl get nodes
NAME                 STATUS   ROLES                       AGE   VERSION
agent1-17695-8581    Ready    <none>                      74m   v1.28.2+k3s-3db1d332
server1-17695-8581   Ready    control-plane,etcd,master   77m   v1.28.2+k3s-3db1d332
server2-17695-8581   Ready    control-plane,etcd,master   76m   v1.28.2+k3s-3db1d332
server3-17695-8581   Ready    control-plane,etcd,master   74m   v1.28.2+k3s-3db1d332

$ sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots/
total 56360
-rw------- 1 root root 10063904 Oct 18 22:20 on-demand-server1-17695-8581-1697667654
-rw------- 1 root root 10113056 Oct 18 22:20 on-demand-server1-17695-8581-1697667659
-rw------- 1 root root 12505120 Oct 18 23:08 on-demand-server1-17695-8581-1697670509
-rw------- 1 root root 12505120 Oct 18 23:13 etcd-snapshot-server1-17695-8581-1697670783
-rw------- 1 root root 12505120 Oct 18 23:14 etcd-snapshot-server1-17695-8581-1697670842

brandond commented 1 year ago

Cleanup of snapshots from deleted nodes appears to be working as designed. However, an etcdsnapshotfile resource can get stuck on its finalizer if the snapshot controller was running on the node that was deleted.

To avoid this, stop the node for a short period of time (at least a minute, to be safe) before deleting it, so that leader-elected controllers can migrate to other nodes.
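
As a rough illustration of that ordering, the sketch below waits for a grace period and only then deletes the node object. It assumes k3s has already been stopped on the node being removed (for example with `systemctl stop k3s`, the default service name from the install script) and that the function runs on one of the remaining servers. The function name `delete_node_after_grace` and the `KUBECTL` override are hypothetical helpers introduced here, not part of k3s.

```shell
#!/bin/sh
# Hypothetical override so the sketch can be dry-run, e.g. KUBECTL=echo.
KUBECTL="${KUBECTL:-kubectl}"

# Run on a remaining server AFTER k3s has been stopped on the node being removed.
# Usage: delete_node_after_grace NODE [SECONDS]
delete_node_after_grace() {
  node="$1"
  grace="${2:-60}"   # default follows the "at least a minute" suggestion
  if [ -z "$node" ]; then
    echo "usage: delete_node_after_grace NODE [SECONDS]" >&2
    return 2
  fi
  echo "waiting ${grace}s so leader-elected controllers can move off $node" >&2
  sleep "$grace"
  $KUBECTL delete node "$node"
}
```

For example, `KUBECTL=echo delete_node_after_grace server1-8581 0` only prints the kubectl arguments instead of deleting anything. A fixed sleep is the simplest way to express the suggestion; it does not verify that the controllers have actually moved.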

root@ip-172-31-26-200:~# kubectl get node -l node-role.kubernetes.io/etcd=true
NAME                 STATUS   ROLES                       AGE   VERSION
server1-17695-8581   Ready    control-plane,etcd,master   95m   v1.28.2+k3s-3db1d332
server2-17695-8581   Ready    control-plane,etcd,master   94m   v1.28.2+k3s-3db1d332
server3-17695-8581   Ready    control-plane,etcd,master   93m   v1.28.2+k3s-3db1d332

root@ip-172-31-26-200:~# kubectl get etcdsnapshotfile -l 'etcd.k3s.cattle.io/snapshot-storage-node notin (s3,server1-17695-8581,server2-17695-8581,server3-17695-8581)'
NAME                                                 SNAPSHOTNAME                            NODE           LOCATION                                                                                SIZE      CREATIONTIME
local-etcd-snapshot-server1-8581-1697667062-1f9348   etcd-snapshot-server1-8581-1697667062   server1-8581   file:///var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-server1-8581-1697667062   5410848   2023-10-18T22:11:02Z
local-etcd-snapshot-server1-8581-1697667243-888329   etcd-snapshot-server1-8581-1697667243   server1-8581   file:///var/lib/rancher/k3s/server/db/snapshots/etcd-snapshot-server1-8581-1697667243   6762528   2023-10-18T22:14:03Z

root@ip-172-31-26-200:~# kubectl get etcdsnapshotfile local-etcd-snapshot-server1-8581-1697667062-1f9348 -o yaml | grep -C1 -E 'deletionTimestamp|finalizers'
  deletionGracePeriodSeconds: 0
  deletionTimestamp: "2023-10-18T22:20:04Z"
  finalizers:
  - wrangler.cattle.io/managed-etcd-snapshots-controller

This is a bit of a known issue with wrangler OnDelete handlers; in order to fix it we would need to add code to remove the stuck finalizer.
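
To spot affected resources, something like the following sketch can list every etcdsnapshotfile that already has a deletionTimestamp set. The function name `list_terminating_snapshots` and the `KUBECTL` override are hypothetical; the template relies on kubectl's go-template output treating a missing `deletionTimestamp` as false.

```shell
#!/bin/sh
# Hypothetical override so the sketch can be exercised without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Print the name of every etcdsnapshotfile whose deletion has been requested
# (deletionTimestamp is set) but which still exists in the API.
list_terminating_snapshots() {
  $KUBECTL get etcdsnapshotfile -o=go-template \
    --template '{{range .items}}{{if .metadata.deletionTimestamp}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}'
}
```

A resource can show up here briefly during a normal delete; it is only stuck if it is still listed well after the deletion was requested.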

As a workaround until this can be implemented, the following command can be run to manually clear the finalizers on any snapshots for deleted nodes:

for ESF in $(kubectl get etcdsnapshotfile -o=go-template --template '{{range .items}}{{.metadata.name}} {{end}}' -l 'etcd.k3s.cattle.io/snapshot-storage-node notin (s3,'$(kubectl get node -l node-role.kubernetes.io/etcd=true -o=go-template --template '{{range .items}}{{.metadata.name}},{{end}}')')'); do
  kubectl patch etcdsnapshotfile "$ESF" -p '{"metadata":{"finalizers":null}}' --type=merge;
done
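
The same workaround can also be written as a small script, which may be easier to read and review than the one-liner. This is only a sketch: the function names and the `KUBECTL` override are hypothetical, and the guard against an empty node list is an addition not present in the command above.

```shell
#!/bin/sh
# Hypothetical override so the sketch can be exercised without a cluster.
KUBECTL="${KUBECTL:-kubectl}"

# Build the label selector that excludes s3 and every node name given as an argument.
build_selector() {
  sel="s3"
  for n in "$@"; do
    sel="$sel,$n"
  done
  printf '%s notin (%s)\n' 'etcd.k3s.cattle.io/snapshot-storage-node' "$sel"
}

# Names of the current etcd nodes, one per line.
current_etcd_nodes() {
  $KUBECTL get node -l node-role.kubernetes.io/etcd=true \
    -o=go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}'
}

# Clear finalizers on etcdsnapshotfile resources stored on nodes that no longer exist.
clear_stuck_finalizers() {
  nodes=$(current_etcd_nodes)
  if [ -z "$nodes" ]; then
    # Without this guard, a failed node lookup would match every local snapshot.
    echo "no etcd nodes found; refusing to continue" >&2
    return 1
  fi
  # Word splitting of $nodes is intentional: node names contain no whitespace.
  selector=$(build_selector $nodes)
  $KUBECTL get etcdsnapshotfile -l "$selector" \
    -o=go-template --template '{{range .items}}{{.metadata.name}}{{"\n"}}{{end}}' |
  while read -r esf; do
    [ -n "$esf" ] || continue
    $KUBECTL patch etcdsnapshotfile "$esf" --type=merge -p '{"metadata":{"finalizers":null}}'
  done
}
```

After sourcing the file, calling `clear_stuck_finalizers` applies the patch. Like the one-liner, it patches every snapshot resource recorded against a node that is not currently an etcd node, whether or not it already has a deletionTimestamp.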

caroline-suse-rancher commented 10 months ago

Hey @brandond, is this a confirmed issue you're actively working on, or should it be up for grabs?

brandond commented 10 months ago

I mostly left it here just to document it as a possible problem further down the road, and demonstrate the steps to fix it. If it does become a problem and/or someone wants to fix it, it's up for grabs.