k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

`k3s etcd-snapshot` commands run against server specified in config file, instead of local server #10513

Closed: brandond closed this issue 1 month ago

brandond commented 1 month ago

Environmental Info: K3s Version: v1.30.2+k3s2

Node(s) CPU architecture, OS, and Version: n/a

Cluster Configuration: Any cluster using embedded etcd with more than one server

Describe the bug: This is a regression introduced by

When running k3s etcd-snapshot commands, the server flag defaults to the local server address, so etcd snapshots are created/listed/deleted on the local node. However, if the local server was joined to a cluster by specifying a server in the config file, the etcd-snapshot commands are executed against THAT server instead of the local server.
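
As a minimal illustration of the behavior (the config values and addresses are taken from the logs further down in this issue):

```
# /etc/rancher/k3s/config.yaml on the joining server (node 2)
server: https://172.17.0.8:6443   # node 1's supervisor address
token: token

# run locally on node 2:
k3s etcd-snapshot save
# the request follows the server: value from config.yaml, so the
# snapshot ends up being created on node 1 rather than on node 2
```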

This was reported in https://github.com/rancher/rke2/discussions/6284, but it took me a moment to realize what the user meant - I thought they were expecting the snapshot commands to be able to delete snapshots taken by other nodes (which is kind of what this is actually doing).

This is also likely the root cause of the multiple concurrent snapshot requests in https://github.com/k3s-io/k3s/issues/10371 - Rancher's snapshot save commands were all being sent to the init node, instead of running locally on the individual servers.

Steps To Reproduce:

  1. Start a server with embedded etcd
  2. Start a second server, with the server: address of the first node specified in the config file.
  3. Take a snapshot on the second server (see the command sketch after this list)
  4. Note that the snapshot is actually taken on the first server

Expected behavior: etcd-snapshot commands work against the local server by default, even when a server address is present in the config file

Actual behavior: As described above

Additional context / logs:

root@systemd-node-2:/# kubectl get node -o wide
NAME             STATUS   ROLES                       AGE     VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
systemd-node-1   Ready    control-plane,etcd,master   5m28s   v1.30.2+k3s2   172.17.0.8    <none>        openSUSE Leap 15.4   6.6.0-1001-aws   containerd://1.7.17-k3s1
systemd-node-2   Ready    control-plane,etcd,master   5s      v1.30.2+k3s2   172.17.0.9    <none>        openSUSE Leap 15.4   6.6.0-1001-aws   containerd://1.7.17-k3s1

root@systemd-node-2:/# k3s etcd-snapshot save
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0000] Snapshot on-demand-systemd-node-1-1720916780 saved.

root@systemd-node-2:/# k3s etcd-snapshot list
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
Name                                Location                                                                            Size    Created
on-demand-systemd-node-1-1720916780 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-systemd-node-1-1720916780 3588128 2024-07-14T00:26:20Z

root@systemd-node-2:/# k3s etcd-snapshot save --help | grep server
   --token value, -t value                                      (cluster) Shared secret used to join a server or agent to a cluster [$K3S_TOKEN]
   --server value, -s value                                     (cluster) Server to connect to (default: "https://127.0.0.1:6443") [$K3S_URL]

root@systemd-node-2:/# cat /etc/rancher/k3s/config.yaml
server: https://172.17.0.8:6443
token: token

root@systemd-node-2:/# k3s etcd-snapshot save --server https://localhost:6443
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
INFO[0000] Snapshot on-demand-systemd-node-2-1720916809 saved.

root@systemd-node-2:/# k3s etcd-snapshot list --server https://localhost:6443
WARN[0000] Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation.
Name                                Location                                                                            Size    Created
on-demand-systemd-node-2-1720916809 file:///var/lib/rancher/k3s/server/db/snapshots/on-demand-systemd-node-2-1720916809 3584032 2024-07-14T00:26:49Z
rancher-max commented 1 month ago

Some additional testing considerations, beyond what has already been listed:

  1. Check the snapshot file location itself on both nodes (see the sketch after this list)
  2. Specify the --server arg in the command as the other node. For example, using the examples above, from node-2, run: k3s etcd-snapshot save --server https://172.17.0.8:6443
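
For the first point, a quick way to compare is to look at the default snapshot directory shown in the logs above on each node:

```
# run on each server node and compare
ls -l /var/lib/rancher/k3s/server/db/snapshots/
# the on-demand snapshot file should only exist on the node the command was run on
```
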
brandond commented 1 month ago

I am fixing this by changing the server/token flags to etcd-server/etcd-token. The use case for these flags was primarily for folks who, for some reason, changed the bind address or supervisor port and needed to override the server address to match. We weren't REALLY expecting folks to run the command against other nodes.
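
With that change, the commands default to the local server and targeting a different supervisor has to be spelled out explicitly; roughly (the address and token are placeholders):

```
# default: operates on the local server, regardless of server: in config.yaml
k3s etcd-snapshot save

# explicit override with the renamed flags
k3s etcd-snapshot save --etcd-server https://<other-server>:6443 --etcd-token <token>
```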

fmoral2 commented 1 month ago

Validated on Version:

k3s version v1.30.2+k3s-37830fe1 (37830fe1)

Environment Details

Infrastructure Cloud EC2 instance

Node(s) CPU architecture, OS, and Version: Ubuntu, AMD

Cluster Configuration: 3 server nodes, 1 agent node

Steps to validate the fix

  1. Install k3s with embedded etcd
  2. Take an etcd snapshot on the second server
  3. Validate that it is taken in the correct place (the second server)

Reproduction Issue:

```
k3s version v1.30.2+k3s-58ab2592 (58ab2592)

Server 1 - ip 172-test1
Server 2 - ip 172-test2
Server 3 - ip 172-test3

on Server 2
k3s etcd-snapshot save
INFO[0000] Snapshot on-demand-ip-172-test1.us-east-2.compute.internal-1721316392 saved.
- Saved on first Server

k3s etcd-snapshot list
Name                                                           Location                                                    Size  Created
on-demand-ip-172-test1.us-east-2.compute.internal-1721316392  file://{redacted}ip-172-test1.us-east-2.compute.internal-        2024-07-18T15:26:32Z
- List shows the snapshot is saved on the first server
```

Validation Results:
```
k3s version v1.30.2+k3s-37830fe1 (37830fe1)

Server 1 - ip 172-test1
Server 2 - ip 172-test2
Server 3 - ip 172-test3

- on Server 2
$ sudo k3s etcd-snapshot save
INFO[0001] Snapshot on-demand-ip-172-test2.us-east-2.compute.internal-1721321853 saved.

k3s etcd-snapshot list
Name                                                           Location                                                    Size  Created
on-demand-ip-172-test2.us-east-2.compute.internal-1721321853  file://{redacted}ip-172-test2.us-east-2.compute.internal-        2024-07-18T15:26:32Z

$ sudo k3s etcd-snapshot save --etcd-server https://localhost:6443
INFO[0000] Snapshot on-demand-ip-172-test2.us-east-2.compute.internal-1721322104 saved.

Snapshot pointing to first server also works
~$ sudo k3s etcd-snapshot save --etcd-server https://172-test1:6443
INFO[0000] Snapshot on-demand-ip-172-test1.us-east-2.compute.internal-1721322170 saved.

- on server 1
$ sudo k3s etcd-snapshot save
INFO[0003] Snapshot on-demand-ip-172-test1.us-east-2.compute.internal-1721321767 saved.

k3s etcd-snapshot list
Name                                                           Location                                                    Size  Created
on-demand-ip-172-test1.us-east-2.compute.internal-1721321853  file://{redacted}ip-172-test2.us-east-2.compute.internal-        2024-07-18T15:26:32Z
```