rancher / rke2

https://docs.rke2.io/
Apache License 2.0

Removing snapshot using rke2 cli, only removes them on one node. #6283

Closed olivierHa closed 2 months ago

olivierHa commented 2 months ago

Environmental Info:

RKE2 Version: 
rke2 version v1.30.0+rke2r1 (60e06c4dbccff996f717af8f4c532971f57264b4)
go version go1.22.2 X:boringcrypto

Node(s) CPU architecture, OS, and Version: 
Linux test-u01 6.1.0-21-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.90-1 (2024-05-03) x86_64 GNU/Linux

Cluster Configuration:

3 servers / 3 Agents

Describe the bug:

The "rke2 snapshot command" doesn't let us remove snapshots on all nodes. The documentation about pruning snapshots isn't up to date.

Steps To Reproduce:

On the first node:

ls -l /var/lib/rancher/rke2/server/db/snapshots | wc -l
57

Then:

rke2 etcd-snapshot prune --etcd-snapshot-retention 5 --name etcd-snapshot

This runs without issues, but it seems to remove snapshots on only one node:

[...]
INFO[0008] Snapshot etcd-snapshot-test-u01.dc01-1720324800 deleted. 
INFO[0008] Snapshot etcd-snapshot-test-u01.dc01-1720303201 deleted. 
INFO[0008] Snapshot etcd-snapshot-test-u01.dc01-1720281600 deleted. 

On another etcd node, the snapshot directory is full :

ls -l /var/lib/rancher/rke2/server/db/snapshots | wc -l
57

On the other nodes, running the previous command does nothing, and listing snapshots shows only the snapshots from the first node.

Could you clarify in the documentation which command is the correct one to prune etcd snapshots on all etcd nodes?

Expected behavior:

Snapshots are pruned on all nodes, or at least running the same command on every node should work.

Actual behavior:

Snapshots are pruned on only one node.

Additional context / logs:

The documentation here ( https://docs.rke2.io/backup_restore#prune-snapshots ) tells us to run the following command :

rke2 etcd-snapshot prune --snapshot-retention <NUM-OF-SNAPSHOTS-TO-RETAIN>

When running with a value, I got the following :

Incorrect Usage: Cannot use two forms of the same flag: etcd-snapshot-retention snapshot-retention

NAME:
   rke2 etcd-snapshot prune - Remove snapshots that match the name prefix that exceed the configured retention count

USAGE:
   rke2 etcd-snapshot prune [command options] [arguments...]

OPTIONS:
   --debug                                                      (logging) Turn on debug logs [$RKE2_DEBUG]
   --config FILE, -c FILE                                       (config) Load configuration from FILE (default: "/etc/rancher/rke2/config.yaml") [$RKE2_CONFIG_FILE]
   --node-name value                                            (agent/node) Node name [$RKE2_NODE_NAME]
   --data-dir value, -d value                                   (data) Folder to hold state (default: "/var/lib/rancher/rke2")
   --token value, -t value                                      (cluster) Shared secret used to join a server or agent to a cluster [$RKE2_TOKEN]
   --server value, -s value                                     (cluster) Server to connect to (default: "https://127.0.0.1:9345") [$RKE2_URL]
   --dir value, --etcd-snapshot-dir value                       (db) Directory to save etcd on-demand snapshot. (default: ${data-dir}/db/snapshots)
   --name value                                                 (db) Set the base name of the etcd on-demand snapshot (appended with UNIX timestamp). (default: "on-demand")
   --snapshot-compress, --etcd-snapshot-compress                (db) Compress etcd snapshot
   --snapshot-retention value, --etcd-snapshot-retention value  (db) Number of snapshots to retain. (default: 5)
   --s3, --etcd-s3                                              (db) Enable backup to S3
   --s3-endpoint value, --etcd-s3-endpoint value                (db) S3 endpoint url (default: "s3.amazonaws.com")
   --s3-endpoint-ca value, --etcd-s3-endpoint-ca value          (db) S3 custom CA cert to connect to S3 endpoint
   --s3-skip-ssl-verify, --etcd-s3-skip-ssl-verify              (db) Disables S3 SSL certificate validation
   --s3-access-key value, --etcd-s3-access-key value            (db) S3 access key [$AWS_ACCESS_KEY_ID]
   --s3-secret-key value, --etcd-s3-secret-key value            (db) S3 secret key [$AWS_SECRET_ACCESS_KEY]
   --s3-bucket value, --etcd-s3-bucket value                    (db) S3 bucket name
   --s3-region value, --etcd-s3-region value                    (db) S3 region / bucket location (optional) (default: "us-east-1")
   --s3-folder value, --etcd-s3-folder value                    (db) S3 folder
   --s3-insecure, --etcd-s3-insecure                            (db) Disables S3 over HTTPS
   --s3-timeout value, --etcd-s3-timeout value                  (db) S3 timeout (default: 5m0s)

FATA[0000] Cannot use two forms of the same flag: etcd-snapshot-retention snapshot-retention 

At the very least, the documentation is not up to date.
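The help text above describes prune as removing snapshots that match the name prefix and exceed the retention count. To preview what that could mean on a given node without deleting anything, here is a rough dry-run sketch. It assumes snapshot file names end in a UNIX timestamp (as in the log lines above) and that the newest ones are kept; `prune_candidates` is a hypothetical helper, and RKE2's own selection logic may differ.

```shell
#!/bin/sh
# List snapshot files that would exceed a retention count.
# Nothing is deleted; this only prints candidate file names.
# Usage: prune_candidates DIR PREFIX KEEP
prune_candidates() {
  dir=$1
  prefix=$2
  keep=$3
  for f in "$dir"/"$prefix"-*; do
    [ -e "$f" ] || continue          # glob matched nothing
    name=${f##*/}
    # Emit "<trailing timestamp> <file name>", e.g.
    # "1720324800 etcd-snapshot-test-u01.dc01-1720324800"
    printf '%s %s\n' "${name##*-}" "$name"
  done | sort -rn | awk -v keep="$keep" 'NR > keep { print $2 }'
}

# Example: which files fall outside "keep the 5 newest" on this node?
prune_candidates /var/lib/rancher/rke2/server/db/snapshots etcd-snapshot 5
```

Comparing this list before and after a real prune on each node is one way to confirm which node the command actually acted on.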

brandond commented 2 months ago

This runs without issues, but it seems to remove only the snapshot on one node only :
On another etcd node, the snapshot directory is full :

Snapshots are stored locally on each node. Pruning on one node only removes snapshots from that node. What did you see in the docs that made you expect that running the prune command on one node would also have any effect on other nodes?
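Because snapshots are stored locally on each node, pruning everywhere means running the command once per server node. A minimal sketch of that, assuming SSH access to every server and the hypothetical host names test-u02 and test-u03 (only test-u01 appears in this issue). The flags are the ones used in the report above. By default it is a dry run that only prints the commands.

```shell
#!/bin/sh
# Hypothetical server host names; replace with your own etcd/server nodes.
SERVERS="test-u01 test-u02 test-u03"
RETENTION=5

# Build the prune command once so every node runs the same thing.
prune_cmd() {
  printf 'rke2 etcd-snapshot prune --name etcd-snapshot --etcd-snapshot-retention %s' "$1"
}

# With DRY_RUN=1 (the default) the commands are printed, not executed.
# Set DRY_RUN=0 to actually run them over SSH.
prune_all() {
  for host in $SERVERS; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "ssh $host $(prune_cmd "$RETENTION")"
    else
      ssh "$host" "$(prune_cmd "$RETENTION")"
    fi
  done
}

prune_all
```

The same loop could be replaced by any configuration-management tool; the point is only that each node prunes its own snapshot directory.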

the documentation here tells us to run the following command : rke2 etcd-snapshot prune --snapshot-retention

When running with a value, I got the following :
Incorrect Usage: Cannot use two forms of the same flag: etcd-snapshot-retention snapshot-retention
--snapshot-retention value, --etcd-snapshot-retention value (db) Number of snapshots to retain. (default: 5)

This error suggests that you've already got etcd-snapshot-retention set in your config file. If you have it set in the config file AND on the CLI, but using different forms, the go CLI framework will raise the error you're seeing.
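One way to check for this is to look at which form of the key the config file uses and pass the same form on the command line. A small sketch, assuming the default config path shown in the help output and a simple top-level `key: value` line; `retention_key_in_config` is a hypothetical helper, not part of rke2.

```shell
#!/bin/sh
# Default config path, taken from the help output above.
CONFIG="${RKE2_CONFIG_FILE:-/etc/rancher/rke2/config.yaml}"

# Print which retention key the config file sets, if any:
# "etcd-snapshot-retention", "snapshot-retention", or nothing.
retention_key_in_config() {
  [ -r "$1" ] || return 0
  if grep -Eq '^[[:space:]]*etcd-snapshot-retention[[:space:]]*:' "$1"; then
    echo etcd-snapshot-retention
  elif grep -Eq '^[[:space:]]*snapshot-retention[[:space:]]*:' "$1"; then
    echo snapshot-retention
  fi
}

key=$(retention_key_in_config "$CONFIG")
if [ -n "$key" ]; then
  echo "config sets $key; use --$key on the command line to avoid mixing forms"
else
  echo "no retention key found in $CONFIG"
fi
```

In the report above, the config file presumably sets etcd-snapshot-retention, which would explain why `--etcd-snapshot-retention 5` worked while `--snapshot-retention` failed.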