k3s-io / k3s

Lightweight Kubernetes
https://k3s.io
Apache License 2.0

Misleading default value in --etcd-snapshot-dir opt descriptions (`server`, `etcd-snapshot`) #11570

Open majabojarska opened 1 week ago

majabojarska commented 1 week ago

Environmental Info: K3s Version:

k3s version v1.31.4+k3s1 (a562d090)
go version go1.22.9

Node(s) CPU architecture, OS, and Version:

Linux [...] 6.10.2-rt14-arch1-3-rt #1 SMP PREEMPT_RT Sat, 14 Dec 2024 12:07:28 +0000 x86_64 GNU/Linux

Cluster Configuration:

Single node cluster (server+agent) - in my opinion not that relevant within the context of this report.

Describe the bug:

The default --etcd-snapshot-dir value shown in the --help output of the server and etcd-snapshot K3s CLI commands does not match the effective path created and used at runtime.

Take a look at the --help output for both commands:

k3s server --help | grep '\-dir'
   --data-dir value, -d value                 (data) Folder to hold state default /var/lib/rancher/k3s or ${HOME}/.rancher/k3s if not root [$K3S_DATA_DIR]
   --etcd-snapshot-dir value                  (db) Directory to save db snapshots. (default: ${data-dir}/db/snapshots)
# -------- SNIP --------

k3s etcd-snapshot --help | grep '\-dir'
   --data-dir value, -d value                                   (data) Folder to hold state default /var/lib/rancher/k3s or ${HOME}/.rancher/k3s if not root [$K3S_DATA_DIR]
   --dir value, --etcd-snapshot-dir value                       (db) Directory to save etcd on-demand snapshot. (default: ${data-dir}/db/snapshots)

Assuming --etcd-snapshot-dir is not provided, the effective path is actually ${data-dir}/server/db/snapshots, instead of ${data-dir}/db/snapshots (note the missing server path segment).
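To make the difference concrete, here is a minimal sketch that prints both paths for a given data dir. The function names are hypothetical helpers for illustration and are not part of k3s. The two path layouts are taken from the --help text and the observed behavior described in this report.

```shell
#!/bin/sh
# Illustrative helpers only; these function names are hypothetical, not part of k3s.

# Default advertised by --help in v1.31.4+k3s1.
documented_snapshot_dir() {
    # ${1%/} drops a single trailing slash so the joined path stays clean.
    printf '%s/db/snapshots\n' "${1%/}"
}

# Default observed at runtime in this report (note the extra "server" segment).
effective_snapshot_dir() {
    printf '%s/server/db/snapshots\n' "${1%/}"
}

data_dir=/var/lib/rancher/k3s
echo "documented: $(documented_snapshot_dir "$data_dir")"
echo "effective:  $(effective_snapshot_dir "$data_dir")"
```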

Steps To Reproduce:

To preface this section: I initially observed this issue on a different system, running NixOS. There, the K3s service was installed and managed via the k3s nixpkg, which adds a layer of abstraction between the K3s distributables and the end user. To rule out potential configuration skew, I reproduced it on Arch via the AUR, whose install process I understand better.

  1. Install v1.31.4+k3s1

  2. Add a minimal etcd snapshot configuration to the k3s server invocation in the systemd service. This is just enough to enable etcd (instead of SQLite) and get a snapshot created quickly, without flooding the storage:

    • /usr/bin/k3s server --cluster-init --etcd-snapshot-schedule-cron="* * * * *" --etcd-snapshot-retention=1
    • See the "additional context" section below for the full systemd unit file.
  3. Ensure the service is enabled, and reload the systemd daemon so the above configuration takes effect:

    sudo systemctl enable --now k3s.service
    sudo systemctl daemon-reload
    sudo systemctl status k3s.service
    
    ● k3s.service - Lightweight Kubernetes
         Loaded: loaded (/usr/lib/systemd/system/k3s.service; enabled; preset: disabled)
         Active: active (running) 
    # -------- SNIP --------
  4. Manually trigger an etcd snapshot via k3s etcd-snapshot:

    sudo k3s etcd-snapshot save
    INFO[0000] Snapshot on-demand-machine-1736613048 saved.
  5. Wait a minute for the next tick of the etcd snapshot cron schedule.

[!NOTE] The service is running as root, and therefore the effective default --data-dir directory is /var/lib/rancher/k3s.
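As a rough sketch of the rule quoted in the --data-dir help text, the default data dir can be modeled as below. The function name `default_data_dir` and its optional UID argument are hypothetical, introduced only for illustration. Treating `K3S_DATA_DIR` as an override is an assumption based on the `[$K3S_DATA_DIR]` hint in the help output, and an explicit --data-dir flag is not modeled here.

```shell
#!/bin/sh
# Hypothetical helper mirroring the --data-dir help text: an explicit
# K3S_DATA_DIR wins, then root gets /var/lib/rancher/k3s, and any other
# user gets ${HOME}/.rancher/k3s.
default_data_dir() {
    # Optional first argument overrides the UID, which makes the rule easy to try out.
    uid="${1:-$(id -u)}"
    if [ -n "${K3S_DATA_DIR:-}" ]; then
        printf '%s\n' "$K3S_DATA_DIR"
    elif [ "$uid" -eq 0 ]; then
        printf '%s\n' /var/lib/rancher/k3s
    else
        printf '%s\n' "${HOME}/.rancher/k3s"
    fi
}

default_data_dir
```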

Expected behavior:

The snapshots are saved under /var/lib/rancher/k3s/db/snapshots, since this is what k3s server --help told me.

Actual behavior:

The snapshots are saved under /var/lib/rancher/k3s/server/db/snapshots:

[root@machine k3s]# pwd
/var/lib/rancher/k3s
[root@machine k3s]# ls
agent  data  server # No db dir here
[root@machine k3s]# cd server/
[root@machine server]# ls
agent-token  cred  db  etc  kine.sock  manifests  node-token  static  tls  token
[root@machine server]# cd db/snapshots/
[root@machine snapshots]# ls
etcd-snapshot-machine-1736614203  on-demand-machine-1736613048
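The manual check above can be condensed into a small script. This is only a sketch: `report_snapshot_dirs` is a hypothetical helper name, and the two candidate paths are the documented and the observed defaults from this report.

```shell
#!/bin/sh
# Hypothetical helper: for each candidate snapshot directory under a data dir,
# print how many entries it contains, or "missing" if it does not exist.
report_snapshot_dirs() {
    data_dir="${1%/}"
    for candidate in "$data_dir/db/snapshots" "$data_dir/server/db/snapshots"; do
        if [ -d "$candidate" ]; then
            # ls -A lists everything except . and ..; tr strips wc's padding.
            count=$(ls -A "$candidate" | wc -l | tr -d ' ')
            echo "$candidate: $count entries"
        else
            echo "$candidate: missing"
        fi
    done
}

# Run as root to inspect the default data dir of a root-run service.
report_snapshot_dirs "${K3S_DATA_DIR:-/var/lib/rancher/k3s}"
```

On the machine from this report, the first candidate would be reported as missing and the second as holding the two snapshots listed above.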

Additional context / logs:

I've looked at the sources, and the effective data dir appears to be resolved on server startup, here:
https://github.com/k3s-io/k3s/blob/a562d090b05cf8d55b6a8b57556787c24c8ce21a/pkg/server/server.go#L466-L486
https://github.com/k3s-io/k3s/blob/a562d090b05cf8d55b6a8b57556787c24c8ce21a/pkg/server/server.go#L40-L42

The above applies to both the server and etcd-snapshot commands: to my understanding, a manual etcd-snapshot save sends a POST /db/snapshot request to the server, which in turn computes the snapshot write path from the DataDir config value resolved on startup.

Full systemd unit file

[Unit]
Description=Lightweight Kubernetes
Documentation=https://k3s.io
After=network-online.target

[Service]
Type=notify
EnvironmentFile=/etc/systemd/system/k3s.service.env
ExecStartPre=-/sbin/modprobe br_netfilter
ExecStartPre=-/sbin/modprobe overlay
ExecStart=/usr/bin/k3s server --cluster-init --etcd-snapshot-schedule-cron="* * * * *" --etcd-snapshot-retention=1
KillMode=process
Delegate=yes
# Having non-zero Limits causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
# See https://github.com/k3s-io/k3s/commit/b4335630b78b5cf927e79724067803a6c0d7c04f
LimitNOFILE=1048576
LimitNPROC=infinity
LimitCORE=infinity
TasksMax=infinity
TimeoutStartSec=0
Restart=always
RestartSec=5s

[Install]
WantedBy=multi-user.target

majabojarska commented 1 week ago

Suffixing --data-dir with server for server-originating artifacts totally makes sense, so imo it's the CLI help that needs to be updated. I'll submit a PR.

brandond commented 1 week ago

Merged to master, will backport in the February cycle.