talosctl reset fails if /var/lib/etcd is on its own mount

siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.

https://www.talos.dev

Mozilla Public License 2.0


talosctl reset fails if /var/lib/etcd is on its own mount #6910

Open cjyar opened 1 year ago

cjyar commented 1 year ago

Bug Report

This node is a control plane node, and it has a dedicated device for /var/lib/etcd. When I run talosctl reset on that node, it fails with this message:

◱ watching nodes: [node01]
    * node01: 1 error(s) occurred:
    sequence error: sequence failed: error running phase 4 in reset sequence: task 1/1: failed, failed to leave cluster: failed to remove /var/lib/etcd: unlinkat /var/lib/etcd: device or resource busy

Environment

Talos version: v1.3.1
Kubernetes version: v1.26.0
Platform: linux/amd64

smira commented 1 year ago

Talos doesn't support custom mounts for /var/lib/etcd; in fact, it's a bug that Talos even allows such a mount to be created.

cjyar commented 1 year ago

I thought it was good practice to give etcd its own storage device so it doesn't have to fight with other things for iops. Thanks for clarifying Talos's position.

andrewrynhard commented 1 year ago

> I thought it was good practice to give etcd its own storage device so it doesn't have to fight with other things for iops. Thanks for clarifying Talos's position.

I don't think we are opposed to this; it is just that, as we have it designed today, this is considered a bug. Supporting a dedicated etcd disk would take a decent amount of work, and I don't think we are shutting that idea down entirely. We are all for best practices.

james-callahan commented 1 year ago

I ran into this same issue today when attempting to upgrade from 1.3.7 to 1.4.0:

* 10.16.144.4: 1 error(s) occurred:
    sequence error: sequence failed: error running phase 4 in upgrade sequence: task 1/1: failed, failed to leave cluster: failed to remove /var/lib/etcd: unlinkat /var/lib/etcd: device or resource busy

We use a unique volume for /var/lib/etcd as our controlplane nodes have no other persistent storage.

cjyar commented 1 year ago

The workaround is to run talosctl etcd leave -n $NODE on each node before it needs to reboot.
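A minimal sketch of that workaround as a shell helper, handling one node at a time so the remaining etcd members keep quorum. The function name etcd_leave_node and the TALOSCTL override variable are hypothetical conveniences for illustration, not part of talosctl; the only command taken from this thread is talosctl etcd leave -n. The thread does not describe what should follow the leave beyond the node's reboot, so that step is left to the caller.

```shell
#!/bin/sh
# Sketch only: remove a single node from the etcd cluster before the
# operation that reboots it (for example an upgrade).
#
# TALOSCTL is an assumed override hook (not a talosctl feature) so the
# helper can be dry-run, e.g. TALOSCTL="echo talosctl".
TALOSCTL="${TALOSCTL:-talosctl}"

etcd_leave_node() {
    # $1: address or name of the control plane node (placeholder value)
    if [ -z "$1" ]; then
        echo "usage: etcd_leave_node NODE" >&2
        return 2
    fi
    # Intentionally unquoted so an override like "echo talosctl" splits
    # into a command plus its first argument.
    $TALOSCTL etcd leave -n "$1"
}

# Example (node01 is the node name from the report above):
#   etcd_leave_node node01 && echo "node01 has left etcd; safe to proceed"
```

Run it for one control plane node, let that node finish its reboot and rejoin, and only then move on to the next one; leaving etcd on several members at once risks losing quorum.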

github-actions[bot] commented 2 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 7 days.

james-callahan commented 2 months ago