siderolabs / talos

Talos Linux is a modern Linux distribution built for Kubernetes.
https://www.talos.dev
Mozilla Public License 2.0

Talos reset with disk set does not work #3050

Status: Closed (salkin closed this 3 years ago)

salkin commented 3 years ago

I issued a talosctl reset with the following command: /usr/local/bin/talosctl --talosconfig out.yaml reset --graceful=false --reboot=true --system-labels-to-wipe=EPHEMERAL

It ends with the following error:

[  451.736428] [talos] service[bootkube](Finished): Service finished successfully
[  453.105287] [talos] updated initialization status in etcd
[  736.985736] IPv6: ADDRCONF(NETDEV_CHANGE): calie108abc4d09: link becomes ready
[  736.989420] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1043.534740] IPv6: ADDRCONF(NETDEV_CHANGE): cali06b94b8cfa7: link becomes ready
[ 1043.538583] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 1073.403339] tipc: TX() has been purged, node left!
[ 9445.221223] [talos] reset request received
[ 9445.223667] [talos] reset sequence: 6 phase(s)
[ 9445.225923] [talos] phase stopEverything (1/6): 1 tasks(s)
[ 9445.228610] [talos] task stopAllServices (1/1): starting
[ 9445.231393] [talos] service[routerd](Stopping): Sending SIGTERM to task routerd (PID 2315, container routerd)
[ 9445.235958] [talos] service[kubelet](Stopping): Sending SIGTERM to task kubelet (PID 2646, container kubelet)
[ 9445.241309] [talos] service[etcd](Stopping): Sending SIGTERM to task etcd (PID 2531, container etcd)
[ 9445.245798] [talos] service[apid](Stopping): Sending SIGTERM to task apid (PID 2445, container apid)
[ 9445.250271] [talos] service[udevd](Stopping): Sending SIGTERM to Process(["/sbin/udevd" "--resolve-names=never"])
[ 9445.255430] [talos] service[trustd](Stopping): Sending SIGTERM to task trustd (PID 2438, container trustd)
[ 9445.260697] [talos] service[machined](Finished): Service finished successfully
[ 9445.269992] [talos] service[udevd](Finished): Service finished successfully
[ 9445.340323] [talos] service[apid](Finished): Service finished successfully
[ 9445.343640] [talos] service[routerd](Finished): Service finished successfully
[ 9445.353907] tipc: TX() has been purged, node left!
[ 9445.381672] [talos] service[trustd](Finished): Service finished successfully
[ 9445.423460] [talos] service[etcd](Finished): Service finished successfully
[ 9445.689308] [talos] service[kubelet](Finished): Service finished successfully
[ 9445.692863] [talos] service[cri](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/run/containerd/containerd.sock" "--config" "/etc/cri/containerd.toml"])
[ 9445.738064] [talos] service[cri](Finished): Service finished successfully
[ 9445.741452] [talos] service[networkd](Stopping): Sending SIGTERM to task networkd (PID 2302, container networkd)
[ 9445.817219] [talos] service[networkd](Finished): Service finished successfully
[ 9445.821008] [talos] service[containerd](Stopping): Sending SIGTERM to Process(["/bin/containerd" "--address" "/system/run/containerd/containerd.sock" "--state" "/system/run/containerd" "--root" "/system/var/lib/containerd"])
[ 9445.854591] [talos] service[containerd](Finished): Service finished successfully
[ 9445.858340] [talos] task stopAllServices (1/1): done, 629.745301ms
[ 9445.861226] [talos] phase stopEverything (1/6): done, 635.305916ms
[ 9445.864097] [talos] phase umount (2/6): 2 tasks(s)
[ 9445.866451] [talos] task unmountPodMounts (2/2): starting
[ 9445.869141] [talos] task unmountOverlayFilesystems (1/2): starting
[ 9445.870157] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/30efae0b-9ded-4139-a292-7475d2751192/volumes/kubernetes.io~secret/kube-proxy-token-lhdtc
[ 9445.879440] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/93f62d62-eb59-4f4b-bf58-d9993fe62c7e/volumes/kubernetes.io~secret/canal-token-5hwjx
[ 9445.886360] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/12b50860-a190-415b-982d-c35d09a10098/volumes/kubernetes.io~secret/edge-bootstrap-server-token-qfkdv
[ 9445.893549] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/189411b9-a872-4629-a298-63f1a8f5867a/volumes/kubernetes.io~secret/secrets
[ 9445.899655] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/189411b9-a872-4629-a298-63f1a8f5867a/volumes/kubernetes.io~secret/default-token-7r6fj
[ 9445.906447] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f69a7bbf-bb87-4d57-b474-f6c3006a38c9/volumes/kubernetes.io~secret/kube-controller-manager-token-7cbd7
[ 9445.914160] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/f69a7bbf-bb87-4d57-b474-f6c3006a38c9/volumes/kubernetes.io~secret/secrets
[ 9445.921735] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/8176afa4-1093-41ba-97c1-b95749862440/volumes/kubernetes.io~secret/default-token-7r6fj
[ 9445.930463] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/e866ad46-7ad6-4471-a648-b42d76de556b/volumes/kubernetes.io~secret/pod-checkpointer-token-mslfg
[ 9445.939456] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/92dbd793-58b3-499b-9839-28dc345426e3/volumes/kubernetes.io~secret/coredns-token-jnklm
[ 9445.946968] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/dc8cf331-06a8-4863-a240-8e9a1b5a3451/volumes/kubernetes.io~secret/coredns-token-jnklm
[ 9445.954310] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/bef96344-7563-4cf1-ab21-56a6edaa8f9f/volumes/kubernetes.io~secret/gitops-operator-token-p2wmz
[ 9445.962084] [talos] task unmountPodMounts (2/2): unmounting /var/run/netns/cni-1a96f122-5095-33ea-e003-99c3ef390c1e
[ 9445.967270] [talos] task unmountPodMounts (2/2): unmounting /var/run/netns/cni-b83a6063-0b7a-d361-b672-07ae23b4c1b3
[ 9445.972410] [talos] task unmountPodMounts (2/2): unmounting /var/run/netns/cni-897b16be-7cdb-ea74-cc7d-db68efa609c4
[ 9445.977945] [talos] task unmountPodMounts (2/2): unmounting /var/run/netns/cni-0c85497f-5cfa-7437-7aca-4a13aa626d1d
[ 9445.983443] [talos] task unmountPodMounts (2/2): unmounting /var/run/netns/cni-c76319f3-261a-5282-ae73-80fc68827001
[ 9445.988746] [talos] task unmountPodMounts (2/2): unmounting /var/lib/kubelet/pods/a8045781-41b9-4a7a-9f09-aad9ce26574c/volumes/kubernetes.io~secret/default-token-w5xnr
[ 9445.996065] [talos] task unmountPodMounts (2/2): unmounting /var/run/netns/cni-de594dd4-c2a6-de3d-a812-03135f5b4375
[ 9446.001287] [talos] task unmountPodMounts (2/2): done, 134.859234ms
[ 9446.013006] [talos] task unmountOverlayFilesystems (1/2): done, 146.528359ms
[ 9446.016372] [talos] phase umount (2/6): done, 152.275251ms
[ 9446.018964] [talos] phase unmountSystem (3/6): 2 tasks(s)
[ 9446.021553] [talos] task unmountStatePartition (2/2): starting
[ 9446.024377] [talos] task unmountEphemeralPartition (1/2): starting
[ 9446.332094] [talos] task unmountStatePartition (2/2): done, 310.547642ms
[ 9446.335273] [talos] task unmountEphemeralPartition (1/2): done, 310.492067ms
[ 9446.338532] [talos] phase unmountSystem (3/6): done, 319.565192ms
[ 9446.341378] [talos] phase unmountBind (4/6): 1 tasks(s)
[ 9446.343947] [talos] task unmountSystemDiskBindMounts (1/1): starting
[ 9446.347314] [talos] task unmountSystemDiskBindMounts (1/1): unmounting /etc/cri/containerd.toml
[ 9446.352028] [talos] task unmountSystemDiskBindMounts (1/1): done, 8.115088ms
[ 9446.355286] [talos] phase unmountBind (4/6): done, 13.909051ms
[ 9446.358037] [talos] phase reset (5/6): 1 tasks(s)
[ 9446.360393] [talos] task resetSystemDisk (1/1): starting
[ 9456.425141] [talos] task resetSystemDisk (1/1): failed
[ 9456.427749] [talos] task resetSystemDisk (1/1): done, 10.067367272s
[ 9456.430692] [talos] phase reset (5/6): failed
[ 9456.432877] [talos] reset sequence: failed
[ 9456.434984] [talos] reset sequence: done: 11.211324712s
[ 9456.437496] [talos] reset failed: error running phase 5 in reset sequence: task 1/1: failed, failed to sync kernel partitions: failed to inform kernel: 2 error(s) occurred:
[ 9456.444657]  device or resource busy
[ 9456.446529]  timeout
[ 9456.448076] [talos] fatal sequencer error in "reset" sequence: message:"sequence failed: error running phase 5 in reset sequence: task 1/1: failed, failed to sync kernel partitions: failed to inform kernel: 2 error(s) occurred:\n\tdevice or resource busy\n\ttimeout"
[ 9456.508024] [talos] failed to open meta: file does not exist
[ 9456.510779] [talos] rebooting in 10 seconds
[ 9457.513227] [talos] rebooting in 9 seconds
[ 9458.515864] [talos] rebooting in 8 seconds
[ 9459.518417] [talos] rebooting in 7 seconds
[ 9460.546108] [talos] rebooting in 6 seconds
[ 9461.548642] [talos] rebooting in 5 seconds
[ 9462.551183] [talos] rebooting in 4 seconds
[ 9463.553832] [talos] rebooting in 3 seconds
[ 9464.557046] [talos] rebooting in 2 seconds
[ 9465.559781] [talos] rebooting in 1 seconds
[ 9466.562273] [talos] rebooting in 0 seconds
[ 9467.566538] [talos] waiting for sync...
[ 9467.594710] [talos] sync done
[ 9467.597202] kvm: exiting hardware virtualization
[ 9467.668035] reboot: Restarting system
[ 9467.670013] reboot: machine restart

Talos version: 0.8.1
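
For anyone triaging a similar console log, here is a minimal sketch of a filter that pulls out the phases and tasks reported as failed. It assumes the "[timestamp] [talos] phase|task name (n/m): status" line shape seen in the log above; the function name failed_steps and the sample variable are hypothetical and are not part of Talos or talosctl.

```python
import re

# Assumed line shape, taken from the console log in this issue, e.g.:
#   [ 9456.425141] [talos] task resetSystemDisk (1/1): failed
LINE_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+\[talos\]\s+"
    r"(?P<kind>phase|task)\s+(?P<name>\S+)\s+"
    r"\((?P<idx>\d+)/(?P<total>\d+)\):\s+(?P<status>.*)$"
)


def failed_steps(log_text):
    """Return (timestamp, kind, name) for every phase or task marked failed."""
    failures = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        # Only keep sequencer lines whose status starts with "failed".
        if match and match.group("status").startswith("failed"):
            failures.append(
                (float(match.group("ts")), match.group("kind"), match.group("name"))
            )
    return failures


# A few lines copied from the log in this issue.
sample = """\
[ 9446.358037] [talos] phase reset (5/6): 1 tasks(s)
[ 9446.360393] [talos] task resetSystemDisk (1/1): starting
[ 9456.425141] [talos] task resetSystemDisk (1/1): failed
[ 9456.427749] [talos] task resetSystemDisk (1/1): done, 10.067367272s
[ 9456.430692] [talos] phase reset (5/6): failed
"""

for ts, kind, name in failed_steps(sample):
    print(f"{ts:.6f} {kind} {name}")
```

On the excerpt above this reports the resetSystemDisk task and the reset phase, which is where the "device or resource busy" and "timeout" errors originate.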

smira commented 3 years ago

@salkin this is definitely a bug in Talos. Can you please share the extra manifests you're using so that I can reproduce it?

smira commented 3 years ago

OK, it looks like it might be Talos 0.7.x, not 0.8.x.