rancher / rke2

https://docs.rke2.io/
Apache License 2.0

rke2 start fails after rke2-killall.sh execution #6571

Open mynktl opened 3 months ago

mynktl commented 3 months ago

Environmental Info: RKE2 Version:

rke2 version v1.30.1+rke2r1 (e7f87c6dd56fdd76a7dab58900aeea8946b2c008) go version go1.22.2 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Linux server0 4.18.0-372.41.1.el8_6.x86_64 #1 SMP Thu Jan 5 13:56:06 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server

Describe the bug:

For RKE2, we are using two mount points: one for the etcd database and the other for RKE2. The structure of the /var/lib/rancher directory is as below:

[root@server0 rancher]# tree /var/lib/rancher/ -L 3
/var/lib/rancher/
`-- rke2
    |-- agent
    |   |-- client-ca.crt
    |   |-- client-kubelet.crt
    |   |-- client-kubelet.key
    |   |-- client-kube-proxy.crt
    |   |-- client-kube-proxy.key
    |   |-- client-rke2-controller.crt
    |   |-- client-rke2-controller.key
    |   |-- containerd
    |   |-- etc
    |   |-- images
    |   |-- kubelet.kubeconfig
    |   |-- kubeproxy.kubeconfig
    |   |-- logs
    |   |-- pod-manifests
    |   |-- rke2controller.kubeconfig
    |   |-- server-ca.crt
    |   |-- serving-kubelet.crt
    |   `-- serving-kubelet.key
    |-- bin -> /var/lib/rancher/rke2/data/v1.30.1-rke2r1-c42b85364830/bin
    |-- data
    |   `-- v1.30.1-rke2r1-c42b85364830
    `-- server
        |-- agent-token -> /var/lib/rancher/rke2/server/token
        |-- cred
        |-- db
        |-- etc
        |-- manifests
        |-- node-token -> /var/lib/rancher/rke2/server/token
        |-- tls
        `-- token

16 directories, 16 files

and the mount configuration is as below:

[root@server0 rancher]# lsblk
NAME                          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                             8:0    0  128G  0 disk
|-sda1                          8:1    0  500M  0 part /boot
|-sda2                          8:2    0   63G  0 part
| |-rootvg-tmplv              253:11   0    2G  0 lvm  /tmp
| |-rootvg-usrlv              253:12   0   10G  0 lvm  /usr
| |-rootvg-homelv             253:13   0    1G  0 lvm  /home
| |-rootvg-varlv              253:14   0    8G  0 lvm  /var
| `-rootvg-rootlv             253:15   0   22G  0 lvm  /
|-sda14                         8:14   0    4M  0 part
`-sda15                         8:15   0  495M  0 part /boot/efi
sdb                             8:16   0   16G  0 disk
`-uipathetcdvg-etcdlv         253:10   0   16G  0 lvm  /var/lib/rancher/rke2/server/db
sdd                             8:48   0  256G  0 disk
|-uipathvg-rancherlv          253:1    0  185G  0 lvm  /var/lib/rancher
|-uipathvg-kubeletlv          253:2    0   56G  0 lvm  /var/lib/kubelet
sdf                             8:80   0   32G  0 disk

To mount /var/lib/rancher/rke2/server/db automatically, we have added a dependency on this db mount to rke2-server.service.

[root@server0 rancher]# cat /etc/systemd/system/rke2-server.service.d/custom.conf
[Unit]
After=var-lib-rancher-rke2-server-db.mount var-lib-rancher.mount var-lib-kubelet.mount datadisk-insights.mount datadisk-monitoring.mount datadisk-objectstore.mount datadisk-registry.mount
Requires=var-lib-rancher-rke2-server-db.mount var-lib-rancher.mount var-lib-kubelet.mount datadisk-insights.mount datadisk-monitoring.mount datadisk-objectstore.mount datadisk-registry.mount

So whenever systemctl start rke2-server is executed, systemd performs the db mount and then starts the rke2 server.
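
For context, the drop-in and ordering can be checked with standard systemd commands (a verification sketch added for illustration, not part of the original report):

systemctl daemon-reload                                    # pick up custom.conf
systemctl cat rke2-server.service                          # shows the unit together with the custom.conf drop-in
systemctl list-dependencies --after rke2-server.service | grep -i db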


Issue: When we execute the rke2-killall.sh script, it unmounts /var/lib/rancher/rke2/server/db and deletes this directory. As we don't have a mount point specific to /var/lib/rancher/rke2, do_unmount_and_remove performs the action on /var/lib/rancher/rke2/server/db because of the grep "^$1" condition in the do_unmount_and_remove function.
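
A minimal sketch of just that matching step (not the full rke2-killall.sh function), with $1 assumed to be the /var/lib/rancher/rke2 prefix:

prefix=/var/lib/rancher/rke2
awk '{print $2}' /proc/self/mounts | grep "^${prefix}" | sort -r
# on this node the only match is the nested etcd mount:
# /var/lib/rancher/rke2/server/db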

After this, starting rke2 via systemctl fails with the error below:

-- Unit var-lib-rancher-rke2-server-db.mount has begun starting up.
Aug 15 02:48:25 server0 mount[1527607]: mount: /var/lib/rancher/rke2/server/db: mount point does not exist.
Aug 15 02:48:25 server0 systemd[1]: var-lib-rancher-rke2-server-db.mount: Mount process exited, code=exited status=32
Aug 15 02:48:25 server0 systemd[1]: var-lib-rancher-rke2-server-db.mount: Failed with result 'exit-code'.
-- Subject: Unit failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- The unit var-lib-rancher-rke2-server-db.mount has entered the 'failed' state with result 'exit-code'.
Aug 15 02:48:25 server0 systemd[1]: Failed to mount /var/lib/rancher/rke2/server/db.
-- Subject: Unit var-lib-rancher-rke2-server-db.mount has failed
-- Defined-By: systemd
-- Support: https://access.redhat.com/support
--
-- Unit var-lib-rancher-rke2-server-db.mount has failed.
--
-- The result is failed.

This behaviour occurs with SELinux enabled.

[root@server0 rancher]# getenforce
Enforcing

As we have a separate mount point for the etcd db, there is also a risk of the etcd data getting deleted. In the rke2-killall.sh script, the umount will normally go through since this is a local mount point, but the script is missing set -e. If the umount does fail, rm -rf --one-file-system ${MOUNTS} will delete the contents of the directory, which is unexpected.

Steps To Reproduce:

Expected behavior:

Actual behavior:

Additional context / logs:

rajivml commented 3 months ago

cc @brandond

brandond commented 3 months ago

This sounds like a duplicate of https://github.com/rancher/rke2/issues/6557

mynktl commented 3 months ago

@brandond As the do_unmount_and_remove call for the rke2 directory has been removed in https://github.com/vitorsavian/rke2/blob/587fb7f22469a4827ea9040b36bfb23d14f9d0c5/bundle/bin/rke2-killall.sh, that will fix the problem for the rke2-related directories. But do you see any need for error handling or set -e in do_unmount_and_remove?

We are unmounting the directory and then removing it to clear the mount point. But if the unmount fails, the script still removes the contents of the directory, which may result in data loss.
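
For illustration, one shape this error handling could take (a sketch only, written against the MOUNTS list the script builds, not the shipped rke2-killall.sh) is to make removal conditional on the unmount succeeding:

printf "$MOUNTS" | while read -r mount; do
        if umount "$mount"; then
                rm -rf --one-file-system "$mount"
        else
                echo "leaving $mount in place: umount failed" >&2
        fi
done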

brandond commented 3 months ago

if the unmount fails, the script still removes the contents of the directory, which may result in data loss.

That shouldn't be the case; we use rm -rf --one-file-system ${MOUNTS}, which would not traverse across the filesystem boundary into a path that failed to unmount.
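
For reference, that guard can be seen with a small scratch-filesystem experiment (hypothetical paths, run as root; added here for illustration):

mkdir -p /tmp/demo/inner
mount -t tmpfs tmpfs /tmp/demo/inner          # inner is now a separate filesystem
touch /tmp/demo/inner/keep-me
rm -rf --one-file-system /tmp/demo            # warns, skips the tmpfs, exits non-zero
ls /tmp/demo/inner                            # keep-me is still present
umount /tmp/demo/inner && rm -rf /tmp/demo    # cleanup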

If you believe you're having problems with this not working as intended, please provide steps to reproduce.

maxlillo commented 3 months ago

  1. cd /var/lib/rancher/rke2/server/db
  2. run bash -x rke2-killall.sh

Relevant output below:

This was run with the version of the script from before the /var/lib/rancher/rke2 cleanup was removed.

But it seems like a good idea to add error checking in case something happens that you don't expect. You never know.

brandond commented 3 months ago

I don't think any changes are necessary. Please test on the release that no longer cleans up mounts under /var/lib/rancher/rke2.

maxlillo commented 3 months ago

Why do you think adding error checking is unnecessary? Is there some concern, or do you have some coding standard or style guide you are adhering to?

The issue was not really rm -rf --one-file-system ${MOUNTS}. In your code you execute

MOUNTS=
while read ignore mount ignore; do
        MOUNTS="${mount}\n${MOUNTS}"
done </proc/self/mounts
MOUNTS=$(printf ${MOUNTS} | grep "^$1" | sort -r)

The last command results in MOUNTS being a collection of directories. I believe the purpose of the sort -r is to make sure the mounts are removed in order, deepest paths first.

echo $MOUNTS
/var/lib/rancher/rke2/server/db /var/lib/rancher/rke2

Therefore when rm is executed it runs rm -rf --one-file-system /var/lib/rancher/rke2/server/db, which removes the content of the db directory since the prior unmount command failed.
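
To make that concrete, here is a hypothetical scratch-filesystem demonstration (run as root) of the case where the path that failed to unmount is itself the argument passed to rm, rather than a mount nested under it:

mkdir -p /tmp/demo/db
mount -t tmpfs tmpfs /tmp/demo/db
touch /tmp/demo/db/member
cd /tmp/demo/db                               # keep the mount busy
umount /tmp/demo/db                           # fails: target is busy
rm -rf --one-file-system /tmp/demo/db         # the argument is the mount point itself,
                                              # so 'member' is deleted; only the busy
                                              # directory remains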

We found this issue because a customer ran it and it deleted their etcd db on the node. They were shutting down all nodes, but luckily this only happened on the one node. If the script had had error checking to begin with, this would never have happened.

In the current iteration, no commands catch my eye as unsafe, but I am always surprised by what I miss. And perhaps in the future something bad will get introduced.

I checked some online style guides and this seems to be a best practice:

  1. https://google.github.io/styleguide/shellguide.html#calling-commands
  2. https://tldp.org/LDP/abs/html/external.html - Could not find it stated explicitly, but the examples all include error checking

maxlillo commented 3 months ago

@brandond there are still problems with the new script. It was updated to not remove "${RKE2_DATA_DIR}".

But we use topolvm, and it mounts different resources under /var/lib/kubelet/pods. (Not sure if other CSI plugins do this.)

In the same way we saw the etcd data accidentally get deleted, we see the content of some of our stateful set PVCs get deleted.

rm: cannot remove '/var/lib/kubelet/pods/f891db50-3a58-4ba2-a1e7-cb7c90a42741/volumes/kubernetes.io~local-volume/insights-lookerdir-pv-autosuitea': Device or resource busy
rm: cannot remove '/var/lib/kubelet/pods/f891db50-3a58-4ba2-a1e7-cb7c90a42741/volumes/kubernetes.io~local-volume/insights-looker-datadir-pv-autosuitea': Device or resource busy

The content of these PVs was fully deleted.

Simply adding an error check to the whole operation would fix this. I would rather not see this in the wild like we saw with etcd getting deleted.

brandond commented 3 months ago

I'll reopen this for follow-up.

If you can provide any steps for our QA team to reproduce the issue that would be appreciated. What are you using that fails to unmount, but does allow files to be removed?

maxlillo commented 2 months ago

@brandond I am not sure what originally caused the issue with etcd. My guess is that they have some sort of security scanning going on.

I think you could simulate this by making a host mount in a container, navigating to that directory under /var/lib/kubelet/pods, and then running rke2-killall.sh. Because the working directory of the shell is one of the folders being unmounted, the unmount command fails. If that does not work, then you would need to install a CSI like topolvm.
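
A rough repro sketch along those lines (hypothetical pod UID and volume name, untested):

# start any pod that gets a volume mounted under /var/lib/kubelet/pods, then:
cd /var/lib/kubelet/pods/<pod-uid>/volumes/<volume-name>    # hold the mount busy from a host shell
bash -x rke2-killall.sh                                     # the umount fails while the rm still runs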

We have not seen one of the PVs get wiped in a customer environment. But as part of our RCA we ask the question of "Can this happen again? What could have prevented this?" etc.

Since error checking was not added to the script, it was our conclusion that yes, this could happen again, and we probably cannot predict all the edge cases where it could happen. But losing data in this manner is unacceptable and preventable.

Just adding set -e at the start would prevent this, and as far as I am aware, adding an error check is common best practice.

github-actions[bot] commented 1 month ago

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

VestigeJ commented 1 week ago

@mynktl are you capable of testing this script change in your unique environment? My preliminary tests aren't yielding the same issues that you encountered and I'm wondering if it's related to the CSI you mentioned deploying.