Attached the operator.yaml file in txt format. Please check and let me know whether it refers to any local storage.
The drain error can safely be ignored for these pods, which are stateless. The error is complaining about the hostPath that the pods are using, but it is not a concern for the drain, since those paths only hold logs:
- rook-ceph/csi-cephfsplugin-provisioner-5f459cfb94-g6ccv
- rook-ceph/csi-rbdplugin-provisioner-84c7fb8d76-d4d4b
- rook-ceph/rook-ceph-mds-myfs-b-6cc4cfff96-lwlpw
- rook-ceph/rook-ceph-mgr-a-7d58c74554-r5pz7
- rook-ceph/rook-ceph-tools-79957dbdb7-rtfww
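If those pods' scratch data is indeed disposable, the drain can be told to skip the local-storage check. A minimal sketch (worker2 is used here only because it is the node named later in this thread):

```
# Sketch: drain while skipping DaemonSet pods and allowing pods that only
# hold disposable emptyDir/local scratch data (e.g. logs) to be evicted.
kubectl drain worker2 --ignore-daemonsets --delete-emptydir-data

# Once maintenance is done, allow scheduling on the node again:
kubectl uncordon worker2
```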
@travisn Due to the hostPath that the pods are using, we are not able to drain the node. If we drain the node, will all rook-ceph pods be scheduled onto another available node? Also, if I want to remove a worker node from the k8s cluster, and that worker node hosts a rook-ceph disk (an OSD), how can we stop or move the pods running on that worker node (cordon and drain it) to another available worker node?
> If we drain the node, will all rook-ceph pods be scheduled onto another available node?
yes
> If I want to remove a worker node from the k8s cluster, and that worker node hosts a rook-ceph disk (an OSD), how can we stop or move the pods running on it (cordon and drain the worker node) to another available worker node?
Rook should take care of moving pods to another available node once you drain a node. With respect to storage, you should be fine as long as you drain one node at a time, and only drain the next node once the OSD is back and the data has rebalanced (PGs are active+clean).
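One way to confirm that the data has rebalanced before draining the next node is to query Ceph from the toolbox pod. A minimal sketch, assuming the default rook-ceph namespace and the rook-ceph-tools deployment that appears later in this thread:

```
# Overall health plus a PG summary; wait until every PG reports active+clean
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status

# Compact placement-group view
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph pg stat
```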
@sp98 Thank you. I tried to drain with the following option: `kubectl drain worker2 --ignore-daemonsets --delete-emptydir-data`
Then we get the error message below and the node is not drained. The rook-ceph-mon-e-f96ff7588-nnwm9 pod is running on worker2; why is it not being scheduled onto another available worker node?
```
error when evicting pods/"rook-ceph-mon-e-f96ff7588-nnwm9" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
evicting pod rook-ceph/rook-ceph-mon-e-f96ff7588-nnwm9
error when evicting pods/"rook-ceph-mon-e-f96ff7588-nnwm9" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
```
> error when evicting pods/"rook-ceph-mon-e-f96ff7588-nnwm9" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.
We have a PodDisruptionBudget applied to the mons. It allows only one mon pod to be disrupted at a time.
How many mon pods are running when you run `kubectl drain worker2 --ignore-daemonsets --delete-emptydir-data`? Can you share the output of `kubectl get pods -n rook-ceph`, and also the output of `kubectl get pdb -n rook-ceph`?
```
~$ sudo kubectl get pods -n rook-ceph -owide
NAME                                                    READY   STATUS      RESTARTS   AGE    NODE      NOMINATED NODE   READINESS GATES
csi-cephfsplugin-9926x                                  3/3     Running     18         118d   worker3   <none>           <none>
csi-cephfsplugin-bpz2x                                  3/3     Running     24         118d   worker2   <none>           <none>
csi-cephfsplugin-pr9mq                                  3/3     Running     30         118d   worker1   <none>           <none>
csi-cephfsplugin-provisioner-5f459cfb94-dqxw8           6/6     Running     0          55m    worker1   <none>           <none>
csi-cephfsplugin-provisioner-5f459cfb94-zf7pj           6/6     Running     24         90d    worker3   <none>           <none>
csi-rbdplugin-f58nm                                     3/3     Running     18         118d   worker3   <none>           <none>
csi-rbdplugin-nq2wm                                     3/3     Running     24         118d   worker2   <none>           <none>
csi-rbdplugin-nr7fb                                     3/3     Running     30         118d   worker1   <none>           <none>
csi-rbdplugin-provisioner-84c7fb8d76-77gbk              6/6     Running     24         90d    worker3   <none>           <none>
csi-rbdplugin-provisioner-84c7fb8d76-dlprv              6/6     Running     0          55m    worker1   <none>           <none>
rook-ceph-crashcollector-jmapworker1-8fc9c8496-d5876    1/1     Running     0          11d    worker1   <none>           <none>
rook-ceph-crashcollector-jmapworker2-798bdbd764-xcsz9   0/1     Pending     0          55m    <none>    <none>           <none>
rook-ceph-crashcollector-jmapworker3-f5b4b4f45-hhnwc    1/1     Running     4          90d    worker3   <none>           <none>
rook-ceph-mds-myfs-a-7598fc97b4-r9znp                   2/2     Running     34         90d    worker3   <none>           <none>
rook-ceph-mds-myfs-b-6cc4cfff96-dk9zp                   2/2     Running     0          55m    worker1   <none>           <none>
rook-ceph-mgr-a-7d58c74554-lhwvs                        2/2     Running     0          55m    worker3   <none>           <none>
rook-ceph-mon-b-755464db8f-jglvs                        2/2     Running     8          93d    worker3   <none>           <none>
rook-ceph-mon-d-c78846766-ls6qb                         0/2     Pending     0          55m    <none>    <none>           <none>
rook-ceph-mon-e-f96ff7588-nnwm9                         2/2     Running     0          11d    worker2   <none>           <none>
rook-ceph-osd-1-65fc776d98-5qp5f                        2/2     Running     0          11d    worker1   <none>           <none>
rook-ceph-osd-2-5b6c9f5947-hh965                        2/2     Running     8          94d    worker3   <none>           <none>
rook-ceph-osd-prepare-jmapworker1-zthl9                 0/1     Completed   0          9d     worker1   <none>           <none>
rook-ceph-osd-prepare-jmapworker3-qgwl2                 0/1     Completed   0          9d     worker3   <none>           <none>
rook-ceph-tools-79957dbdb7-tq56f                        1/1     Running     0          55m    worker1   <none>           <none>
```
And the output of `kubectl get pdb -n rook-ceph`:

```
NAME                 MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mds-myfs   1               N/A               1                     118d
rook-ceph-mon-pdb    N/A             1                 0                     118d
rook-ceph-osd        N/A             1                 1                     11d
```
@vadlakiran
Based on the output you have shared, one of your mon pods, rook-ceph-mon-d-c78846766-ls6qb, is already in Pending state. That leaves only two mons running, which is why rook-ceph-mon-pdb shows 0 allowed disruptions.
That is the reason you are not able to drain the node with mon-e and you get the message: `error when evicting pods/"rook-ceph-mon-e-f96ff7588-nnwm9" -n "rook-ceph" (will retry after 5s): Cannot evict pod as it would violate the pod's disruption budget.`
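The same can be read off the PDB itself; a quick check, using the PDB name from the output above, would be:

```
# ALLOWED DISRUPTIONS stays at 0 until all three mons are healthy again,
# so any further mon eviction is refused
kubectl -n rook-ceph get pdb rook-ceph-mon-pdb

# Shows the selector and the current healthy/desired counts behind that number
kubectl -n rook-ceph describe pdb rook-ceph-mon-pdb
```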
@sp98 If I delete the pending pod, will that resolve the issue?
It won't. You need 3 mons running for quorum, and currently you have only 2. So I would strongly advise fixing the mon quorum before draining any other node.
So we need to figure out why this mon pod was not assigned to any node. You can run `kubectl describe pod <pod-name> -n rook-ceph` and check the events to see why it is not being scheduled.
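For the pending mon shown above, that would look something like the following; the Events section at the end of the output usually names the scheduling constraint that cannot be satisfied:

```
# Check the Events at the bottom for messages such as
# "0/4 nodes are available: ... didn't match Pod's node affinity/selector"
kubectl -n rook-ceph describe pod rook-ceph-mon-d-c78846766-ls6qb
```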
@vadlakiran Any luck finding out why the rook-ceph-mon-d-c78846766-ls6qb pod is pending?
@sp98 Yes, I deleted the deployment behind the pod that was stuck in Pending, and after that I was able to drain the node.
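For anyone hitting the same situation, that step presumably amounted to something like the sketch below. The deployment name rook-ceph-mon-d is an assumption derived from the pending pod's name, and the Rook operator is expected to recreate or fail over that mon, so verify that quorum is back to 3 mons before draining anything else:

```
# Assumed reconstruction: remove the deployment behind the mon pod stuck in Pending
kubectl -n rook-ceph delete deployment rook-ceph-mon-d

# Confirm mon quorum has recovered before continuing
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph quorum_status --format json-pretty
```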
Thank you
@vadlakiran can we close this issue?
@vadlakiran Looks like your issue is fixed, so closing this for now. Please reopen if you think the issue is not resolved yet.
We have one master and 3 worker nodes in the k8s cluster, with rook-ceph deployed on it. We are using a disk from each of the 3 worker nodes, i.e. 3 OSDs.
As part of OSD / worker node replacement we are trying to cordon and drain a worker node (before the drain we removed the OSD from the Ceph cluster, following the documentation).
While draining the worker node we get the issue below.
```
~$ sudo kubectl drain worker2 --ignore-daemonsets
node/worker2 already cordoned
error: unable to drain node "worker2", aborting command...

There are pending nodes to be drained:
 worker2
error: cannot delete Pods with local storage (use --delete-emptydir-data to override): rook-ceph/csi-cephfsplugin-provisioner-5f459cfb94-g6ccv, rook-ceph/csi-rbdplugin-provisioner-84c7fb8d76-d4d4b, rook-ceph/rook-ceph-mds-myfs-b-6cc4cfff96-lwlpw, rook-ceph/rook-ceph-mgr-a-7d58c74554-r5pz7, rook-ceph/rook-ceph-tools-79957dbdb7-rtfww
vmauser@master1:~$
```
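For context, the OSD removal mentioned above normally follows Rook's purge procedure. A rough sketch is shown below; the OSD ID 0 is only an assumed example and is not taken from this cluster:

```
# Stop the OSD pod so Ceph marks the OSD down (assumed OSD ID 0)
kubectl -n rook-ceph scale deployment rook-ceph-osd-0 --replicas=0

# From the toolbox: take the OSD out and let data migrate off it
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd out osd.0

# Once PGs are active+clean again, remove the OSD from the cluster map
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph osd purge 0 --yes-i-really-mean-it

# Finally delete the now-empty OSD deployment
kubectl -n rook-ceph delete deployment rook-ceph-osd-0
```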
Is this a bug report or feature request?

Deviation from expected behavior:

Expected behavior:

How to reproduce it (minimal and precise):

File(s) to submit:
- cluster.yaml, if necessary

Logs to submit:
- Crashing pod(s) logs, if necessary. To get logs, use `kubectl -n <namespace> logs <pod name>`. When pasting logs, always surround them with backticks or use the insert code button from the GitHub UI. Read the GitHub documentation if you need help.

Cluster Status to submit:
- Output of kubectl commands, if necessary. To get the health of the cluster, use `kubectl rook-ceph health`. To get the status of the cluster, use `kubectl rook-ceph ceph status`.
```
ceph status
  cluster:
    id:     00770d4b-045a-4102-bd24-6855c936ad7d
    health: HEALTH_WARN
            Degraded data redundancy: 119492/358476 objects degraded (33.333%), 64 pgs degraded, 65 pgs undersized
            24 pgs not deep-scrubbed in time
            OSD count 2 < osd_pool_default_size 3

  services:
    mon: 3 daemons, quorum b,d,e (age 10d)
    mgr: a(active, since 3w)
    mds: 1/1 daemons up, 1 hot standby
    osd: 2 osds: 2 up (since 8d), 2 in (since 8d)

  data:
    volumes: 1/1 healthy
    pools:   3 pools, 65 pgs
    objects: 119.49k objects, 1.6 GiB
    usage:   4.7 GiB used, 495 GiB / 500 GiB avail
    pgs:     119492/358476 objects degraded (33.333%)
             64 active+undersized+degraded
             1  active+undersized

  io:
    client:   853 B/s rd, 2 op/s rd, 0 op/s wr
```

For more details, see the Rook kubectl Plugin.
Environment:
- OS (e.g. from /etc/os-release): NAME="Ubuntu" VERSION="18.04.6 LTS (Bionic Beaver)" ID=ubuntu ID_LIKE=debian PRETTY_NAME="Ubuntu 18.04.6 LTS" VERSION_ID="18.04"
- Kernel (e.g. `uname -a`): Linux master1 4.15.0-202-generic #213-Ubuntu SMP Thu Jan 5 19:19:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
- Cloud provider or hardware configuration: VMs
- Rook version (use `rook version` inside of a Rook Pod): rook: v1.8.1
- Storage backend version (e.g. for Ceph do `ceph -v`): ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)
- Kubernetes version (use `kubectl version`): v1.23.5
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): VMs, on-premises k8s cluster
- Storage backend status (e.g. for Ceph use `ceph health` in the Rook Ceph toolbox):