rook / kubectl-rook-ceph

Krew plugin to run kubectl commands with rook-ceph
Apache License 2.0

Add krew command for debugging CSI #69

Open Madhu-1 opened 1 year ago

Madhu-1 commented 1 year ago

This can be done by running curl/ping or executing ceph commands from the cephfs/rbd plugin container of the provisioner and daemonset pods.

If the PVC is not attaching to a given pod, we can identify which node the pod is scheduled on, and from the rbdplugin pod on that node we can run dmesg and print the logs, which helps with debugging.
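A minimal sketch of the node lookup described above, assuming the input is the parsed output of `kubectl get pods -o json` (for example via `json.loads`). The function names are hypothetical, and the `app=csi-rbdplugin` label is an assumption about how the plugin daemonset pods are labelled; verify it with `kubectl -n rook-ceph get pods --show-labels`.

```python
from typing import Optional

# Assumed label on the RBD plugin daemonset pods; check it in your cluster.
PLUGIN_LABEL = ("app", "csi-rbdplugin")


def node_of_pod(pod_list: dict, pod_name: str) -> Optional[str]:
    """Return spec.nodeName of the named pod, or None if it is missing or unscheduled."""
    for pod in pod_list.get("items", []):
        if pod["metadata"]["name"] == pod_name:
            return pod.get("spec", {}).get("nodeName")
    return None


def plugin_pod_on_node(pod_list: dict, node: str) -> Optional[str]:
    """Return the name of the plugin pod running on the given node, if any."""
    key, value = PLUGIN_LABEL
    for pod in pod_list.get("items", []):
        labels = pod["metadata"].get("labels", {})
        on_node = pod.get("spec", {}).get("nodeName") == node
        if labels.get(key) == value and on_node:
            return pod["metadata"]["name"]
    return None
```

With the plugin pod name in hand, the helper could either run dmesg in that pod or simply print the pod name so the admin knows where to look.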

More details about which commands to run and where to run them are documented here and here.

We might have two provisioner pods running. Newcomers sometimes find it hard to identify the leader pod and which container to pull the logs from, so we can add a simple helper command that pulls the required logs.
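One possible way to find the leader, sketched under assumptions: the CSI sidecars use Kubernetes Lease objects for leader election, and the input is the parsed output of `kubectl -n rook-ceph get leases -o json`. The function names are hypothetical. How `holderIdentity` maps to a pod differs between sidecars, so the substring match below is only a heuristic and may return nothing.

```python
from typing import Dict, List, Optional


def lease_holders(lease_list: dict) -> Dict[str, str]:
    """Map each lease name to its spec.holderIdentity (empty string if unset)."""
    holders = {}
    for lease in lease_list.get("items", []):
        name = lease["metadata"]["name"]
        holders[name] = lease.get("spec", {}).get("holderIdentity", "")
    return holders


def match_holder_to_pod(holder: str, pod_names: List[str]) -> Optional[str]:
    """Heuristic (an assumption): return the pod whose name appears in the holder identity."""
    for pod_name in pod_names:
        if pod_name and pod_name in holder:
            return pod_name
    return None
```

If no pod matches, the helper could fall back to printing the raw lease holders and leave the mapping to the admin.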

This helps when CSI is not giving the expected results but the admin still wants to manually mount or unmount the RBD image, or run rados or ceph fs commands.

This is a big topic and needs a lot of automation. It is really helpful, but it is also a dangerous command. I will provide more details when we start working on this one.

Print the kernel version from the nodes where the cephfs/rbd plugin pods run; this helps in some debugging cases.
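A short sketch of the kernel version lookup, assuming the input is the parsed output of `kubectl get nodes -o json`. The kernel version is read from the node's `status.nodeInfo.kernelVersion` field; the function name and the optional node filter are hypothetical.

```python
from typing import Dict, Iterable, Optional


def kernel_versions(node_list: dict, only_nodes: Optional[Iterable[str]] = None) -> Dict[str, str]:
    """Map node name to kernel version, optionally restricted to the given nodes."""
    wanted = set(only_nodes) if only_nodes is not None else None
    versions = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        if wanted is not None and name not in wanted:
            continue
        node_info = node.get("status", {}).get("nodeInfo", {})
        versions[name] = node_info.get("kernelVersion", "unknown")
    return versions
```

Passing the node names of the plugin pods as `only_nodes` limits the output to the nodes that actually run the cephfs/rbd plugin.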

Some details are here. We might also need a command to remove watchers.

travisn commented 1 year ago

Before we implement these commands, we need a design about what will be most helpful for troubleshooting csi issues. For example, if the tool could help with questions such as:

And in the output of the tool, do we really want to retrieve container logs or get long dmesg output? I wonder if we should do something more basic like print suggestions for where to run dmesg, print which provisioners are the leaders (with the pod names) so they can look at those logs, or print that there are ceph health issues that would prevent volumes from working. Or if we're going to get full logs, do we dump them in some directory, or where do we put them?

For ceph health, we could print status such as whether mons are in quorum, whether there are OSDs, whether any PGs are unhealthy, whether all the expected csi pods are running, and so on.
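As a rough sketch of such a summary, the function below turns the parsed output of `ceph status --format json` into a list of short findings. The field names (`health.status`, `quorum_names`, `osdmap.num_osds`, `osdmap.num_up_osds`, `pgmap.pgs_by_state`) are assumptions based on recent Ceph releases and should be checked against the cluster's actual output; the function name and the `expected_mons` parameter are hypothetical.

```python
from typing import List


def summarize_ceph_status(status: dict, expected_mons: int = 3) -> List[str]:
    """Return human-readable findings; an empty list means nothing looked wrong."""
    findings = []

    health = status.get("health", {}).get("status", "UNKNOWN")
    if health != "HEALTH_OK":
        findings.append(f"ceph health is {health}")

    quorum = status.get("quorum_names", [])
    if len(quorum) < expected_mons:
        findings.append(f"only {len(quorum)} of {expected_mons} mons in quorum")

    osdmap = status.get("osdmap", {})
    # Assumption: some older releases nest the counters one level deeper.
    osdmap = osdmap.get("osdmap", osdmap)
    num_osds = osdmap.get("num_osds", 0)
    num_up = osdmap.get("num_up_osds", 0)
    if num_osds == 0:
        findings.append("no OSDs found")
    elif num_up < num_osds:
        findings.append(f"{num_up} of {num_osds} OSDs are up")

    # Any PG state other than active+clean is reported as unhealthy.
    for entry in status.get("pgmap", {}).get("pgs_by_state", []):
        if entry.get("state_name") != "active+clean":
            findings.append(f"{entry.get('count')} PGs are {entry.get('state_name')}")

    return findings
```

Printing these findings instead of full logs would keep the output short and point the admin at the next place to look.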

What about some overall functions like this?