Madhu-1 opened 1 year ago
Before we implement these commands, we need a design for what will be most helpful when troubleshooting CSI issues. For example, which questions should the tool help answer?
And in the output of the tool, do we really want to retrieve container logs or a long dmesg dump? I wonder if we should do something more basic: print suggestions for where to run dmesg, print which provisioners are the leaders (with the pod names) so users can look at those logs, or print that there are Ceph health issues that would prevent volumes from working. And if we do retrieve full logs, do we dump them into some directory, or where do we put them?
For Ceph health, we could print status such as whether the mons are in quorum, whether there are OSDs, whether any PGs are unhealthy, whether all the expected CSI pods are running, and so on.
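As a rough sketch of what such a health summary could look like, the function below pulls a couple of those fields out of the JSON that `ceph status --format json` emits. The parsing here is deliberately minimal and the exact field handling is an assumption; a real implementation should use a proper JSON parser.

```shell
# Minimal sketch: summarize mon quorum and overall health from the JSON that
# `ceph status --format json` emits. grep/cut is only to illustrate the idea;
# a real implementation should parse the JSON properly.
summarize_ceph_status() {
  local json="$1"
  local health quorum
  health=$(printf '%s' "$json" | grep -o '"status":"HEALTH_[A-Z]*"' | head -n1 | cut -d'"' -f4)
  quorum=$(printf '%s' "$json" | grep -o '"quorum_names":\[[^]]*\]' | tr ',' '\n' | wc -l | tr -d ' ')
  echo "overall health: ${health:-unknown}"
  echo "mons in quorum: ${quorum}"
}

# Sample input (shape mirrors `ceph status --format json`, heavily trimmed).
sample='{"health":{"status":"HEALTH_WARN"},"quorum_names":["a","b","c"]}'
summarize_ceph_status "$sample"
```

In a live cluster, the JSON would come from running `ceph status --format json` inside one of the CSI or toolbox containers rather than from a canned sample.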
What about some overall functions like this?
- `kubectl rook-ceph csi health`: overall CSI health
- `kubectl rook-ceph csi health <pvc>`: info about why a specific PVC might not be mounting and which logs might help troubleshoot it
- `kubectl rook-ceph csi blocklist <node>`: blocklist a node that is down so the PVs on that node can move to another node
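For the blocklist case, the underlying Ceph operation already exists (`ceph osd blocklist add`, formerly `blacklist`). A dry-run sketch of what the subcommand might issue; the node-to-IP lookup and the address format are assumptions:

```shell
# Hypothetical dry-run sketch: print the commands a `csi blocklist <node>`
# helper might run. Drop the leading `echo`s to run against a live cluster.
blocklist_node() {
  local node="$1"
  # Look up the node's internal IP from the Node object.
  echo "kubectl get node ${node} -o jsonpath='{.status.addresses[?(@.type==\"InternalIP\")].address}'"
  # Blocklist Ceph clients from that node (address format is an assumption).
  echo "ceph osd blocklist add <internal-ip>:0/0"
}
blocklist_node worker-1
```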
This can be done with curl/ping or by executing ceph commands from the cephfs/rbd-plugin container of the provisioner and daemonset pods.
If a PVC is not attaching to a given pod, we can identify which node the pod is scheduled on, and from the rbdplugin pod on that node we can run dmesg and print logs that help with debugging.
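A dry-run sketch of that flow; the `rook-ceph` namespace and the `app=csi-rbdplugin` label are assumptions based on Rook's usual defaults:

```shell
# Hypothetical dry-run sketch: locate the node a pod is scheduled on, find the
# rbdplugin daemonset pod on that node, and pull dmesg from it.
# Drop the leading `echo`s to run against a live cluster.
pvc_debug_commands() {
  local ns="$1" pod="$2"
  echo "kubectl get pod -n ${ns} ${pod} -o jsonpath='{.spec.nodeName}'"
  echo "kubectl get pod -n rook-ceph -l app=csi-rbdplugin --field-selector spec.nodeName=<node> -o name"
  echo "kubectl exec -n rook-ceph <plugin-pod> -c csi-rbdplugin -- dmesg"
}
pvc_debug_commands default my-app
```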
More details about which commands to run and where to run them are documented here and here.
We might have two provisioner pods running, and for newcomers it is sometimes hard to find the leader pod and which container to pull the logs from. We can add a simple helper command that pulls the required logs:
- For PVC create/delete issues, pull logs from the csi-provisioner and csi-rbdplugin/csi-cephfsplugin containers
- For snapshot create/delete issues, pull logs from the csi-snapshotter and csi-rbdplugin/csi-cephfsplugin containers
- For resize issues, pull logs from the csi-resizer and csi-rbdplugin/csi-cephfsplugin containers, etc.
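A minimal sketch of the mapping such a helper could encode; the issue-type names are made up for illustration:

```shell
# Hypothetical sketch: map an issue type to the sidecar container whose logs
# the helper should pull, alongside the driver container for the volume type
# (csi-rbdplugin for RBD, csi-cephfsplugin for CephFS).
containers_for_issue() {
  case "$1" in
    provision) echo "csi-provisioner" ;;
    snapshot)  echo "csi-snapshotter" ;;
    resize)    echo "csi-resizer" ;;
    *)         echo "unknown issue type: $1" >&2; return 1 ;;
  esac
}
containers_for_issue snapshot
```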
- [ ] Command to run ceph commands from the provisioner or nodeplugin containers
This helps in cases where CSI is not getting the expected results but the admin still wants to manually mount the RBD image, unmount it, or run rados commands or ceph fs commands.
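A dry-run sketch of such a wrapper; the deployment and container names are assumptions (in Rook the RBD provisioner is typically a Deployment named `csi-rbdplugin-provisioner`):

```shell
# Hypothetical dry-run sketch: run an arbitrary ceph/rbd/rados command inside
# the provisioner's driver container. Drop the leading `echo` to execute
# against a live cluster.
ceph_exec() {
  echo kubectl -n rook-ceph exec deploy/csi-rbdplugin-provisioner -c csi-rbdplugin -- "$@"
}
ceph_exec rbd ls replicapool
```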
This is a big topic and needs a lot of automation, but it is really helpful; at the same time, it is a dangerous command. I will provide more details when we start working on this one.
Printing the kernel version from the nodes where the cephfs/rbd plugin pods run; this helps in some debug cases.
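The kernel version is already reported in each Node's standard `nodeInfo` status, so this needs no exec into pods at all; shown here as a dry run (filtering to only the nodes that run plugin pods is left out for brevity):

```shell
# Kernel version per node via the standard Node status field.
# Dry run: echoed rather than executed; run the printed command on a cluster.
cmd="kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion"
echo "$cmd"
```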
Some details are here. We might also need a command to remove a watcher.
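For the watcher case, listing is already covered by `rbd status`, and clearing a stale watcher is usually done by blocklisting its client address. A dry-run sketch; the pool/image names are placeholders:

```shell
# Hypothetical dry-run sketch: list watchers on an RBD image, then evict a
# stale watcher by blocklisting its client address. Drop the `echo`s to run
# these inside a container that has ceph credentials.
image_watchers() {
  local pool_image="$1"
  echo "rbd status ${pool_image}"                 # prints the image's watchers
  echo "ceph osd blocklist add <watcher-address>" # evicts a stale watcher
}
image_watchers replicapool/myimage
```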