Madhu-1 opened 1 year ago
Before we implement these commands, we need a design for what will be most helpful when troubleshooting CSI issues. For example, which questions should the tool help answer?
And in the output of the tool, do we really want to retrieve container logs or a long dmesg dump? I wonder if we should do something more basic: print suggestions for where to run dmesg, print which provisioners are the leaders (with the pod names) so users can look at those logs, or print that there are Ceph health issues that would prevent volumes from working. And if we do retrieve full logs, do we dump them into some directory, or where do we put them?
For Ceph health, we could print status such as whether the mons are in quorum, whether there are OSDs, whether any PGs are unhealthy, whether all the expected CSI pods are running, and so on.
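As a rough sketch of what such a health summary could look like, the function below pulls a couple of those fields out of the JSON that `ceph status --format json` emits. The parsing here is deliberately minimal and the exact field handling is an assumption; a real implementation should use a proper JSON parser.

```shell
# Minimal sketch: summarize mon quorum and overall health from the JSON that
# `ceph status --format json` emits. grep/cut is only to illustrate the idea;
# a real implementation should parse the JSON properly.
summarize_ceph_status() {
  local json="$1"
  local health quorum
  health=$(printf '%s' "$json" | grep -o '"status":"HEALTH_[A-Z]*"' | head -n1 | cut -d'"' -f4)
  quorum=$(printf '%s' "$json" | grep -o '"quorum_names":\[[^]]*\]' | tr ',' '\n' | wc -l | tr -d ' ')
  echo "overall health: ${health:-unknown}"
  echo "mons in quorum: ${quorum}"
}

# Sample input (shape mirrors `ceph status --format json`, heavily trimmed).
sample='{"health":{"status":"HEALTH_WARN"},"quorum_names":["a","b","c"]}'
summarize_ceph_status "$sample"
```

In a live cluster, the JSON would come from running `ceph status --format json` inside one of the CSI or toolbox containers rather than from a canned sample.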
What about some overall functions like this?
- `kubectl rook-ceph csi health`: overall CSI health
- `kubectl rook-ceph csi health <pvc>`: info about why a specific PVC might not be mounting and which logs might help troubleshoot it
- `kubectl rook-ceph csi blocklist <node>`: blocklist a node that is down so the PVs on that node can move to another node
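For the blocklist case, the underlying Ceph operation already exists (`ceph osd blocklist add`, formerly `blacklist`). A dry-run sketch of what the subcommand might issue; the node-to-IP lookup and the address format are assumptions:

```shell
# Hypothetical dry-run sketch: print the commands a `csi blocklist <node>`
# helper might run. Drop the leading `echo`s to run against a live cluster.
blocklist_node() {
  local node="$1"
  # Look up the node's internal IP from the Node object.
  echo "kubectl get node ${node} -o jsonpath='{.status.addresses[?(@.type==\"InternalIP\")].address}'"
  # Blocklist Ceph clients from that node (address format is an assumption).
  echo "ceph osd blocklist add <internal-ip>:0/0"
}
blocklist_node worker-1
```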
This can be done with curl/ping or by executing ceph commands from the cephfs/rbd-plugin container of the provisioner and daemonset pods.
If a PVC is not attaching to a given pod, we can identify which node the pod is scheduled on, and from the rbdplugin pod on that node we can run dmesg and print logs that help with debugging.
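A dry-run sketch of that flow; the `rook-ceph` namespace and the `app=csi-rbdplugin` label are assumptions based on Rook's usual defaults:

```shell
# Hypothetical dry-run sketch: locate the node a pod is scheduled on, find the
# rbdplugin daemonset pod on that node, and pull dmesg from it.
# Drop the leading `echo`s to run against a live cluster.
pvc_debug_commands() {
  local ns="$1" pod="$2"
  echo "kubectl get pod -n ${ns} ${pod} -o jsonpath='{.spec.nodeName}'"
  echo "kubectl get pod -n rook-ceph -l app=csi-rbdplugin --field-selector spec.nodeName=<node> -o name"
  echo "kubectl exec -n rook-ceph <plugin-pod> -c csi-rbdplugin -- dmesg"
}
pvc_debug_commands default my-app
```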
More details about which commands to run and where to run them are documented here and here.
We might have two provisioner pods running, and for newcomers it is sometimes hard to find the leader pod and which container to pull the logs from. We can add a simple helper command that pulls the required logs:
- For PVC create/delete issues, pull logs from the csi-provisioner and csi-rbdplugin/csi-cephfsplugin containers
- For snapshot create/delete issues, pull logs from the csi-snapshotter and csi-rbdplugin/csi-cephfsplugin containers
- For resize issues, pull logs from the csi-resizer and csi-rbdplugin/csi-cephfsplugin containers, etc.
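A minimal sketch of the mapping such a helper could encode; the issue-type names are made up for illustration:

```shell
# Hypothetical sketch: map an issue type to the sidecar container whose logs
# the helper should pull, alongside the driver container for the volume type
# (csi-rbdplugin for RBD, csi-cephfsplugin for CephFS).
containers_for_issue() {
  case "$1" in
    provision) echo "csi-provisioner" ;;
    snapshot)  echo "csi-snapshotter" ;;
    resize)    echo "csi-resizer" ;;
    *)         echo "unknown issue type: $1" >&2; return 1 ;;
  esac
}
containers_for_issue snapshot
```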
- [ ] Command to run ceph commands from the provisioner or nodeplugin containers
This helps in cases where CSI is not getting the expected results but the admin still wants to manually mount the RBD image, unmount it, or run rados commands or ceph fs commands.
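A dry-run sketch of such a wrapper; the deployment and container names are assumptions (in Rook the RBD provisioner is typically a Deployment named `csi-rbdplugin-provisioner`):

```shell
# Hypothetical dry-run sketch: run an arbitrary ceph/rbd/rados command inside
# the provisioner's driver container. Drop the leading `echo` to execute
# against a live cluster.
ceph_exec() {
  echo kubectl -n rook-ceph exec deploy/csi-rbdplugin-provisioner -c csi-rbdplugin -- "$@"
}
ceph_exec rbd ls replicapool
```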
This is a big topic and needs a lot of automation, but it is really helpful; at the same time, it is a dangerous command. I will provide more details when we start working on this one.
Printing the kernel version from the nodes where the cephfs/rbd plugin pods run; this helps in some debug cases.
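The kernel version is already reported in each Node's standard `nodeInfo` status, so this needs no exec into pods at all; shown here as a dry run (filtering to only the nodes that run plugin pods is left out for brevity):

```shell
# Kernel version per node via the standard Node status field.
# Dry run: echoed rather than executed; run the printed command on a cluster.
cmd="kubectl get nodes -o custom-columns=NAME:.metadata.name,KERNEL:.status.nodeInfo.kernelVersion"
echo "$cmd"
```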
Some details are here. We might also need a command to remove a watcher.
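For the watcher case, listing is already covered by `rbd status`, and clearing a stale watcher is usually done by blocklisting its client address. A dry-run sketch; the pool/image names are placeholders:

```shell
# Hypothetical dry-run sketch: list watchers on an RBD image, then evict a
# stale watcher by blocklisting its client address. Drop the `echo`s to run
# these inside a container that has ceph credentials.
image_watchers() {
  local pool_image="$1"
  echo "rbd status ${pool_image}"                 # prints the image's watchers
  echo "ceph osd blocklist add <watcher-address>" # evicts a stale watcher
}
image_watchers replicapool/myimage
```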