replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0

Collector for PVC disk usage #1496

Closed: mnp closed this issue 4 months ago

mnp commented 6 months ago

Describe the rationale for the suggested feature.

Troubleshoot collects PVC specs but not disk usage.

Describe the feature

K8s users can use a script like kubedf (available here), which calls the /api/v1/nodes API and collects capacity bytes, available bytes, and percent used. This algorithm would port cleanly to Go for implementation as a collector; maybe call it "pvcDiskUsage"?

I imagine it would take an optional namespace and an optional PVC name (default = all). Note that not everyone knows all their PVC names ahead of time; sometimes they're created dynamically.
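As a strawman, the spec might look something like this (nothing here exists yet; "pvcDiskUsage" and its fields are just this proposal):

```yaml
spec:
  collectors:
    - pvcDiskUsage:          # hypothetical collector name from this proposal
        namespace: my-app    # optional; default is all namespaces
        name: data-my-app-0  # optional PVC name; default is all PVCs
```

Leaving both fields out would cover the dynamically created PVCs mentioned above.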

Describe alternatives you've considered

Additional context

Our users create PVCs dynamically and when they fill up, it's a source of errors. A support bundle containing utilization metrics would be ideal.

mnp commented 5 months ago

This terrible hack, made possible by the recent addition of runDaemonSet, may be enough to tide us over. It scrapes the node volume metrics, which are in Prometheus format, and throws them into the collected logs. Some massaging would provide usage percentages.

```yaml
collectors:
  - runDaemonSet:
      name: "disk-usage"
      podSpec:
        containers:
          - name: metrics-scraper
            image: busybox
            command:
              - "sh"
              - "-c"
              - "wget -q -O - ${NODE_IP}:10255/metrics | grep kubelet_volume_stats"
            env:
              - name: NODE_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
```
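
As an untested sketch of the "massaging" mentioned above, the same hack could pair kubelet_volume_stats_used_bytes with kubelet_volume_stats_capacity_bytes per PVC label set and print a usage percentage directly (the awk below is illustrative, not something we run in production):

```yaml
collectors:
  - runDaemonSet:
      name: "disk-usage-pct"
      podSpec:
        containers:
          - name: metrics-scraper
            image: busybox
            command:
              - "sh"
              - "-c"
              - |
                # scrape kubelet volume stats, then pair used and capacity bytes per PVC label set
                wget -q -O - ${NODE_IP}:10255/metrics |
                  awk '/^kubelet_volume_stats_used_bytes/     { k=$1; sub(/^[^{]*/, "", k); used[k]=$2 }
                       /^kubelet_volume_stats_capacity_bytes/ { k=$1; sub(/^[^{]*/, "", k); cap[k]=$2 }
                       END { for (k in used) if (cap[k] > 0) printf "%s %.1f%%\n", k, 100 * used[k] / cap[k] }'
            env:
              - name: NODE_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
```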
xavpaice commented 5 months ago

This might be a good place to use Prometheus and have it send an alert when space is running low.

Let's spend some time looking at kubedf to see what we can do with it.

mnp commented 5 months ago

Yes, good point @xavpaice - we had the same idea: collect into our own Prometheus (we try not to step on yours) and then retrieve a few key metrics from there. Someone else pointed out that the kubedf method only works for certain CSI providers, so it seems nothing here is both holistic and 100% reliable.

banjoh commented 5 months ago

My thinking here is that we should collect all the node metrics with a new collector, similar to the custom metrics one. In hindsight, we should have just created a k8s metrics collector that collects node and external metrics.

The new collector spec would be something simple, like the one below:

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - nodeMetrics: {}
```

We can then have a nodeMetrics analyser that reports on various metrics, starting with volume consumption at this stage, similar to the node resources analyser. It would look something like below.

Here is a sample pod stat:

```json
{
  "podRef": {
    "name": "local-path-provisioner-957fdf8bc-ftqgx",
    "namespace": "kube-system",
    "uid": "801b3090-6258-4086-b656-08358794d332"
  },
  "startTime": "2024-03-27T14:13:46Z",
  "containers": [
    {
      "name": "local-path-provisioner",
      "startTime": "2024-03-27T14:13:49Z",
      "cpu": {
        "time": "2024-03-27T16:33:39Z",
        "usageNanoCores": 599742,
        "usageCoreNanoSeconds": 5772217000
      },
      "memory": {
        "time": "2024-03-27T16:33:39Z",
        "usageBytes": 18866176,
        "workingSetBytes": 18866176,
        "rssBytes": 15581184,
        "pageFaults": 2830,
        "majorPageFaults": 0
      },
      "rootfs": {
        "time": "2024-03-27T16:33:46Z",
        "availableBytes": 47286603776,
        "capacityBytes": 62671097856,
        "usedBytes": 28672,
        "inodesFree": 3842677,
        "inodes": 3907584,
        "inodesUsed": 8
      },
      "logs": {
        "time": "2024-03-27T16:33:47Z",
        "availableBytes": 47286603776,
        "capacityBytes": 62671097856,
        "usedBytes": 16384,
        "inodesFree": 3842677,
        "inodes": 3907584,
        "inodesUsed": 1
      }
    }
  ],
  "cpu": {
    "time": "2024-03-27T16:33:46Z",
    "usageNanoCores": 588027,
    "usageCoreNanoSeconds": 5783494000
  },
  "memory": {
    "time": "2024-03-27T16:33:46Z",
    "usageBytes": 19087360,
    "workingSetBytes": 19087360,
    "rssBytes": 15613952,
    "pageFaults": 3602,
    "majorPageFaults": 0
  },
  "network": {
    "time": "2024-03-27T16:33:42Z",
    "name": "eth0",
    "rxBytes": 428232,
    "rxErrors": 0,
    "txBytes": 111668,
    "txErrors": 0,
    "interfaces": [
      { "name": "tunl0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "gre0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "gretap0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "erspan0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "ip_vti0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "ip6_vti0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "sit0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "ip6tnl0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "ip6gre0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "eth0", "rxBytes": 428232, "rxErrors": 0, "txBytes": 111668, "txErrors": 0 }
    ]
  },
  "volume": [
    {
      "time": "2024-03-27T16:32:17Z",
      "availableBytes": 47286743040,
      "capacityBytes": 62671097856,
      "usedBytes": 24576,
      "inodesFree": 3842689,
      "inodes": 3907584,
      "inodesUsed": 11,
      "name": "config-volume"
    },
    {
      "time": "2024-03-27T16:32:17Z",
      "availableBytes": 8327376896,
      "capacityBytes": 8327389184,
      "usedBytes": 12288,
      "inodesFree": 1016518,
      "inodes": 1016527,
      "inodesUsed": 9,
      "name": "kube-api-access-hpzr7"
    }
  ],
  "ephemeral-storage": {
    "time": "2024-03-27T16:33:47Z",
    "availableBytes": 47286603776,
    "capacityBytes": 62671097856,
    "usedBytes": 73728,
    "inodesFree": 3842677,
    "inodes": 3907584,
    "inodesUsed": 21
  },
  "process_stats": {
    "process_count": 0
  }
}
```
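
For the volume-consumption case, the analyser spec could end up looking roughly like this; the filter and outcome fields (`filters.pvc.nameRegex`, `pvcUsedPercentage`) are only a sketch at this point, not a settled schema:

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - nodeMetrics: {}
  analyzers:
    - nodeMetrics:
        checkName: pvc-disk-usage
        filters:
          pvc:
            nameRegex: ".*"
        outcomes:
          - fail:
              when: "pvcUsedPercentage >= 80"
              message: "At least one PVC is more than 80% full"
          - pass:
              message: "All PVCs have sufficient free space"
```

The idea is the same as the node resources analyser: outcomes evaluate expressions against the collected metrics and report pass/fail in the bundle.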
mnp commented 4 months ago

Thank you @banjoh - we'll be trying some of these new features asap!

banjoh commented 4 months ago

Here are the relevant docs:

https://troubleshoot.sh/docs/collect/node-metrics/
https://troubleshoot.sh/docs/analyze/node-metrics/