replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0

Collector for PVC disk usage #1496

Closed: mnp closed this issue 4 months ago

mnp commented 6 months ago

Describe the rationale for the suggested feature.

Troubleshoot collects PVC specs but not disk usage.

Describe the feature

K8s users can use a script like kubedf (available here), which calls the /api/v1/nodes API and collects capacity bytes, available bytes, and percent used. This algorithm would port cleanly to Go for implementation as a collector; maybe call it "pvcDiskUsage"?

I imagine it would take an optional namespace and an optional PVC name (default = all). Note that not everyone knows all their PVC names ahead of time; sometimes they're created dynamically.
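As a strawman, the spec might look something like this (nothing here exists yet; "pvcDiskUsage" and its fields are just this proposal):

```yaml
spec:
  collectors:
    - pvcDiskUsage:          # hypothetical collector name from this proposal
        namespace: my-app    # optional; default is all namespaces
        name: data-my-app-0  # optional PVC name; default is all PVCs
```

Leaving both fields out would cover the dynamically created PVCs mentioned above.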

Describe alternatives you've considered

Additional context

Our users create PVCs dynamically and when they fill up, it's a source of errors. A support bundle containing utilization metrics would be ideal.

mnp commented 5 months ago

This terrible hack, made possible by the recent addition of runDaemonSet, may be enough to tide us over. It scrapes the node volume metrics, which are in Prometheus format, and throws them into the collected logs. Some massaging would provide usage percentages.

```yaml
collectors:
  - runDaemonSet:
      name: "disk-usage"
      podSpec:
        containers:
          - name: metrics-scraper
            image: busybox
            command:
              - "sh"
              - "-c"
              - "wget -q -O - ${NODE_IP}:10255/metrics | grep kubelet_volume_stats"
            env:
              - name: NODE_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
```
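
As an untested sketch of the "massaging" mentioned above, the same hack could pair kubelet_volume_stats_used_bytes with kubelet_volume_stats_capacity_bytes per PVC label set and print a usage percentage directly (the awk below is illustrative, not something we run in production):

```yaml
collectors:
  - runDaemonSet:
      name: "disk-usage-pct"
      podSpec:
        containers:
          - name: metrics-scraper
            image: busybox
            command:
              - "sh"
              - "-c"
              - |
                # scrape kubelet volume stats, then pair used and capacity bytes per PVC label set
                wget -q -O - ${NODE_IP}:10255/metrics |
                  awk '/^kubelet_volume_stats_used_bytes/     { k=$1; sub(/^[^{]*/, "", k); used[k]=$2 }
                       /^kubelet_volume_stats_capacity_bytes/ { k=$1; sub(/^[^{]*/, "", k); cap[k]=$2 }
                       END { for (k in used) if (cap[k] > 0) printf "%s %.1f%%\n", k, 100 * used[k] / cap[k] }'
            env:
              - name: NODE_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
```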
xavpaice commented 5 months ago

This might be a good place to use Prometheus and have it send an alert when space is running low.

Let's spend some time looking at kubedf to see what we can do with it.

mnp commented 5 months ago

Yes, good point @xavpaice - we had the same idea: collect into our own Prometheus (we try not to step on yours) and then retrieve a few key metrics from there. Someone else pointed out that the kubedf method only works for certain CSI providers, so it seems nothing here is both holistic and 100% reliable.

banjoh commented 5 months ago

My thinking here is that we should collect all the node metrics with a new collector, similar to the custom metrics one. In hindsight, we should have just created a k8s metrics collector that collects node and external metrics.

The new collector spec would be something simple, like the one below:

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - nodeMetrics: {}
```

We can then have a nodeMetrics analyser that reports on various metrics, starting with volume consumption at this stage, similar to the node resources analyser. It would look something like below.

Here is a sample pod stat:

```json
{
  "podRef": {
    "name": "local-path-provisioner-957fdf8bc-ftqgx",
    "namespace": "kube-system",
    "uid": "801b3090-6258-4086-b656-08358794d332"
  },
  "startTime": "2024-03-27T14:13:46Z",
  "containers": [
    {
      "name": "local-path-provisioner",
      "startTime": "2024-03-27T14:13:49Z",
      "cpu": {
        "time": "2024-03-27T16:33:39Z",
        "usageNanoCores": 599742,
        "usageCoreNanoSeconds": 5772217000
      },
      "memory": {
        "time": "2024-03-27T16:33:39Z",
        "usageBytes": 18866176,
        "workingSetBytes": 18866176,
        "rssBytes": 15581184,
        "pageFaults": 2830,
        "majorPageFaults": 0
      },
      "rootfs": {
        "time": "2024-03-27T16:33:46Z",
        "availableBytes": 47286603776,
        "capacityBytes": 62671097856,
        "usedBytes": 28672,
        "inodesFree": 3842677,
        "inodes": 3907584,
        "inodesUsed": 8
      },
      "logs": {
        "time": "2024-03-27T16:33:47Z",
        "availableBytes": 47286603776,
        "capacityBytes": 62671097856,
        "usedBytes": 16384,
        "inodesFree": 3842677,
        "inodes": 3907584,
        "inodesUsed": 1
      }
    }
  ],
  "cpu": {
    "time": "2024-03-27T16:33:46Z",
    "usageNanoCores": 588027,
    "usageCoreNanoSeconds": 5783494000
  },
  "memory": {
    "time": "2024-03-27T16:33:46Z",
    "usageBytes": 19087360,
    "workingSetBytes": 19087360,
    "rssBytes": 15613952,
    "pageFaults": 3602,
    "majorPageFaults": 0
  },
  "network": {
    "time": "2024-03-27T16:33:42Z",
    "name": "eth0",
    "rxBytes": 428232,
    "rxErrors": 0,
    "txBytes": 111668,
    "txErrors": 0,
    "interfaces": [
      { "name": "tunl0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "gre0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "gretap0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "erspan0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "ip_vti0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "ip6_vti0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "sit0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "ip6tnl0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "ip6gre0", "rxBytes": 0, "rxErrors": 0, "txBytes": 0, "txErrors": 0 },
      { "name": "eth0", "rxBytes": 428232, "rxErrors": 0, "txBytes": 111668, "txErrors": 0 }
    ]
  },
  "volume": [
    {
      "time": "2024-03-27T16:32:17Z",
      "availableBytes": 47286743040,
      "capacityBytes": 62671097856,
      "usedBytes": 24576,
      "inodesFree": 3842689,
      "inodes": 3907584,
      "inodesUsed": 11,
      "name": "config-volume"
    },
    {
      "time": "2024-03-27T16:32:17Z",
      "availableBytes": 8327376896,
      "capacityBytes": 8327389184,
      "usedBytes": 12288,
      "inodesFree": 1016518,
      "inodes": 1016527,
      "inodesUsed": 9,
      "name": "kube-api-access-hpzr7"
    }
  ],
  "ephemeral-storage": {
    "time": "2024-03-27T16:33:47Z",
    "availableBytes": 47286603776,
    "capacityBytes": 62671097856,
    "usedBytes": 73728,
    "inodesFree": 3842677,
    "inodes": 3907584,
    "inodesUsed": 21
  },
  "process_stats": {
    "process_count": 0
  }
}
```
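
For the volume-consumption case, the analyser spec could end up looking roughly like this; the filter and outcome fields (`filters.pvc.nameRegex`, `pvcUsedPercentage`) are only a sketch at this point, not a settled schema:

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - nodeMetrics: {}
  analyzers:
    - nodeMetrics:
        checkName: pvc-disk-usage
        filters:
          pvc:
            nameRegex: ".*"
        outcomes:
          - fail:
              when: "pvcUsedPercentage >= 80"
              message: "At least one PVC is more than 80% full"
          - pass:
              message: "All PVCs have sufficient free space"
```

The idea is the same as the node resources analyser: outcomes evaluate expressions against the collected metrics and report pass/fail in the bundle.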
mnp commented 4 months ago

Thank you @banjoh - we'll be trying some of these new features asap!

banjoh commented 4 months ago

Here are the relevant docs:

https://troubleshoot.sh/docs/collect/node-metrics/
https://troubleshoot.sh/docs/analyze/node-metrics/