This terrible hack, made possible by the recent addition of `runDaemonSet`, may be enough to tide us over. It scrapes the node volume metrics, which are in Prometheus format, and throws them in the collected logs. Some massaging would provide a usage percentage.
```yaml
collectors:
  - runDaemonSet:
      name: "disk-usage"
      podSpec:
        containers:
          - name: metrics-scraper
            image: busybox
            command:
              - "sh"
              - "-c"
              - "wget -q -O - ${NODE_IP}:10255/metrics | grep kubelet_volume_stats"
            env:
              - name: NODE_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
```
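For the "massaging," the same DaemonSet could compute the percentage in-line. A rough, untested sketch: it assumes the kubelet read-only port (10255) is enabled, which hardened clusters often disable, and that the standard `kubelet_volume_stats_used_bytes` / `kubelet_volume_stats_capacity_bytes` gauges are exposed.

```yaml
collectors:
  - runDaemonSet:
      name: "disk-usage-pct"
      podSpec:
        containers:
          - name: metrics-scraper
            image: busybox
            command:
              - "sh"
              - "-c"
              # Pair each PVC's used/capacity gauges by their label set and print a percentage.
              - |
                wget -q -O - ${NODE_IP}:10255/metrics \
                  | grep '^kubelet_volume_stats' \
                  | awk '{ split($1, a, "{"); key = a[2] }
                         /_used_bytes/     { used[key] = $2 }
                         /_capacity_bytes/ { cap[key]  = $2 }
                         END { for (k in used) if (cap[k] > 0) printf "%s %.1f%%\n", k, 100 * used[k] / cap[k] }'
            env:
              - name: NODE_IP
                valueFrom:
                  fieldRef:
                    fieldPath: status.hostIP
```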
This might be a good place to use Prometheus and have it send an alert when space looks low.

Let's spend some time looking at that `kubedf` script to see what we can do with it.
Yes, good point @xavpaice - we had the same idea: collect into our Prometheus (we try not to step on yours) and then retrieve a few key metrics from there. Someone else pointed out the `kubedf` method only works for certain CSI providers, so nothing here is both holistic and 100%, it seems.
My thought here would be to collect all the node metrics with a new collector, similar to the custom metrics one. In hindsight, we should have just created a k8s metrics collector to collect node and external metrics.

The new collector would be something simple like below:
```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  collectors:
    - nodeMetrics: {}
```
We can then have a `nodeMetrics` analyser that reports on various metrics, starting with volume consumption at this stage, similar to the node resources analyser. It would look something like the sketch below.
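A minimal sketch of such an analyser: the `checkName` field, the `filters.pvc.nameRegex` filter, and the `pvcUsedPercentage` variable are assumptions; see the node-metrics docs linked below for the authoritative syntax.

```yaml
apiVersion: troubleshoot.sh/v1beta2
kind: SupportBundle
metadata:
  name: sample
spec:
  analyzers:
    - nodeMetrics:
        checkName: pvc-disk-usage            # assumed field name
        filters:
          pvc:
            nameRegex: "data-.*"             # assumed filter; match PVCs by name
        outcomes:
          - fail:
              when: "pvcUsedPercentage >= 80"   # assumed built-in variable
              message: "A PVC is using more than 80% of its capacity."
          - pass:
              message: "All PVC usage is below 80%."
```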
Thank you @banjoh - we'll be trying some of these new features asap!
Here are the relevant docs:
https://troubleshoot.sh/docs/collect/node-metrics/
https://troubleshoot.sh/docs/analyze/node-metrics/
Describe the rationale for the suggested feature.
Troubleshoot collects PVC specs but not disk usage.
Describe the feature
K8s users can use a script like `kubedf`, available here, which calls the `/api/v1/nodes` API and collects capacity bytes, available bytes, and percent used. This algorithm would port cleanly to Go for implementation as a collector; maybe call it "pvcDiskUsage"? I imagine it would take an optional namespace and an optional PVC name (default = all). Note that not everyone knows all their PVC names ahead of time; sometimes they're dynamically created.
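For illustration only, the proposed collector (hypothetical; `pvcDiskUsage` is the name suggested above, not an existing collector) might be configured like this:

```yaml
collectors:
  - pvcDiskUsage:            # hypothetical collector, not implemented
      namespace: my-app      # optional; omit to scan all namespaces
      name: data-my-app-0    # optional; omit to include all PVCs
```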
Describe alternatives you've considered

1. Run `df` in the pod. However, pods which are "from scratch", et al, do not contain `df`, so that's not always an option.
2. Build an image with `kubedf`, `jq`, and `kubectl` and run it with `runPod` (see the sketch after this list). It would be better if this was built into troubleshoot.
3. Have an `http` collector pull it. That would be ideal also, but I didn't see one.
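A sketch of the `runPod` alternative, assuming a purpose-built image (the image and service account names are placeholders, and the pod's service account would need RBAC access to the node stats API):

```yaml
collectors:
  - runPod:
      name: kubedf
      namespace: default
      podSpec:
        serviceAccountName: kubedf-sa          # placeholder; needs read access to node stats
        restartPolicy: Never
        containers:
          - name: kubedf
            image: example.com/kubedf:latest   # hypothetical image bundling kubedf, jq, kubectl
            command: ["kubedf"]
```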
Additional context

Our users create PVCs dynamically, and when they fill up, it's a source of errors. A support bundle containing utilization metrics would be ideal.