replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0

Enable analyzers that work on "available" resources: in other words those not already reserved #1182

Open crdant opened 1 year ago

crdant commented 1 year ago

Describe the rationale for the suggested feature.

I'd like to be able to include logic around "available" resources on a node when writing analyzers that deal with node resources. This will help me determine whether my Kubernetes cluster will be able to schedule my pod before I attempt my install, assuming I align my check with my resource requests.

Describe the feature

With this feature implemented, I'd be able to write a preflight that looks like this:

```yaml
- nodeResources:
    checkName: Are sufficient CPU resources available in the cluster
    outcomes:
      - fail:
          when: "min(cpuAvailable) < 250m"
          message: Your cluster currently has too few CPU resources available to install Gitea
      - pass:
          message: Your cluster has sufficient CPU resources available to install Gitea
- nodeResources:
    checkName: Is sufficient memory available in the cluster
    outcomes:
      - fail:
          when: "min(memoryAvailable) < 256Mi"
          message: Your cluster currently has too little memory available to install Gitea
      - pass:
          message: Your cluster has sufficient memory available to install Gitea
```

and fail the install if my resource requests could not be fulfilled on any node (or any node that I've filtered into my analyzer).
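Evaluating an expression like `min(cpuAvailable) < 250m` requires parsing Kubernetes quantity strings into comparable numbers. In the real project this would be `resource.Quantity` from `k8s.io/apimachinery`; as a self-contained sketch of what the comparison involves, here is a simplified parser (hypothetical helper names, and it handles only a few suffixes):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseCPUMillis converts a Kubernetes-style CPU quantity ("250m", "2")
// into millicores. Simplified stand-in for apimachinery's resource.Quantity.
func parseCPUMillis(q string) (int64, error) {
	if strings.HasSuffix(q, "m") {
		return strconv.ParseInt(strings.TrimSuffix(q, "m"), 10, 64)
	}
	n, err := strconv.ParseInt(q, 10, 64)
	return n * 1000, err // whole cores -> millicores
}

// parseMemoryBytes converts a binary memory quantity ("256Mi", "1Gi") into bytes.
func parseMemoryBytes(q string) (int64, error) {
	suffixes := map[string]int64{"Ki": 1 << 10, "Mi": 1 << 20, "Gi": 1 << 30}
	for suf, mult := range suffixes {
		if strings.HasSuffix(q, suf) {
			n, err := strconv.ParseInt(strings.TrimSuffix(q, suf), 10, 64)
			return n * mult, err
		}
	}
	return strconv.ParseInt(q, 10, 64) // bare number of bytes
}

func main() {
	cpu, _ := parseCPUMillis("250m")
	mem, _ := parseMemoryBytes("256Mi")
	fmt.Println(cpu, mem) // 250 268435456
}
```

With both sides of the `when` clause normalized to millicores or bytes, the analyzer's comparison becomes a plain integer check.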

kubectl describe node provides insight into these values, but they are not part of the node's status, so simply getting the node doesn't show them.

Describe alternatives you've considered


Additional context

It seems like the best way to handle this is to collect all the resource requests for the pods running on the node and subtract them from the allocatable resources on that node. Based on the order of the kubectl describe node output, I'd bet that's what it's doing, though I haven't read through the code to check.
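The subtraction described above can be sketched with simplified types (these are assumptions for illustration, not the actual troubleshoot collector types):

```go
package main

import "fmt"

// PodRequest holds the CPU (millicores) and memory (bytes) requested by one pod.
type PodRequest struct {
	CPUMillis int64
	MemBytes  int64
}

// availableOnNode subtracts the resource requests of every pod scheduled on a
// node from the node's allocatable capacity, which appears to be what
// `kubectl describe node` reports as allocated vs. allocatable.
func availableOnNode(allocCPUMillis, allocMemBytes int64, pods []PodRequest) (cpuMillis, memBytes int64) {
	cpuMillis, memBytes = allocCPUMillis, allocMemBytes
	for _, p := range pods {
		cpuMillis -= p.CPUMillis
		memBytes -= p.MemBytes
	}
	return cpuMillis, memBytes
}

func main() {
	// Two pods on a node with 4 CPUs and 8Gi allocatable.
	pods := []PodRequest{
		{CPUMillis: 500, MemBytes: 512 << 20},
		{CPUMillis: 250, MemBytes: 128 << 20},
	}
	cpu, mem := availableOnNode(4000, 8<<30, pods)
	fmt.Printf("available: %dm CPU, %d bytes memory\n", cpu, mem)
}
```

In the real collector, the allocatable values would come from each node's `status.allocatable`, and the pod requests from listing pods field-selected to that node.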

I also asked ChatGPT to write me a kubectl plugin to calculate this to see what the code might look like, I'm attaching it for fun and reference. kubectl-available-plugin.tar.gz

@chris-sanders, @diamonwiggins, and I chatted about this on Slack.

crdant commented 1 year ago

At first it seemed like this would be feasible to do within the analyzer, since clusterResources would contain everything needed. Unfortunately, it looks like it has to happen in the clusterResources collector, in case the pods collected are limited to certain namespaces.

It seems like the best thing to do would be to add the value to every node when collecting node info. My first thought was to put it on the status, but that would break the type. It feels like putting it in an annotation would make sense in that case: something like troubleshoot.sh/cpu-available and troubleshoot.sh/memory-available added to the nodes after getting the item list.
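As a minimal sketch of the annotation idea, assuming the proposed troubleshoot.sh/* key names (still up for discussion) and a plain annotation map rather than the real node type:

```go
package main

import "fmt"

// annotateNode records computed availability on a node's annotation map,
// so analyzers can read it back without the status type changing.
// Key names follow the suggestion in this issue and are not final.
func annotateNode(annotations map[string]string, cpuMillis, memBytes int64) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations["troubleshoot.sh/cpu-available"] = fmt.Sprintf("%dm", cpuMillis)
	annotations["troubleshoot.sh/memory-available"] = fmt.Sprintf("%d", memBytes)
	return annotations
}

func main() {
	anns := annotateNode(nil, 3250, 7918845952)
	fmt.Println(anns["troubleshoot.sh/cpu-available"]) // 3250m
}
```

The collector would call something like this for each node after getting the item list, and a nodeResources analyzer could then expose cpuAvailable/memoryAvailable by reading the annotations back.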

Any thoughts on this design?