replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0
545 stars 93 forks source link

Ability to analyze properties of a PersistentVolumeClaim #1195

Closed diamonwiggins closed 1 year ago

diamonwiggins commented 1 year ago

Describe the rationale for the suggested feature.

Currently, Troubleshoot has no native way of analyzing a PersistentVolumeClaim to ensure that it has the minimum amount of storage required, or that it contains a required access or volume mode. A user may need this functionality in order to guarantee in a Preflight that a user provided PersistentVolumeClaim is in a desired configuration.

Describe the feature Provide functionality where a user can check for a min or max value on the storage size of a PVCZ as well as its access or volume modes.

Option 1: A separate analyzer similar to nodeResources - (https://github.com/replicatedhq/troubleshoot/blob/main/pkg/analyze/node_resources.go), provide a new analyzer where a user can check for a min or max value as well as access or volume modes.

    - nodeResources:
        checkName: check-pvc-size
        outcomes:
          - pass:
              when: "storageCapacity > 100Gi"
              message: pass
          - fail:
              message: fail
    - nodeResources:
        checkName: check-access-mode
        outcomes:
          - pass:
              # Would also need to support a `contains` sort of operator for lists
              when: "accessMode == ReadWriteMany"
              message: pass
          - fail:
              message: fail

Option 2: Add functionality for parsing resource quantities to the existing clusterResourceanalyzer

https://pkg.go.dev/k8s.io/apimachinery/pkg/api/resource#MustParse

    - clusterResource:
        checkName: check-pvc-size
        kind: pvc
        name: pvc-name
        namespace: default
        yamlPath: "spec.resources.requests.storage"
        regexGroups: '(?P<PVCSize>\d+[A-Za-z]{2})'
        outcomes:
          - pass:
              when: "PVCSize > 100Gi"
              message: PVC is the right size
          - fail:
              message: PVC is too small
    - clusterResource:
        checkName: check-access-mode
        kind: pvc
        name: pvc-name
        namespace: default
        yamlPath: "spec.accessModes"
        regex: ReadWriteMany
        outcomes:
            - fail:
                when: "false"
                message: is not RWX
            - pass:
                when: "true"
                message: is RWX
xavpaice commented 1 year ago

Seems like the second option is going to be something that's useful beyond this particular request, so I have some preference for that rather than maintaining a separate, specific, analyzer that won't see a ton of use.

CpuID commented 1 year ago

While I agree option 2 is more flexible/useful - I do wonder if the YAML syntax being far simpler/readable in option 1, if taking an approach like that would be better? We may need to implement more possibilities in Go to handle things of course...

CpuID commented 1 year ago

Pro's and Con's of each:

Option 1

Pro: better YAML readability Pro: less error prone for a vendor to implement (unlikely to "get it wrong"?) Con: need to write extra Go code for each thing we want to inspect in future (could be made fairly extensible though)

Option 2

Pro: flexibility (can be used on lots of different resource types) Con: error prone spec authoring by vendors? (will they get the yamlPath or the regex wrong, and open us up to having to support authoring those more?) Con: reduced YAML readability Con: implementation of going through nested structs for yamlPath in Go could get interesting? maybe less of an issue knowing about interfaceutils... example: https://github.com/replicatedhq/troubleshoot/pull/780/files#diff-2d83a08caa568888b3ef7aa56971f130408c38f15ee58fc0f0ffe45c99f53219R91


FWIW I'm leaning towards Option 1

diamonwiggins commented 1 year ago

I'm not saying we shouldn't do Option 1, but I think some additional pros for Option 2 would be:

CpuID commented 1 year ago

After I started getting in to try write some of this, I realised the Option 1 example YAML syntax will need extra fields added to it (eg. PVC name), which while its not much - it does start to make it less clean/minimal, and the bigger the Option 1 syntax gets the more weight leans towards Option 2 as it's more flexible.

Then I saw @danj-replicated got an itch and threw https://github.com/replicatedhq/troubleshoot/pull/1210 out there - which looking at it handles the PVC size check aspect in an Option 2 style approach. I also found in the test suite for clusterResource there's examples of checking the access mode of a PVC in there.

Might be worth me not pursuing Option 1 based on progress so far...? May not be necessary at the moment. We can always circle back if it's deemed necessary later.

CpuID commented 1 year ago

Did a bunch of testing with Dan's branch, and it basically ticks the boxes that this issue requires right now, and with even cleaner YAML syntax than the Option 2 example at the top.

https://github.com/replicatedhq/troubleshoot/pull/1210#issuecomment-1579379095

I'm leaning towards getting @danj-replicated 's PR reviewed/approved/merged at this point 👍

CpuID commented 1 year ago

This was resolved in https://github.com/replicatedhq/troubleshoot/pull/1210

Released in Troubleshoot v0.68.0 https://github.com/replicatedhq/troubleshoot/releases/tag/v0.68.0

The example documented here covers how it can be used https://troubleshoot.sh/docs/analyze/cluster-resource/#example-analyzer-definition