replicatedhq / troubleshoot

Preflight Checks and Support Bundles Framework for Kubernetes Applications
https://troubleshoot.sh
Apache License 2.0

Create CRD for Troubleshoot specs #664

Closed: xavpaice closed this issue 2 years ago

xavpaice commented 2 years ago

As part of #650 we would like to create a Custom Resource Definition to be able to define specs for Troubleshoot.

Ideally a single CRD would be created for Troubleshoot, to reduce the number of custom objects that need to be maintained. However, after reviewing best practices and the validation requirements, more than one CustomResourceDefinition may turn out to be appropriate.

Definition of done:

edgarlanting commented 2 years ago

Currently reading up on CRDs and how we can implement our specs with them. Working on a small test CRD to see how we can use and design this. I've also checked internally regarding existing and previous work done here.

edgarlanting commented 2 years ago

Todo: summarize the limitations that using CRDs would impose on us (permissions etc.) so that we understand what we're changing.

Based on an internal discussion, using a CRD for collectors and analyzers in combination with Troubleshoot seems to impose some caveats/challenges:

# either Namespaced or Cluster
  scope: Namespaced

The official k8s docs state that CRDs "always use the same authentication, authorization, and audit logging as the built-in resources of your API server". This means that, by default, most customers who have implemented RBAC will not have (or grant) access to any new resource that is introduced, such as our Troubleshoot CRD.

Some customers will of course be true admins, but some will not be, so that is a point of interest to investigate and test further. To work around this, we could suggest (or add) the option of creating a new ClusterRole and having the customer's user(s) bound to it, which would then allow them to use Troubleshoot with a CRD as an option rather than as a new default. If they don't want to implement that, the regular spec will still be used.
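
As a rough sketch of that workaround (the role name, subject, and the exact resource list here are assumptions for illustration, not an agreed design; the troubleshoot.sh group matches our existing kinds), an admin could create and bind something like:

  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRole
  metadata:
    name: troubleshoot-spec-reader        # hypothetical name
  rules:
    - apiGroups: ["troubleshoot.sh"]      # API group of our existing kinds
      resources: ["collectors", "analyzers", "redactors", "preflights", "supportbundles"]
      verbs: ["get", "list", "watch"]
  ---
  apiVersion: rbac.authorization.k8s.io/v1
  kind: ClusterRoleBinding
  metadata:
    name: troubleshoot-spec-reader
  roleRef:
    apiGroup: rbac.authorization.k8s.io
    kind: ClusterRole
    name: troubleshoot-spec-reader
  subjects:
    - kind: User
      name: customer-user                 # hypothetical customer user
      apiGroup: rbac.authorization.k8s.io

Note that installing the CRD itself still requires cluster-admin-level access; a role like this only covers reading the custom resources once the CRD exists.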

One task to perform is to set up a cluster with typical, minimal RBAC and see how this all functions in a hypothetical customer situation.

The spec we apply/use is in effect already a custom resource whose CRD has simply never been deployed to a cluster. According to our Makefile, there are already generate and schema targets:

  .PHONY: generate
  generate: controller-gen client-gen
      $(CONTROLLER_GEN) \
          object:headerFile=./hack/boilerplate.go.txt paths=./pkg/apis/...
      $(CLIENT_GEN) \
          --output-base=./../../../ \
          --output-package=github.com/replicatedhq/troubleshoot/pkg/client \
          --clientset-name troubleshootclientset \
          --input-base github.com/replicatedhq/troubleshoot/pkg/apis \
          --input troubleshoot/v1beta1 \
          --input troubleshoot/v1beta2 \
          -h ./hack/boilerplate.go.txt

  .PHONY: openapischema
  openapischema: controller-gen
      controller-gen crd +output:dir=./config/crds paths=./pkg/apis/troubleshoot/v1beta1
      controller-gen crd +output:dir=./config/crds paths=./pkg/apis/troubleshoot/v1beta2

  .PHONY: schemas
  schemas: fmt vet openapischema
      go build ${LDFLAGS} -o bin/schemagen github.com/replicatedhq/troubleshoot/cmd/schemagen
      ./bin/schemagen --output-dir ./schemas
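
For context, the controller-gen output in ./config/crds is a set of full CRD manifests. A heavily trimmed sketch of the shape of one such manifest (field values assumed for illustration, not copied from the repo):

  apiVersion: apiextensions.k8s.io/v1
  kind: CustomResourceDefinition
  metadata:
    name: preflights.troubleshoot.sh
  spec:
    group: troubleshoot.sh
    scope: Namespaced            # either Namespaced or Cluster, as noted above
    names:
      kind: Preflight
      plural: preflights
      singular: preflight
    versions:
      - name: v1beta2
        served: true
        storage: true
        schema:
          openAPIV3Schema:
            type: object
            x-kubernetes-preserve-unknown-fields: true   # placeholder; the real schema is generated
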
edgarlanting commented 2 years ago

Current status of tasks/work:

Reviewed CRD information and background documentation in the official k8s docs.

Gone through the existing CRD configuration in our Troubleshoot repo here, since it seems some work has been done on this previously. It does look like we handle our spec as a custom resource, but we don't install the CRD into a cluster.

Checked the aforementioned CRD config files for structure and naming (stuttering). The logic behind them is not yet clear to me, as I haven't spoken to the engineers involved (Andrew, Ethan, Salah, Dmitriy); my next step is to talk to one of them (based on the most suitable timezone) and get more insight into the how and why of this.

Once we have that information, we can build on it and implement using a CRD as an option for Troubleshoot. One good point that Xav & Chris made is that the claim that CRDs are not usable in our case is somewhat invalid, because we already do the same kind of thing with Velero. Velero relies on CRDs to function, and installing it on, for example, an RBAC-enabled cluster has to be done by a customer contact with appropriate access, which is pretty normal in any production-like environment.

For @xavpaice's reference: I might have explained the option to create a CRD namespaced or cluster-wide incorrectly to you; either way, installing a CRD still needs admin-like access to the cluster, as described in my previous comment.

One other thing we need to do is document how to edit/update and operate a CRD in the cluster in combination with Troubleshoot. I've already added a README.md for this to the config/crds folder in this repo, but it still needs to be merged from this PR.

marccampbell commented 2 years ago

@edgarlanting thanks for this write-up. I think you've come across our problem here, which is that it's not possible to install a CRD without extremely elevated RBAC permissions. We intentionally made the decision to "look like a CRD" but not write an operator, to ensure that this functions as a client-side plugin, not a server-side plugin.

The issue here is "Create CRD", which is done, right? Our "kinds" are Custom Resource Definitions by even the strictest interpretation. Is the issue here more about allowing the CRD to be deployed, i.e. an operator that functions on it in the cluster?

The thing I'm not clear about is the connection to #650. Can you elaborate on how an "installable" CRD helps provide a mergeable spec?

xavpaice commented 2 years ago

The first two bullet points are already done, as per what's already in main; we just didn't know until Edgar found it there.

The second two look like they need some further discussion.

This task is not on the critical path for mergeable specs itself, but allows a different mechanism to provide multiple specs.

@marccampbell the intent with "installable" CRDs is that maintainers of various components (e.g. the application, kURL, KOTS, etc.) can install their own custom resources for Troubleshoot specs, which Troubleshoot could then search for and combine, removing the maintenance of those specs from the Troubleshoot repo (or kots). This allows, e.g., each add-on in kURL to install an appropriate collector, analyzer, and redactor for that add-on only, and allows someone to update the specs for that component without needing to run other updates at the same time.
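
As a hypothetical sketch of that model (the resource name, namespace, and analyzer contents are assumptions for illustration, not an agreed design), a kURL add-on could install a per-component custom resource such as:

  apiVersion: troubleshoot.sh/v1beta2
  kind: Analyzer
  metadata:
    name: rook-analyzers          # hypothetical add-on spec
    namespace: kurl               # hypothetical namespace
  spec:
    analyzers:
      - deploymentStatus:
          name: rook-ceph-operator
          namespace: rook-ceph
          outcomes:
            - fail:
                when: "< 1"
                message: The rook-ceph-operator deployment is not ready.
            - pass:
                message: The rook-ceph-operator deployment is ready.

Troubleshoot would then list these resources across namespaces and merge them into one effective spec.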

A similar alternative might be to search for multiple secrets or configmaps, but looking at the k8s docs, CRDs appear to be the appropriate home for these. I use kURL and kots as examples here, but there's no reason someone not using other Replicated tools wouldn't benefit from the same.

The issue isn't so much with the CRDs being in files, but that if the CRDs are not installed in the cluster, then the Troubleshoot specs (custom resources in this case) need to be provided to Troubleshoot in some other compatible way. We are trying to avoid having to incorporate all the ready-made specs in the Troubleshoot repo, so that folks can update their specs independently. The biggest effect is that new specs can be provided without needing to run updates for, say, kots, which folks would rather avoid. Currently, the fact that the default specs live in the kots code is a barrier to getting updates.

One other alternative is to install a secret with each component that includes a collector and analyzer spec (and redactor?), and have Troubleshoot list all the secrets matching some selector.
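
A minimal sketch of that Secret-based approach, assuming a discovery label and data key along these lines (both names illustrative):

  apiVersion: v1
  kind: Secret
  metadata:
    name: my-component-troubleshoot          # hypothetical component Secret
    namespace: my-component
    labels:
      troubleshoot.sh/kind: support-bundle   # selector Troubleshoot would list on
  stringData:
    support-bundle-spec: |
      apiVersion: troubleshoot.sh/v1beta2
      kind: SupportBundle
      metadata:
        name: my-component
      spec:
        collectors:
          - logs:
              namespace: my-component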

Is the problem with installing CRDs the CRD installation itself, or the add/update of the custom resources? If we're comfortable with Troubleshoot making use of (but not requiring) specs located in the k8s cluster, is it possible to ask a sysadmin to do the CRD installation as a one-time thing during setup? I'm not sure about schema updates, though.

For others to reference: this file shows where we consume the CRDs in the source with schemagen. This article is a good write-up of how this works.

edgarlanting commented 2 years ago

Thanks @marccampbell for checking into this and @xavpaice for answering the question!

chris-sanders commented 2 years ago

There's only one other thing I don't see addressed above by Xav, so I'll add it here.

"allow the CRD to be deployed, i.e. an operator that functions in the cluster on it?"

A small distinction I'd make: I don't think allowing a Custom Resource to be installed necessarily requires having an operator that operates on it. I'm not saying an operator could never be a good feature, but I don't see value in one, and we would have to find good justification for it given the amount of upkeep it brings. Nothing about this effort is intended to prepare for, or imply, a switch to an in-cluster operator.

I agree the value is client-side, and I think Custom Resources are a good additional option from a client-side perspective.

marccampbell commented 2 years ago

@chris-sanders, thanks. Given that, I guess I don't understand yet. Without an operator, being able to install CRDs into the cluster feels mostly like a "fork"? IIRC, troubleshoot can read from Secrets in the cluster, found with a label. Those secrets' data are the custom resources for collectors, analyzers, support bundles, etc.

Is this a matter of being "technically correct" and doing it the native Kubernetes way? Or are we missing functionality because we don't install these as Custom Resources natively today?

chris-sanders commented 2 years ago

It's more about using the available tools for a better UX; our use case is a Custom Resource which we just don't install. While we can work around some things, we can't work around others, like the API server performing validation of the custom object. It's also a lot easier, for example, for a user to run kubectl get Backup -A to see all the backups they have defined than to try to figure out how to find all the custom ConfigMaps that represent backups.

Adding CRDs just allows us to give users who can, and are fine with, installing a CRD a significantly better experience. It doesn't mean you can't reference a Secret, but it does mean that Secret gets no validation from the API server, and I don't see a reason to implement that ourselves if we provide an option to use a Custom Resource, which will be validated.
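
To make the validation point concrete: with an installed CRD, the API server rejects malformed objects at apply time based on the CRD's OpenAPI schema, along the lines of this simplified, assumed fragment (a Secret's data gets no such check):

  schema:
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          required: ["collectors"]   # assumed constraint; a spec missing it is rejected at apply time
          properties:
            collectors:
              type: array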

@marccampbell, especially given that the CRDs already exist and are maintained today, do you see some issue with offering this alongside the half dozen other ways we have to specify a bundle?

chris-sanders commented 2 years ago

After some more internal discussion, this is not an easy call. We'll finish confirming that these CRDs can be installed if we want to, but give more thought to whether we put in the effort to make CRDs another option that can be specified on the command line.

marccampbell commented 2 years ago

It sounds like we are getting to the same place. It would be nice to offer it as an in-cluster CRD, but it's also another thing to maintain and support moving forward. We can definitely provide validation in the kubectl plugin, if that was the selling point (I don't think that's the whole reason, though).

kubectl get ... preflights doesn't feel incredibly useful, though. That's just the spec to run it; what would that do, and how would you start a new preflight (as an example)?

chris-sanders commented 2 years ago

Yup, I agree with you @marccampbell: it carries an additional support burden to implement, and the benefit isn't clear. We're taking this off the current development backlog.