rancher / charts-build-scripts

Apache License 2.0
9 stars 32 forks source link

Make CRDDirectory and UseTarArchive mutually exclusive #131

Closed joshmeranda closed 3 months ago

joshmeranda commented 4 months ago

Issue

When we enable UseTarArchive when generating crd charts from a template, we copy the crds into the crd chart (ex charts-crd/crd-manifest) and then create the tar archive. For most situations this is fine; however, any chart generated is going to be much larger than necessary due to having the raw crds included alongside the compressed archive of crds.

For charts with large crds this could result in the chart exceeding the maximum size for a helm chart even when the size of the functional chart files is well below that threshold.

Unfortunately, we cannot simply delete the CRDDirectory nor the files it contains due to issues handling the validate-install-crd.yaml resource:

encountered error while applying main changes from charts-crd to main chart: encountered error while trying to add CRD validation to charts based on CRDs in charts-crd: unable to pull any GroupVersionKinds for CRDs from charts-crd/crd-manifest to construct templates/validate-install-crd.yaml

Solution

If we make the crdDirectory and useTarArchive crd options mutually exclusive and handle each option separately, we can avoid including bloat into the generated crd charts. This approach also makes it more obvious from the package.yaml what we expect to be included in the crd chart.

Note

This approach does make it more annoying to generate patches for the crd chart since devs cannot just edit the raw crds directly. Now they will have to untar the crd archive, edit the crds, and then recreated the archive with the modified crds.

adamkpickering commented 4 months ago

Heads up: the build-validate check is failing.

Is there a chart that is at risk of being too large? I imagine this is the case, otherwise you wouldn't be putting together a PR like this. I ask because this appears to be a pretty big change. If we're going to move forward with it, @nicholasSUSE and @lucasmlp need to be in the loop. There may be some design involved.

I didn't (and still don't) completely understand how the CRDs in a CRD chart are installed, so I spent some time trying to understand this today. I noticed a possible problem. Check out dev-v2.9 and run a make prepare on the rancher-monitoring/rancher-monitoring package. You'll get the CRD chart in the packages/rancher-monitoring/rancher-monitoring/charts-crd directory. Take a look at charts-crd/templates/jobs.yaml. It appears to be referring to the contents of the crd-manifest directory in the job that is creating CRDs (line 58). I could be wrong here, but this seems to indicate that we need the untarred CRDs in the crd-manifest directory. If this is the case, it makes this PR more difficult. Can you tell if this is the case?

I need to look at a few other things today, so I'm going to park this until tomorrow. Let me know what you think.

joshmeranda commented 4 months ago

Yeah I knew about the build failing since rancher-monitoring is using both fields. I can't update rancher-monitoring until this PR is merged, and I can't merge this PR until those changes are made, but I can't make those changes until this pr... :carousel_horse:

Yes the new rancher-monitoring crds from a more recent upstream is too large to be installed via helm. They themselves had to do some funky workarounds which lead to the other crd specific pr. Un-archived they are small enough but pushing it, archive they are fine, but together they are too large.

I figured we'd need more eyes on this eventually.

So you are right that the current crd install job uses the crds in crd-manifest; however, they are identical to those in the archive so we can just unarchive the crds into the necessary directory for the job to succeed. I tested this before sending up the PR, so we should be good to here as well.

@adamkpickering let me know if that clarifies some of the sticking points for you. Thanks!

joshmeranda commented 3 months ago

The only way I can see this making sense with the changes in this PR is if the unarchving happens in the Job manifest in (for example) packages/rancher-monitorying/rancher-monitoring/templates/crd-template/templates/jobs.yaml. Can you please confirm that is the case?

Yes that is the case. Strictly speaking I don't know if some of those checks are really necessary anymore, but I'll leave that for another day. Obviously that will increase the job's resource usage, but the chart will be installable.

adamkpickering commented 3 months ago

Sorry to complicate things further, but is it really necessary to make CRDDirectory and UseTarArchive mutually exclusive? I think this is easily doable.

joshmeranda commented 3 months ago

No worries its a completely fair and good question to ask!

Technically they don't NEED to be mutually exclusive. The main problem this is solving is keeping the rancher-monitoring-crd chart below the maximum helm chart size. Unfortunately even just the raw crds (along with the rest of the chart which is generated from those crds) are too big so we need the archive to keep the chart valid.

For the rancher-monitoring use case we need a way to specify that we do not want the raw crds, so here are a few options that I thought of:

  1. Delete crd-manifest

Simplest option is to simply delete the crd-manifest directory and let the charts-build-scripts patch and apply system handle the exclusion for us.

This works fine for a subset of the crd files, but importantly not all because this check will fail. Even deleting the subset of crds will cause a mal-formed validate-crds-install.yaml`, since some of the crds we are supposed to be checking for won't be included.

  1. Create a rancher-monitoring-crd package

We could create a new rancher-monitoring-package to handle the crds directory. This is possible because the rancher-monitoring upstream kube-prometheus-stack keeps its crds in their own subchart. This would effectively handle thei problem for us, the main downside being that we would only have the raw crds and not an archive.

Unfortunately, that crd subchart does not have a matching version with kube-prometheus-stack so the generated rancher-monitoring and rancher-monitoring-crd will have mismatching values (ex 104.0.0+up45.31.1 and 104.0.0+up0.0.0 respectively), and as far as I know there is no good way to override the upstream version (for good reason).

  1. Allow both crdDirectory and useTarArchive

We could allow both options, and just by default ignore crdDirectory and only use the useTarArchive. This solution would work, and the code would be quite similar to what is in this PR.

The main complaint I have with this is that that is not clear enough that only the useTarArchive would take affect. Sure we can document this, but its more clear to make those options exclusive and fail bad configurations early and loudly. IMO not a bad solution, but not the best solution.

  1. Add new onlyUseTarArchive or similar option

We could add some field to signal that we only want the archive and to delete the crd-manifest. This is actually quite appealing since it allows for maximum control of crd chart structure.

But since we don't really need the raw crds, I would argue that it encourages creating charts with useless bloat by including an archive and the original files. Plus feels redundant to have two useTarArchive-esk fields that do similar things.

  1. Make crdDirectory and useTarArchive mutually exclusive

Makes it clear from the package.yaml what the expected generated crd chart will look like. No erroneous paths that aren't used, or seemingly redundant fields.

I could easily be missing something here, but as far as I can tell the "mutual exclusion" solution is our best option.

adamkpickering commented 3 months ago

Fair enough. I guess this code is another case of "it could be a lot better if we refactored it, but right now we are trying to solve a problem" :shrug:

It occurred to me that if there are packages other than rancher-monitoring/rancher-monitoring that define both UseTarArchive and CRDDirectory, that could cause problems when we use this code. I did a check for this, and it looks like rancher-monitoring/rancher-monitoring is the only one :tada:

There are a couple of outstanding comments - I'll approve this once those are addressed. It would also make sense for @lucasmlp or @nicholasSUSE to give this a quick look just to make sure I haven't missed anything.