operator-framework / operator-controller

A new and improved management framework for extending Kubernetes with Operators
https://operator-framework.github.io/operator-controller/
Apache License 2.0
54 stars 53 forks source link

BUG: data too long issue rendering #923

Closed DrummyFloyd closed 2 months ago

DrummyFloyd commented 3 months ago

User story

issue created , due to mention in slack

as a recurent tester on this project, ( because i like it) i test upon some wanted operator on my stack.

atm i have some issue with following manifest with 0.10 OLM

apiVersion: catalogd.operatorframework.io/v1alpha1
kind: Catalog
metadata:
  name: operatorhubio
spec:
  source:
    type: image
    image:
      ref: quay.io/operatorhubio/catalog:latest
      pollInterval: 24h

---
apiVersion: olm.operatorframework.io/v1alpha1
kind: ClusterExtension
metadata:
  name: op-mariadb
spec:
  installNamespace: operators
  packageName: mariadb-operator
  version: 0.29.0

List if issues

error message rendered.

InstallationFailed:create: failed to create: Secret "sh.helm.release.v1.op-mariadb.v1" is invalid: data: Too long: must have at most 1048576 bytes

New content or update?

Net-new content

Types of documentation

No response

References

No response

joelanford commented 3 months ago

Thanks for submitting this issue!

There are two things that cause this bug:

  1. mariadb-operator has some very large CRDs
  2. Helm tries to store the entire release inside of a single secret, and etcd has a size limit for what it can store.
tmshort commented 3 months ago

We used tar + gz to create some of these resources, and that might be necessary here... unless even that is too big.

itroyano commented 3 months ago

We have a couple options to choose from (thanks @varshaprasad96 and @komish! ):

  1. Split the impl into 2 charts - crds and main, with a dependency between them and owner refs determining correct deletion sequence (until we implement a Finalizer that takes care of it). Pros: a best practice by helm team Cons: possibly more edge cases ?

  2. Compress manifests as @tmshort suggested. Pros: should take care of this issue and similar ones. Cons: there could be cases where it's too big even after compression. would require implementing a custom secret encode and decode mechanism, instead of using Helm's default one.

  3. Use a different storage backend e.g. SQL https://helm.sh/docs/topics/advanced/#storage-backends Pros: size is not an issue. Cons: requires spinning up and maintaining a PG instance.

tmshort commented 3 months ago

For (2) we do that for the test registry: https://github.com/operator-framework/operator-controller/blob/bfd41423e48c6fb9ca6567d81ab1904d1a1910c2/testdata/bundles/registry-v1/build-push-e2e-bundle.sh#L34-L37

acornett21 commented 3 months ago

This seems really strange to me that we are offloading CRD creation/management to helm when we know that helm itself can't manage the install/update of CRD's. I was under the impression (though it appears wrong), that OLMv1 was going to mange the creation of CRD's outside the context (knowledge) of helm. It seems that this was only in the context of namespace scoped installs, but even still that seems illogical, that we'd have two different control paths/flows within OLMv1.

Should we possibly look into a 4th option of managing the CRD's ourselves, or is that out of scope(effort/timeline)?

varshaprasad96 commented 3 months ago

Adding some thoughts based on initial findings and discussion with @itroyano:

Option 3: The options available to us in terms of possible supported Helm backends are: Secrets, Configmaps and SQL. The first two don't solve the problem, they have similar size limits. The last one - is in beta, as well as maintaining a separate component is an unnecessary added pain.

Option 2: Compression is a good idea, and helm by default does an encoding and its hardcoded in its implementation. But there is a small caveat here - I am not sure if compressing it twice will enable helm to read secret data. The implementation is in here, and compressing it right way, would cause issues when helm tries to decode and read it (I haven't tried it, but looks so) - eg: https://github.com/helm/helm/blob/1a500d5625419a524fdae4b33de351cc4f58ec35/pkg/storage/driver/secrets.go#L96.

The reason it works in the e2e, I think, is because Kaniko uncompresses the tarball from its context. Based on a quick glance, when we provide a tarball as the context for Kaniko, it extracts the contents of the tarball and then proceeds to build the Docker image using the extracted files. Which means the manifests that are passed for the chart's creation are still uncompressed.

The other option here, as alluded by @joelanford was to reimplement the whole secret driver - which is the code available here: https://github.com/helm/helm/blob/1a500d5625419a524fdae4b33de351cc4f58ec35/pkg/storage/driver/secrets.go. Re-implementing an additional compression, or even creating shards would probably take more effort for us as well as maintaining it could be an additional problem.

Option 1:

Helm by default does not manage the lifecycle of CRDs. If the CRDs are stored in a separate crd/ folder, then they are applied before creating a chart, and is hence not a part of the release (ref: https://github.com/helm/helm/blob/1a500d5625419a524fdae4b33de351cc4f58ec35/pkg/action/install.go#L254-L263). The major benefit of using OLM v1 is the exact solution to this problem, and the intentional reason afaik to include it as a part of release, is so that we implement our own set of CRDUpgradeSafety checks, and let Helm handle CRDs as it would do with any other manifest. If we need to separate CRDs from the rest of the manifests, there are 2 things we could do:

a. Handle CRDs on our own - with a kubectl apply b. Create a Helm chart that contains CRDs and the main chart containing manifests is marked as a dependent to the CRD chart. (This is one of the popular solutions: https://github.com/cloudnative-pg/charts/issues/280#issuecomment-2135939905)

Implementing (a) and (b) are synonymous imo. The only thing is - there shouldn't be an edge case, where CRDs themselves exceed the allowable size of Helm. I'm not sure if that is even a best practice (with other practical concerns that such huge CRDs have on performance, caching, (probably etcd limits) etc).

Both of these methods - which make us manage CRDs separately than manifests, bring in two concerns:

Both options (2) and (3) (ie reimplementing secret driver or separating out CRDs in another chart) come with their own maintenance challenges. The decision is to choose the one that is easier for us to implement and manage.

joelanford commented 3 months ago

I tinkered on this today and came up with a custom driver in helm-operator-plugins that:

  1. doesn't do an extra (unnecessary) layer of base64 encoding (which makes it possible to fit more release data in a single secret
  2. chunks the gzipped data from a release (based on a ChunkSize provided to the driver), and manages an ordered set of secrets for a particular key.

In theory, it could handle up to MaxUint64 chunks, but we would probably want to cap it (maybe at 10?) to have some control of a supportable upper bound.

joelanford commented 2 months ago

@DrummyFloyd I have #1057 up as a possible solution to this issue if you are interested in checking it out!

DrummyFloyd commented 2 months ago

@DrummyFloyd I have #1057 up as a possible solution to this issue if you are interested in checking it out!

│ Status:                                                                                                                                                                                                                                                                                │
│   Conditions:                                                                                                                                                                                                                                                                          │
│     Last Transition Time:  2024-07-17T21:28:50Z                                                                                                                                                                                                                                        │
│     Message:               resolved to "quay.io/operatorhubio/mariadb-operator@sha256:a96b0c89a9cfd307aee2e56b32ce51c2428e19b21e1dbe8f38386956ee73a618"                                                                                                                                │
│     Observed Generation:   1                                                                                                                                                                                                                                                           │
│     Reason:                Success                                                                                                                                                                                                                                                     │
│     Status:                True                                                                                                                                                                                                                                                        │
│     Type:                  Resolved                                                                                                                                                                                                                                                    │
│     Last Transition Time:  2024-07-17T21:28:56Z                                                                                                                                                                                                                                        │
│     Message:               Instantiated bundle op-mariadb successfully                                                                                                                                                                                                                 │
│     Observed Generation:   1                                                                                                                                                                                                                                                           │
│     Reason:                Success                                                                                                                                                                                                                                                     │
│     Status:                True                                                                                                                                                                                                                                                        │
│     Type:                  Installed                                                                                                                                                                                                                                                   │
│     Last Transition Time:  2024-07-17T21:28:39Z                                                                                                                                                                                                                                        │
│     Message:                                                                                                                                                                                                                                                                           │
│     Observed Generation:   1                                                                                                                                                                                                                                                           │
│     Reason:                Deprecated                                                                                                                                                                                                                                                  │
│     Status:                False                                                                                                                                                                                                                                                       │
│     Type:                  Deprecated                                                                                                                                                                                                                                                  │
│     Last Transition Time:  2024-07-17T21:28:39Z                                                                                                                                                                                                                                        │
│     Message:                                                                                                                                                                                                                                                                           │
│     Observed Generation:   1                                                                                                                                                                                                                                                           │
│     Reason:                Deprecated                                                                                                                                                                                                                                                  │
│     Status:                False                                                                                                                                                                                                                                                       │
│     Type:                  PackageDeprecated                                                                                                                                                                                                                                           │
│     Last Transition Time:  2024-07-17T21:28:39Z                                                                                                                                                                                                                                        │
│     Message:                                                                                                                                                                                                                                                                           │
│     Observed Generation:   1                                                                                                                                                                                                                                                           │
│     Reason:                Deprecated                                                                                                                                                                                                                                                  │
│     Status:                False                                                                                                                                                                                                                                                       │
│     Type:                  ChannelDeprecated                                                                                                                                                                                                                                           │
│     Last Transition Time:  2024-07-17T21:28:39Z                                                                                                                                                                                                                                        │
│     Message:                                                                                                                                                                                                                                                                           │
│     Observed Generation:   1                                                                                                                                                                                                                                                           │
│     Reason:                Deprecated                                                                                                                                                                                                                                                  │
│     Status:                False                                                                                                                                                                                                                                                       │
│     Type:                  BundleDeprecated                                                                                                                                                                                                                                            │
│     Last Transition Time:  2024-07-17T21:28:54Z                                                                                                                                                                                                                                        │
│     Message:               unpack successful:                                                                                                                                                                                                                                          │
│     Observed Generation:   1                                                                                                                                                                                                                                                           │
│     Reason:                UnpackSuccess                                                                                                                                                                                                                                               │
│     Status:                True                                                                                                                                                                                                                                                        │
│     Type:                  Unpacked                                                                                                                                                                                                                                                    │
│   Installed Bundle:                                                                                                                                                                                                                                                                    │
│     Name:     mariadb-operator.v0.29.0                                                                                                                                                                                                                                                 │
│     Version:  0.29.0                                                                                                                                                                                                                                                                   │
│   Resolved Bundle:                                                                                                                                                                                                                                                                     │
│     Name:     mariadb-operator.v0.29.0                                                                                                                                                                                                                                                 │
│     Version:  0.29.0                                                                                                                                                                                                                                                                   │
│ Events:       <none>                                                                                                                                                                                                                                                                   │
│                        

work =D

EDIT:

dunno, if this solution can also adresse error upon this kind of issue ?

│     Message:               InstallationFailed:template: object-779770e631338b63.yaml:124:63: executing "object-779770e631338b63.yaml" at <.data.user>: nil pointer evaluating interface {}.user   

if yes , do not hesitate to ask more context =)

joelanford commented 2 months ago

@DrummyFloyd this issue should be fixed in main now.

Do you have a reproducer for the other issue you ran into? If you're still having a problem there, could you open a new issue for that so we don't lose track?

DrummyFloyd commented 2 months ago

@DrummyFloyd this issue should be fixed in main now.

Do you have a reproducer for the other issue you ran into? If you're still having a problem there, could you open a new issue for that so we don't lose track?

thank you for the previous Issue =) yes i have a reproducer case => will create ticket tomorrow =)

apiVersion: olm.operatorframework.io/v1alpha1
kind: ClusterExtension
metadata:
  name: op-eso
spec:
  installNamespace: operators
  packageName: external-secrets-operator
  version: 0.9.20
  serviceAccount:
    name: default
---
apiVersion: catalogd.operatorframework.io/v1alpha1
kind: ClusterCatalog
metadata:
  name: operatorhubio
spec:
  source:
    type: image
    image:
      ref: quay.io/operatorhubio/catalog:latest
      pollInterval: 24h