operator-framework / helm-operator-plugins

Experimental refactoring of the operator-framework's helm operator
Apache License 2.0

✨ Add support for a chunked release storage driver #350

Closed joelanford closed 1 month ago

joelanford commented 1 month ago

This commit adds a custom helm release storage driver that overcomes limitations in the size of a single value that can be stored in etcd.
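
For intuition, here is a minimal sketch of the chunking idea (illustrative only, not the PR's actual implementation; `maxChunkSize`, `writeChunks`, and the Secret names/labels are made up): gzip the serialized release, split the result into fixed-size pieces, and store each piece in its own Secret so that no single etcd value exceeds the size limit.

```go
// Sketch only: splitting a gzipped release across multiple Secrets.
// maxChunkSize, writeChunks, and the Secret names/labels are hypothetical,
// not the identifiers used by this PR.
package storage

import (
	"bytes"
	"compress/gzip"
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
)

// Stay well under etcd's default request/value size limit (~1.5 MiB).
const maxChunkSize = 512 * 1024

func writeChunks(ctx context.Context, secrets corev1client.SecretInterface, releaseKey string, releaseJSON []byte) error {
	// Compress the serialized release before splitting it into chunks.
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(releaseJSON); err != nil {
		return err
	}
	if err := zw.Close(); err != nil {
		return err
	}

	data := buf.Bytes()
	for i := 0; len(data) > 0; i++ {
		n := maxChunkSize
		if n > len(data) {
			n = len(data)
		}
		chunk := data[:n]
		data = data[n:]

		secret := &corev1.Secret{
			ObjectMeta: metav1.ObjectMeta{
				Name:   fmt.Sprintf("%s-chunk-%d", releaseKey, i),
				Labels: map[string]string{"release-key": releaseKey, "chunk-index": fmt.Sprint(i)},
			},
			// Raw gzipped bytes; no extra base64 pass on top of the Secret's own encoding.
			Data: map[string][]byte{"chunk": chunk},
		}
		if _, err := secrets.Create(ctx, secret, metav1.CreateOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```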

To remain backward compatible while also making this storage driver available, this commit refactors the ActionConfigGetter options so that a custom function can be provided to create the desired storage driver.

This commit also separates the rest config mapping into two separate options: one for interactions with the storage backend, and the other for managing release content.
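
The functional-option shape this describes could look roughly like the sketch below; every identifier here (DriverFactory, ActionConfigGetterOption, the With* options) is illustrative rather than the exported API added by this PR.

```go
// Shape sketch of the option refactor; names are illustrative, not this repo's API.
package client

import (
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/rest"

	"helm.sh/helm/v3/pkg/storage/driver"
)

// DriverFactory lets the caller decide which release storage driver to build
// from the Secret client the getter already has.
type DriverFactory func(secrets corev1client.SecretInterface) driver.Driver

type actionConfigGetterOptions struct {
	driverFactory DriverFactory
	// Separate rest configs: one for the storage backend (release Secrets),
	// one for managing release content on the cluster.
	storageRestConfig *rest.Config
	contentRestConfig *rest.Config
}

// ActionConfigGetterOption mutates the options before the getter is built.
type ActionConfigGetterOption func(*actionConfigGetterOptions)

func WithReleaseStorageDriver(f DriverFactory) ActionConfigGetterOption {
	return func(o *actionConfigGetterOptions) { o.driverFactory = f }
}

func WithStorageRestConfig(c *rest.Config) ActionConfigGetterOption {
	return func(o *actionConfigGetterOptions) { o.storageRestConfig = c }
}

func WithContentRestConfig(c *rest.Config) ActionConfigGetterOption {
	return func(o *actionConfigGetterOptions) { o.contentRestConfig = c }
}
```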

itroyano commented 1 month ago

Hi @joelanford some questions:

  1. Will this mean Helm CLI won't be able to see/parse the "chunked" release secret we made with Create()?
  2. Some funcs have deprecation warnings. Is it possible to implement the newer ones?
  3. Does using base32 in the "chunk" reduce the size of the resulting secret?
  4. I don't see an explicit call to get release. Will helm.sh Client be calling it when we use it in the code?
joelanford commented 1 month ago

@itroyano

Will this mean Helm CLI won't be able to see/parse the "chunked" release secret we made with Create()?

Yes, and that is desirable because:

  1. Our use of helm libraries is an implementation detail that helm CLI users should not know about.
  2. The Helm CLI would barf on them anyway

Some funcs have deprecation warnings. Is it possible to implement the newer ones?

Yes, the newer ones are already implemented in this PR, and the deprecation comments on the exported functions tell users what to switch to. Technically, because helm-operator-plugins is pre-1.0, we could remove instead of deprecate, but I figure we can let them stay for at least a few more releases.

Does using base32 in the "chunk" reduce the size of the resulting secret?

Base32 is only used to encode a hash for the chunk as a string that is short enough and valid both as a metadata.name and as a label value. It is not used to store the chunk data itself; the chunk data is written directly as gzipped binary byte slices.
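
One way such an identifier can be derived (a sketch; the exact hash and encoding choices in the PR may differ): hash the chunk and base32-encode the digest with a lowercase, unpadded alphabet, which yields 52 characters that are legal both in a Secret's metadata.name and in a label value.

```go
// Sketch: derive a short, name- and label-safe identifier for a chunk.
// The specific hash (sha256) and encoding details here are illustrative.
package storage

import (
	"crypto/sha256"
	"encoding/base32"
	"strings"
)

// chunkID returns a lowercase, unpadded base32 encoding of the chunk's hash.
// 32 hash bytes encode to 52 characters, which fits in a label value (max 63)
// and uses only characters valid in a Secret's metadata.name.
func chunkID(chunk []byte) string {
	sum := sha256.Sum256(chunk)
	enc := base32.StdEncoding.WithPadding(base32.NoPadding)
	return strings.ToLower(enc.EncodeToString(sum[:]))
}
```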

I don't see an explicit call to get release. Will helm.sh Client be calling it when we use it in the code?

The storage driver is set up in the helm library's action.Configuration.Releases. Other bits of the helm library use the action.Configuration struct to interact with the cluster.
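
For context, this is the seam the helm library exposes: the driver is wrapped by storage.Init and placed in action.Configuration.Releases, and the install/upgrade/get/uninstall actions all read and write releases through it. The helm types below are real; the chunked driver constructor name is hypothetical.

```go
// Real helm seams (action.Configuration, storage.Init, driver.Driver); only the
// chunked driver constructor passed in by the caller is hypothetical.
package client

import (
	"helm.sh/helm/v3/pkg/action"
	helmstorage "helm.sh/helm/v3/pkg/storage"
	"helm.sh/helm/v3/pkg/storage/driver"
)

// newActionConfig shows where a custom driver plugs in: helm's actions
// (install, upgrade, get, uninstall) all read and write releases through
// Configuration.Releases, so they use the chunked driver transparently.
func newActionConfig(d driver.Driver) *action.Configuration {
	return &action.Configuration{
		Releases: helmstorage.Init(d),
		// KubeClient, RESTClientGetter, Log, etc. are wired up as usual.
	}
}
```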

codecov-commenter commented 1 month ago

Codecov Report

Attention: Patch coverage is 80.00000% with 69 lines in your changes missing coverage. Please review.

Project coverage is 51.85%. Comparing base (08ab7fb) to head (14906f8). Report is 27 commits behind head on main.

| Files | Patch % | Lines |
|-------|---------|-------|
| pkg/storage/chunked.go | 80.08% | 29 Missing and 18 partials :warning: |
| pkg/client/actionconfig.go | 67.92% | 12 Missing and 5 partials :warning: |
| pkg/client/ownerrefclient.go | 80.76% | 3 Missing and 2 partials :warning: |

:exclamation: There is a different number of reports uploaded between BASE (08ab7fb) and HEAD (14906f8).

HEAD has 1 upload less than BASE:

| Flag | BASE (08ab7fb) | HEAD (14906f8) |
|------|----------------|----------------|
|      | 2              | 1              |

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main     #350       +/-   ##
===========================================
- Coverage   85.06%   51.85%   -33.22%
===========================================
  Files          19       65       +46
  Lines        1346     3238     +1892
===========================================
+ Hits         1145     1679      +534
- Misses        125     1456     +1331
- Partials       76      103       +27
```


everettraven commented 1 month ago

This is cool to see a custom helm storage driver!

I do have a more fundamental question - Is chunking/partitioning across multiple secrets a scalable approach? I don't necessarily think it needs to be optimized immediately, but considering this is likely going to be used by OLMv1, do we know:

  • The load on the API server for a typical install via the ClusterExtension API (maybe a better ? - how many secrets are created on average when installing a package from the community catalog with this change?)
  • The load on the API server when getting the release information from the chunked secrets using a typical install via ClusterExtension (IIRC we fetch the release information to determine state on every reconcile)?
  • From OLMv0, the average number of operators installed on a cluster?

While I do like the concept, I'm not confident that a "chunked" secrets driver is going to be the best approach here, purely out of scalability and API server load concerns. Additionally, this would require OLMv1, as a user of this API, to have access to secrets (which isn't any worse than it is now). Is this something we are OK with?

Could a potential alternative to utilizing secrets be some other in-memory or file-based storage mechanism? Maybe something like https://github.com/dgraph-io/badger or BoltDB would be a good embeddable key-value store option?

EDIT: For clarification, if data shows that the concerns I expressed in this comment are not realistically a concern, then I am happy to look at this PR again with a new frame on the changes.

joelanford commented 1 month ago

I do have a more fundamental question - Is chunking/partitioning across multiple secrets a scalable approach?

I think it can be. We could impose a maximum number of chunks to prevent unchecked release size. We could also use a controller-runtime cache to read secrets instead of the default live client.

I don't necessarily think it needs to be optimized immediately, but considering this is likely going to be used by OLMv1, do we know:

  • The load on the API server for a typical install via the ClusterExtension API (maybe a better ? - how many secrets are created on average when installing a package from the community catalog with this change?)

Anecdotally, I tried reproducing https://github.com/operator-framework/operator-controller/issues/923, and it only created one secret because this storage driver doesn't do an extra/unnecessary base64-encode (that the helm secrets driver does). I'm speculating, but I would guess this driver would create 1 secret on average from the community catalog, so no more than are already created.
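
The overhead being avoided is the usual ~33% expansion of base64; a quick standalone illustration (not project code):

```go
// Base64 expands its input by roughly 4/3, so skipping the extra encode keeps
// more release data under the per-object size limit.
package main

import (
	"crypto/rand"
	"encoding/base64"
	"fmt"
)

func main() {
	gzipped := make([]byte, 900*1024) // pretend this is a 900 KiB gzipped release
	if _, err := rand.Read(gzipped); err != nil {
		panic(err)
	}
	encoded := base64.StdEncoding.EncodeToString(gzipped)
	fmt.Printf("raw: %d bytes, base64: %d bytes (+%.0f%%)\n",
		len(gzipped), len(encoded), 100*float64(len(encoded)-len(gzipped))/float64(len(gzipped)))
	// raw: 921600 bytes, base64: 1228800 bytes (+33%)
}
```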

  • The load on the API server when getting the release information from the chunked secrets using a typical install via ClusterExtension (IIRC we fetch the release information to determine state on every reconcile)?

It would only be increased for the releases that require chunking. One thing we could do relatively easily in the future is implement a SecretInterface that actually reads from the controller-runtime cache.
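
A sketch of what that could look like, assuming reads are served from a controller-runtime client.Reader (for example the manager's cache) while all other SecretInterface methods are delegated to the embedded live client; the names here are illustrative:

```go
// Sketch only: a SecretInterface whose reads come from the controller-runtime
// cache while writes still go to the live client.
package client

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	corev1client "k8s.io/client-go/kubernetes/typed/core/v1"
	crclient "sigs.k8s.io/controller-runtime/pkg/client"
)

type cachedSecrets struct {
	corev1client.SecretInterface // writes, lists, deletes, etc. go to the live client
	reader    crclient.Reader    // typically mgr.GetCache(); serves reads from the informer cache
	namespace string
}

// Get is served from the cache instead of hitting the API server directly.
func (c *cachedSecrets) Get(ctx context.Context, name string, _ metav1.GetOptions) (*corev1.Secret, error) {
	var s corev1.Secret
	if err := c.reader.Get(ctx, types.NamespacedName{Namespace: c.namespace, Name: name}, &s); err != nil {
		return nil, err
	}
	return &s, nil
}
```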

  • From OLMv0, the average number of operators installed on a cluster?

I think I've seen metrics of ~10 on average.

While I do like the concept, I'm not confident that a "chunked" secrets driver is going to be the best approach here just out of scalability and API server load concerns. Additionally, this would require OLMv1 as a user of this API to have access to secrets (which isn't any worse than it is now), is this something we are OK with?

This is actually part of the intent of this PR. OLM should be the thing that has access to secrets. We already have the ability to store all of these secrets in OLMv1's system namespace. This PR will also make it possible to use a different service account to manage the release secrets.

We would give OLMv1's service account read/write access to secrets only in its own namespace, AND we would not require users to provide service accounts with secret read/write permissions, which is good because requiring that would leak an implementation detail that users should not be concerned with.
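
Concretely, the RBAC shape being described could look something like the following (the namespace and role name are placeholders, not what OLMv1 ships): a namespaced Role granting secret read/write only in the system namespace, bound to OLMv1's own service account, while user-provided service accounts never need secret permissions.

```go
// Illustrative only: the operator's own service account gets secret read/write
// in its system namespace; user-provided service accounts need no secret access.
// The role name and namespace are placeholders.
package rbac

import (
	rbacv1 "k8s.io/api/rbac/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

var releaseSecretsRole = rbacv1.Role{
	ObjectMeta: metav1.ObjectMeta{
		Name:      "release-secrets",
		Namespace: "olmv1-system", // placeholder for OLMv1's system namespace
	},
	Rules: []rbacv1.PolicyRule{{
		APIGroups: []string{""},
		Resources: []string{"secrets"},
		Verbs:     []string{"get", "list", "watch", "create", "update", "patch", "delete"},
	}},
}
```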

Could a potential alternative to utilizing secrets be some other in-memory or file-based storage mechanism? Maybe something like dgraph-io/badger or BoltDB are some good embeddable key-value store options?

This release data needs to persist beyond the lifetime of an individual operator-controller pod. As we've discovered in other areas, the only reliably available persistence mechanism is Kubernetes API + etcd.