Stop Relying on `opm` Embedded in Catalog Index Images

stevekuznetsov commented 1 year ago

Today, when a CatalogSource is created, the Pod that serves catalog content uses opm to do so; the version of the server that's used is whichever version happens to be bundled into the index image. This design has a couple of draw-backs:

it's very difficult to reason about which versions of opm are used
updating opm requires rebuilding all catalogs
testing changes to opm requires the same
downstream consumers of the catalog cannot make many assumptions around what version of opm they might encounter in a catalog and what they can assume of it

When the data model was tightly coupled to the binary reading it (at the time from SQLite), bundling both the content and the server made a lot of sense. Today, though, it's possible we're not in that state any longer, as the move to FBC has given us a) a stable format and b) a non-obfuscated storage mechanism.

The catalog-operator already has an --opmImage flag to consume a well-known image containing opm to use, so implementing this feature will only require us to add a new field to the CatalogSource to opt catalogs into this:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  grpcPodConfig:
    extractContent:
      cacheDir: /var/cache/
      configDir: /var/config/

Then, the catalog-operator can create a Pod that:

runs an init-container with the index image, which copies the cache and config dirs into a shared EmptyDir
run the well-known --opmImage in a container, passing the two directories to it.

Open Questions:

today, opm is deserializing the content of the catalog on disk to validate the cache; if new fields are added to FBC, an older opm will not be able to understand those fields and will fail to reproduce the hash - what should we do?
- we could add an init container with the provided index image to validate the hash, and forgo the check in the main server as long as opm faithfully serves cache content without pruning unknown fields
- we could not make additions to the FBC schema (are we generic enough today that new schema won't occur?)
in general, what other concerns are there for version skew? what's our support statement for catalogs on older clusters? can we guarantee that older catalogs are built with older opm versions?

kevinrizza commented 1 year ago

Five hundred feet, I think this is a reasonable change. It's more or less what we want to do anyway now that FBC is a well known format. So a +1 to the concept.

That being said, I think version skew just becomes an issue here over time and we need to be careful about it. You mentioned the hash generation, but to me that speaks to a larger problem with the way that most folks currently ship catalogs. There is an expectation that catalogs work across cluster versions -- for example in Openshift, it's a pretty common pattern for disconnected clusters to upgrade their cluster then remirror the latest version of the catalog image. I wonder if we need to have a way for the catalog source api to be smart enough to know if a given catalog matches its expected format (fbc in the right place, has the same schema version, etc) and if it doesn't should we fail? Or, if there is a binary in the image, should we attempt to start it and serve it that way? Making the api opt in seems like a middleground, but 99% of the time users aren't actually controlling these fields, they're consuming them through a tool that generated the yaml (oc-mirror for example), as a package (ibm's cloudpak cases), or they're getting it from openshift's cvo.

can we guarantee that older catalogs are built with older opm versions?

I think we have to assume that there's often a diff between the version of opm that built the catalog, the schema version of the fbc, and the current version of the cluster. It's been very useful for us to be able to ship catalogs out of band of openshift patch releases, and making sure that we maintain that feature is really important. Maybe that just means having a very strict fbc schema version and making the cluster schema aware? I don't think we've historically been very strict about versioning the fbc so far.

stevekuznetsov commented 1 year ago

99% of the time users aren't actually controlling these fields, they're consuming them through

Are we OK if the feature we ship is opt-in and we get all of the places we control to opt into it, with the express understanding of what that entails? I am way more interested in solving these 99% cases than making this feature generic enough to be able to be turned on by default for every catalog.

stevekuznetsov commented 1 year ago

Maybe that just means having a very strict fbc schema version and making the cluster schema aware? I don't think we've historically been very strict about versioning the fbc so far.

It's either this, or making the server not read the files it's serving to folks, if it's not going to be doing anything to the FBC data on disk before serving anyway.

kevinrizza commented 1 year ago

Are we OK if the feature we ship is opt-in and we get all of the places we control to opt into it, with the express understanding of what that entails? I am way more interested in solving these 99% cases than making this feature generic enough to be able to be turned on by default for every catalog.

I think that makes sense to me. I'd be interested to hear some other perspectives on how these catalog images are being distributed, but at least for openshift's default case it seems like this should work fine.

It's either this, or making the server not read the files it's serving to folks, if it's not going to be doing anything to the FBC data on disk before serving anyway.

Agreed. Today, though, we don't actually have an explicit versioning scheme for FBC, so it might make sense for us to invent one for this purpose.

grokspawn commented 1 year ago

Agreed. Today, though, we don't actually have an explicit versioning scheme for FBC, so it might make sense for us to invent one for this purpose.

FBC is intended to be versioned by component sub-schema, for e.g. if we need to extend olm.package we can define an olm.package.v2 schema which replaces it. I'm hand-waving past all of the "order-of-precedence and support between v2 and original versions" in the context of this conversation.

stevekuznetsov commented 1 year ago

We could also write a subroutine into opm that checks that all the schema versions it finds on disk are known and serve-able, and if not, exit with some well-known status. On that status we can fall back to using the opm in the index?

joelanford commented 1 year ago

IMO, any v1 version of opm should be backward compatible wrt FBC schema support. So as long as the opm serve version is equal to or newer than the version of opm used to render/build the FBC, then that should always work (until we release opm v2, at least)

operator-framework / operator-lifecycle-manager

Stop Relying on `opm` Embedded in Catalog Index Images #3019