operator-framework / operator-registry

Operator Registry runs in a Kubernetes or OpenShift cluster to provide operator catalog data to Operator Lifecycle Manager.
Apache License 2.0
212 stars 247 forks source link

Newer versions of opm serve have truncated data returned via the grpc GetBundleForChannel api? #1077

Open joehuizenga opened 1 year ago

joehuizenga commented 1 year ago

In prior versions of opm the GetBundleForChannel returned a lot more data than it does today, our testing pipeline relies on this data, particularly the bundlePath element.   Example:   Older catalog podman run -d --name catalog -p 50051:50051 "icr.io/cpopen/datapower-operator-catalog@sha256:2b29c318c29dd11b9a8075e64268ba94a93a7d9917cbf5f5f741eb9decbe3bf8" grpcurl -plaintext -d '{"pkgName": "datapower-operator", "channelName": "v1.4"}' localhost:50051 api.Registry.GetBundleForChannel | jq -r .bundlePath

returns the bundle image as expected icr.io/cpopen/datapower-operator-bundle@sha256:025a17b6b198843ab8c5bb4ab8490400ff181397c0b7c811fb3f59fb4ecf28d8   New catalog podman run -d --name catalog -p 50051:50051 "icr.io/cpopen/datapower-operator-catalog@sha256:6f2289e5d54acc62b10e1bd632a5ba45e8348c7f3b352996188e524fb2639b9b" grpcurl -plaintext -d '{"pkgName": "datapower-operator", "channelName": "v1.4"}' localhost:50051 api.Registry.GetBundleForChannel | jq -r .bundlePath

returns null?? null

joelanford commented 1 year ago

After a bit of investigation, it appears that we trimmed down the GetBundleForChannel response to resolve some on-cluster performance issues. See https://github.com/operator-framework/operator-registry/pull/769

That PR landed in 1.18.1. So the immediate fix is to revert to 1.18.0 if possible. Clearly that release was quite awhile ago, so there are probably reasons for that not being a great option.

One possibility is that we could add the bundlePath field back into the response.

Lastly, the FBC server does not exhibit this behavior on an FBC that is migrated from sqlite. Which begs two more questions:

  1. We have tests that verify that the GRPC responses from sqlite and FBC servers are identical with equivalent input. How are those tests passing?
  2. As users migrate to FBC, are clusters going to take the same performance hit that #769 already solved for sqlite-based catalogs?
joelanford commented 1 year ago

I think there are two answers for item (1):

  1. 769 updated the test for Sqlite, but not FBC. In retrospect, it isn't obvious looking at the test code that the expectation is that FBC and Sqlite responses should stay aligned. We may need to think about how to avoid this sort of drift in the future.

  2. It appears as if our testdata is loaded from the old packagemanifests format and therefore doesn't contain bundle image references. So all this time, the bundlePath field of the API bundle has been unset for our tests.
joehuizenga commented 1 year ago

Since this change was done way back in 1.18.0 and our usage of this api in our pipeline is more of an optimization, I think we can implement a workaround in our pipeline to locate the bundlePath (aka image), list all bundles, save as a map by csvName , use the GetBundleForChannel as is (which returns the csvName) and locate the bundle image using the map. As for the potential impact on performance from an FBC perspective, thinking that should be in another ticket