operator-framework / operator-controller

catalogmetadata client adds extreme latency to reconciliation #914

Closed joelanford closed 1 week ago

joelanford commented 4 weeks ago

I've narrowed this latency down to the Bundles() call that is made for every single reconcile.

I added logging to the catalogmetadata.Client.Bundles() call to report start time and end time, and the function consistently takes around 30s, and it pegs a CPU.

This seems like a bug we should treat with a fairly high priority.

bentito commented 3 weeks ago

I added some timing data to the Bundles func too, recording times at the various return points (https://gist.github.com/bentito/9fbf0d81354caa52121f5c4e294bd506). The fake catalog is just 1 entry, I think? Would ~350µs (the longest test timing in one run, not scientific!) × number of items in a real catalog × number of calls to Bundles get close to 30s?

joelanford commented 2 weeks ago

I think that's probably an order of magnitude off still. I suspect that our fake catalog entry isn't as complex and varied as what exists in operatorhub in terms of what takes the most time (i.e. icons, olm.bundle.object properties).

acornett21 commented 2 weeks ago

Are we talking about this function? Or something else?

joelanford commented 2 weeks ago

Yep, that's the one.

joelanford commented 2 weeks ago

I think there's a broad change that needs to be made to solve this. We should add a controller for Catalog objects that manages a local cache for each catalog that exists. This controller would be able to:
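A rough sketch of the per-catalog local cache idea, under stated assumptions: `Bundle` and `CatalogCache` are hypothetical, simplified types invented for illustration, not existing operator-controller or operator-registry APIs. The point is only that a Catalog controller would update the cache when a catalog changes, so reconciles read from memory instead of re-fetching and re-parsing.

```go
package main

import (
	"fmt"
	"sync"
)

// Bundle is a hypothetical, simplified stand-in for a parsed bundle entry.
type Bundle struct {
	Package string
	Name    string
}

// CatalogCache is a hypothetical in-memory cache keyed by catalog name.
type CatalogCache struct {
	mu      sync.RWMutex
	bundles map[string][]Bundle
}

func NewCatalogCache() *CatalogCache {
	return &CatalogCache{bundles: make(map[string][]Bundle)}
}

// Set replaces the cached contents for one catalog. A Catalog controller
// would call this when a catalog is created or updated.
func (c *CatalogCache) Set(catalog string, b []Bundle) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.bundles[catalog] = b
}

// Delete drops a catalog's entries, e.g. when the Catalog object is removed.
func (c *CatalogCache) Delete(catalog string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.bundles, catalog)
}

// Bundles returns a copy of all cached bundles across catalogs.
// Order across catalogs is not guaranteed.
func (c *CatalogCache) Bundles() []Bundle {
	c.mu.RLock()
	defer c.mu.RUnlock()
	var out []Bundle
	for _, b := range c.bundles {
		out = append(out, b...)
	}
	return out
}

func main() {
	cache := NewCatalogCache()
	cache.Set("catalog-a", []Bundle{{Package: "foo", Name: "foo.v1.0.0"}})
	cache.Set("catalog-b", []Bundle{{Package: "bar", Name: "bar.v2.0.0"}})
	fmt.Println(len(cache.Bundles())) // 2

	cache.Delete("catalog-a")
	fmt.Println(len(cache.Bundles())) // 1
}
```

With this shape, the expensive work happens once per catalog change rather than once per reconcile.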

Separately, but related: I think the catalogmetadata wrapper types add unnecessary abstraction, and we would be better off interacting directly with the FBC data model (e.g. using only operator-registry's declcfg package, to the extent possible).