mykhailo-b commented 2 years ago

Greetings. We faced the problem of high CPU usage by the olm operator in openshift 4.11

https://github.com/okd-project/okd/releases/download/4.11.0-0.okd-2022-08-20-022919/release.txt

We examined the source code and images of the operator ( https://quay.io/repository/openshift/okd-content/manifest/sha256:6ad02f2e27937f4ec449718c27dbbb0870b55c910b21f4a22f202ce1cfb56d6f ) and found out that the operator is built from this repository https://github.com/openshift/operator-framework-olm

It seems that this repository uses an outdated version of operator-lifecycle-manager https://github.com/openshift/operator-framework-olm/tree/master/staging/operator-lifecycle-manager https://github.com/openshift/operator-framework-olm/tree/master/vendor/github.com/operator-framework/operator-lifecycle-manager

The OLM_VERSION file indicates that this is version 0.19.0 ( https://github.com/openshift/operator-framework-olm/blob/master/staging/operator-lifecycle-manager/OLM_VERSION ), but in reality it is not.

We were especially interested in the absence of this fix https://github.com/operator-framework/operator-lifecycle-manager/commit/b85df587ecdcc1adb0f89ad4c6e8bac0b7a75af2

Can you comment on our findings?

kevinrizza commented 2 years ago

Hey @mykhailo-b ,

That repo (https://github.com/openshift/operator-framework-olm) is where OCP and OKD releases live. Today, there isn't a relationship between specific versioned releases of this repository and that downstream OpenShift version -- the openshift branches are generally a snapshot in time + a set of curated commits that are pulled onto a given release branch.

The fix you referenced is quite old and actually predates the inception of that downstream openshift repository, so it's definitely included -- keep in mind that that repo is not a fork so there isn't commit matching, but you can search commit message history after initial inception (which happened around the end of the year in 2020) for specific commits if you are interested.

So, all that being said, I think it's unlikely that the commit you referenced is related to a performance problem you're having in OKD 4.11. Could you give us any more information about the specific cpu issue? Are you seeing it on the olm-operator? The catalog-operator? What's the topology of your cluster? Any specific data you have about the cpu profile would be helpful.

EugeneMospan commented 2 years ago

Hi @kevinrizza

Thank you for your quick reply. Let me step in because we were working together with @mykhailo-b on the issue.

Our context is the following:
1) We are using OKD 4.11.
2) We have OpenShift Container Storage installed in the cluster.
3) We are seeing that olm-operator continuously consumes about 700 millicores of CPU and keeps updating the status of a resource of kind: Operator, which in our cluster is named ocs-operator.openshift-storage.
4) We figured this out by setting the olm-operator log level to debug. The logs are in the attached screenshot (MicrosoftTeams-image (26)).
5) Then we started looking into the code and found this line of code in version 0.19: https://github.com/operator-framework/operator-lifecycle-manager/blob/864b58ddc63742b53ecdf21f463e13ac2ce9de7e/pkg/controller/operators/operator_controller.go#L281
6) We built an image that includes this functionality and deployed it to our cluster.
7) After this, olm-operator stopped continuously reconciling kind: Operator ocs-operator.openshift-storage and, as a result, stopped consuming CPU.

Could you please guide us on what is wrong? We are not sure that it is safe to go ahead with such a workaround, and we had to switch off the cluster-version-operator because it replaces our custom changes with the original ones.

BR, Eugene

awgreene commented 2 years ago

Hello there,

I appreciate your patience on this matter. I confirmed that the latest version of OLM was spamming the API server with operator CR status updates. I then created a branch of OLM from master and reverted the commit introduced in #2697, and the revert resolved the issue.

#2697 was created to address an issue where the operator CR status didn't capture all resources associated with an operator. The fix will need to address those needs without spamming the API server with status updates.

EugeneMospan commented 2 years ago

Hi @awgreene ,

Thank you for the investigation! Could you please tell us when the fix will be introduced to OKD itself? Our current workaround for the CPU consumption is not optimal.

BR, Eugene

awgreene commented 2 years ago

@EugeneMospan,

I hope to create a PR fixing this issue later this week. In a worst case scenario where a suitable fix cannot be found, I will consider reverting #2697 to at least resolve the unacceptable CPU usage.

I've applied the priority/critical-urgent label to convey the severity of this issue.

Best,

Alex

awgreene commented 2 years ago

Hey @EugeneMospan,

I took a look and found that the operator CR includes a list of related components in its status. The list of components was ordered by GVK, but components of the same GVK weren't ordered by namespace/name, potentially causing OLM to spam the server. The changes in #2880 should address the issue you've hit.
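The ordering problem described above can be illustrated with a minimal Go sketch. It is not the code from #2880: the `componentRef` type and the `sortRefs` and `needsUpdate` helpers are hypothetical names invented for this illustration. The assumption is that a status list built in a non-deterministic order looks "changed" on every reconcile unless both sides are put into one canonical order (GVK, then namespace/name) before they are compared.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// componentRef is a hypothetical, simplified stand-in for one entry in the
// Operator CR's status component list; it is not the real OLM type.
type componentRef struct {
	Group, Version, Kind string
	Namespace, Name      string
}

// sortRefs orders refs by GVK first and then by namespace/name, so the same
// set of components always produces the same list.
func sortRefs(refs []componentRef) {
	sort.Slice(refs, func(i, j int) bool {
		a, b := refs[i], refs[j]
		if a.Group != b.Group {
			return a.Group < b.Group
		}
		if a.Version != b.Version {
			return a.Version < b.Version
		}
		if a.Kind != b.Kind {
			return a.Kind < b.Kind
		}
		if a.Namespace != b.Namespace {
			return a.Namespace < b.Namespace
		}
		return a.Name < b.Name
	})
}

// needsUpdate reports whether the freshly computed refs differ from the ones
// already stored in the status, once both are in canonical order. It works on
// copies so the callers' slices are left untouched.
func needsUpdate(stored, computed []componentRef) bool {
	s := append([]componentRef(nil), stored...)
	c := append([]componentRef(nil), computed...)
	sortRefs(s)
	sortRefs(c)
	return !reflect.DeepEqual(s, c)
}

func main() {
	stored := []componentRef{
		{"apps", "v1", "Deployment", "openshift-storage", "a"},
		{"apps", "v1", "Deployment", "openshift-storage", "b"},
	}
	// Same components, discovered in a different order.
	computed := []componentRef{stored[1], stored[0]}
	fmt.Println(needsUpdate(stored, computed)) // prints "false"
}
```

With only the GVK comparison in place, the two `Deployment` entries could swap positions between reconciles, and a naive element-by-element comparison would then trigger a status update each time.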

I suspect that it will take a few days to move the API changes out of the vendored dir and into github.com/operator-framework/api, but feel free to test the image changes with this image: quay.io/agreene/olm:operator-api-spam

EugeneMospan commented 2 years ago

Thank you @awgreene we will try and come back to you

EugeneMospan commented 2 years ago

@awgreene I've applied the fix to one cluster. At first glance, it is no longer spamming requests to update the Operator status. If the issue comes back, I will let you know.

BR, Eugene

awgreene commented 2 years ago

Thanks @EugeneMospan!

beelzetron commented 2 years ago

Hello, I'm hitting this issue on OCP 4.11.13 as well. I confirm that @awgreene's OLM image fixes the high CPU load.

kcalmond commented 1 year ago

Also confirming this image fixed high cpu consumption (OCP v4.11.20)

imageID: >-
        quay.io/agreene/olm@sha256:2a7a8754e1bbf3e96e27cbfd35aed8811e4d32338a751818f054ee213da1a95d
      image: 'quay.io/agreene/olm:operator-api-spam'

kcalmond commented 1 year ago

I noticed same high OLM CPU usage on a 4.10.47 cluster. I restarted the pod using the @awgreene provided image above. It did not change CPU consumption. It runs continuously consuming between ~400-800 mCPU on my 4.10 cluster.

sfritze commented 1 year ago

> I noticed same high OLM CPU usage on a 4.10.47 cluster. I restarted the pod using the @awgreene provided image above. It did not change CPU consumption. It runs continuously consuming between ~400-800 mCPU on my 4.10 cluster.

I notice the same behaviour on 4.11.0-0.okd-2023-01-14-152430; it's not present on 4.12.0-0.okd-2023-02-04-212953.

awgreene commented 1 year ago

Hello, I don't think allowing this ticket to act as a generic tracker for "OLM CPU Utilization is High" is the best path forward. #2880 fixed a specific issue causing OLM to spam the API server. If you still see OLM using high CPU utilization, please create a new ticket and capture the exact steps to reproduce.

operator-framework / operator-lifecycle-manager

High CPU usage by the olm operator #2874
