chore: tuning resource usage for operator pod

zdtsw commented 1 month ago

Reduce CPU and memory usage of the operator pod.
The request values are based on some data from testing in a large cluster, psi-04.
The limit values are based on the file attached to Jira issue RHOAIENG-9806.
related to https://issues.redhat.com/browse/RHOAIENG-9806

How Has This Been Tested?

Screenshot or short clip

Merge criteria
- [ ] You have read the contributors guide.
- [ ] Commit messages are meaningful - have a clear and concise summary and detailed explanation of what was changed and why.
- [ ] Pull Request contains a description of the solution, a link to the JIRA issue, and to any dependent or related Pull Request.
- [ ] Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
- [ ] The developer has manually tested the changes and verified that the changes work

adelton commented 1 month ago

Is this related to https://issues.redhat.com/browse/RHOAIENG-9806, as some preliminary stop-gap?

> some data from testing in a large cluster psi-04

Wouldn't data from small clusters be more relevant for the actual value we aim for?

Do we need for example CPU requests at all? What functionality will suffer when the Operator becomes CPU-starved?

And vice versa -- do we need to lower the limit at all?
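To make the CPU-starvation question concrete, here is a minimal sketch of how a CPU request is commonly translated into a cgroup v1 `cpu.shares` weight on the node. It assumes the conversion used by the kubelet on cgroup v1 (1024 shares per core, with a floor of 2); the function name and the sample request values are illustrative and are not taken from this PR.

```go
package main

import "fmt"

// Assumed cgroup v1 conversion constants (illustrative names).
const (
	minShares    = 2      // floor applied when no CPU request is set
	maxShares    = 262144 // upper bound accepted by the kernel
	sharesPerCPU = 1024   // shares corresponding to one full core
	milliPerCPU  = 1000   // millicores in one core
)

// milliCPUToShares converts a CPU request in millicores into a
// relative cpu.shares weight. A missing request (0) gets the floor.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := milliCPU * sharesPerCPU / milliPerCPU
	if shares < minShares {
		return minShares
	}
	if shares > maxShares {
		return maxShares
	}
	return shares
}

func main() {
	// Hypothetical request values, for illustration only.
	for _, m := range []int64{0, 100, 500} {
		fmt.Printf("requests.cpu=%dm -> cpu.shares=%d\n", m, milliCPUToShares(m))
	}
}
```

Under this assumption, a container without a CPU request still runs, but it only gets the minimum weight when the node's CPUs are contended, so the practical question is how slow reconciliation may become on a busy node.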

zdtsw commented 1 month ago

> Is this related to https://issues.redhat.com/browse/RHOAIENG-9806, as some preliminary stop-gap?
>
> some data from testing in a large cluster psi-04
>
> Wouldn't data from small clusters be more relevant for the actual value we aim for?
>
> Do we need for example CPU requests at all? What functionality will suffer when the Operator becomes CPU-starved?
>
> And vice versa -- do we need to lower the limit at all?

I would take a step-by-step approach: first see whether this makes the "large" cluster work, then fine-tune further to find the lower bound for a "small" cluster.

> Do we need for example CPU requests at all?

I am not sure I understand this question. Do you mean not setting requests.cpu at all? Then the operator pod would be among the first to be throttled or evicted. Is this what we want?

Keeping a high "limit" (what we have now) would not do much harm, I would say, but it impacts k8s node selection. Of course, if we are talking about SNO, I guess there is no need for such consideration: a lower or higher "limit" is the same.
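As a rough illustration of the eviction concern, the sketch below derives a Kubernetes QoS class from the requests and limits of a single container. It is a simplification that assumes a single-container pod and only CPU and memory; the `Resources` type and `qosClass` function are hypothetical names for this example, not code from the operator.

```go
package main

import "fmt"

// Resources holds requests and limits for one container.
// A zero value means "not set". Hypothetical type for illustration.
type Resources struct {
	CPURequestMilli int64
	CPULimitMilli   int64
	MemRequestBytes int64
	MemLimitBytes   int64
}

// qosClass returns the QoS class for a single-container pod:
// BestEffort when nothing is set, Guaranteed when both CPU and
// memory have limits and the requests equal those limits,
// and Burstable otherwise.
func qosClass(r Resources) string {
	if r.CPURequestMilli == 0 && r.CPULimitMilli == 0 &&
		r.MemRequestBytes == 0 && r.MemLimitBytes == 0 {
		return "BestEffort"
	}
	// An unset request defaults to the limit when a limit is set.
	cpuReq := r.CPURequestMilli
	if cpuReq == 0 {
		cpuReq = r.CPULimitMilli
	}
	memReq := r.MemRequestBytes
	if memReq == 0 {
		memReq = r.MemLimitBytes
	}
	if r.CPULimitMilli > 0 && r.MemLimitBytes > 0 &&
		cpuReq == r.CPULimitMilli && memReq == r.MemLimitBytes {
		return "Guaranteed"
	}
	return "Burstable"
}

func main() {
	// Placeholder values, not the numbers proposed in this PR.
	withRequests := Resources{
		CPURequestMilli: 100, CPULimitMilli: 500,
		MemRequestBytes: 256 << 20, MemLimitBytes: 1 << 30,
	}
	fmt.Println(qosClass(Resources{}))  // nothing set
	fmt.Println(qosClass(withRequests)) // requests below limits
}
```

BestEffort pods are typically the first eviction candidates under node memory pressure, so with requests below limits the operator stays Burstable regardless of how high the limit is set.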

VaishnaviHire commented 1 month ago

I agree with @adelton that we should use data from small clusters to set the defaults. The linked Jira issue has data from the PSAP team.

zdtsw commented 1 month ago

To be honest, when I started this PR, I did not know about this Jira ticket. The change came mainly from some tests we did for another case. Then I recalled we had an old issue about improving resource utilization, so I submitted this PR after we finalized certain tests.

One thing came to mind after reading your comments: for ticket https://issues.redhat.com/browse/RHOAIENG-9806, should we use the same data from the perf test in ODH? I would assume this data was collected from a downstream build. We can use it to set the values for downstream, but how do you feel about using the same values in ODH (if it is not just for the sake of keeping the code in sync)?

adelton commented 1 month ago

> One thing came to mind after reading your comments: for ticket https://issues.redhat.com/browse/RHOAIENG-9806, should we use the same data from the perf test in ODH? I would assume this data was collected from a downstream build. We can use it to set the values for downstream, but how do you feel about using the same values in ODH (if it is not just for the sake of keeping the code in sync)?

For the benefit of folks who might not have access to the internal information, it might be useful to get the data from an ODH installation and share it here or in some other public place, so that the reasons for the numerical changes are documented. I would assume the numbers from ODH and downstream don't differ much, so whether we can use and publish the numbers we got for downstream really depends on whether they are considered internal-only.

adelton commented 1 month ago

Is this also related to https://issues.redhat.com/browse/RHOAIENG-494?

zdtsw commented 1 month ago

> Is this also related to https://issues.redhat.com/browse/RHOAIENG-494?

I don't think so; it is more related to https://issues.redhat.com/browse/RHOAIENG-9806

adelton commented 1 month ago

> I don't think so; it is more related to https://issues.redhat.com/browse/RHOAIENG-9806

And specifically https://issues.redhat.com/browse/RHOAIENG-10889, it seems.

openshift-ci[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: adelton

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/opendatahub-io/opendatahub-operator/blob/incubation/OWNERS)~~ [adelton]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

opendatahub-io / opendatahub-operator

chore: tuning resource usage for operator pod #1120