xiaoxubeii commented 3 years ago

Enhancement Description

One-line enhancement description (can be used as a release note): Support memory qos with cgroups v2
Kubernetes Enhancement Proposal: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2570-memory-qos
Discussion Link: https://docs.google.com/document/d/1r9Fx2omdMLP8SJ59np-C6uBqeuS4TSbQc5ChmQWMYJ8/edit?usp=sharing
Primary contact (assignee): @xiaoxubeii
Responsible SIGs: node
Enhancement target (which target equals to which milestone):
- Alpha1 release target (x.y): v1.22
- Alpha2 release target (x.y): v1.27
- Beta release target (x.y): blocked for https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?usp=sharing
- Stable release target (x.y):
[x] Alpha1
- [x] KEP (k/enhancements) update PR(s):
- 2571
- [x] Code (k/k) update PR(s):
- kubernetes/kubernetes#102578
- kubernetes/kubernetes/pull/102970
- [x] Docs (k/website) update PR(s):
- https://github.com/kubernetes/website/pull/28566
[x] Alpha2
- [x] KEP (k/enhancements) update PR(s):
- https://github.com/kubernetes/enhancements/pull/3818
- [x] Code (k/k) update PR(s):
- https://github.com/kubernetes/kubernetes/pull/115371
- [x] Docs (k/website) update PR(s):
- https://github.com/kubernetes/website/pull/39853
- https://github.com/kubernetes/website/pull/43183

xiaoxubeii commented 3 years ago

/sig node

xiaoxubeii commented 3 years ago

/assign @xiaoxubeii

MadhavJivrajani commented 3 years ago

Hi! This sounds really interesting and I'd love to help out, please let me know how I can help out with this!

ehashman commented 3 years ago

/stage alpha /milestone v1.22

gracenng commented 3 years ago

Hi @xiaoxubeii 👋 1.22 Enhancements Shadow here.

This enhancement is in good shape, some minor change requests in light of Enhancement Freeze on Thursday May 13th:

Update kep.yaml file to the latest template
In kep.yaml, status is currently provisional instead of implementable
Alpha graduation criteria missing
KEP not merged to master

Thanks!

gracenng commented 3 years ago

Hi @xiaoxubeii 👋 1.22 Enhancements shadow here.

To help SIG's be aware of their workload, I just wanted to check to see if SIG-Node will need to do anything for this enhancement and if so, are they OK with it? Thanks!

xiaoxubeii commented 3 years ago

@gracenng Hey grace, I have updated necessary contents as follows:

[x] update kep.yaml for prr approval
[x] add Alpha graduation criteria

sig-node approvers @derekwaynecarr @mrunalp are reviewing for that. I am waiting for lgtm/approve and merge as implementable.

xiaoxubeii commented 3 years ago

@gracenng sig-node approvers(Derek and Mrunal) have gave lgtm/approve. There are few prr review requests, I have updated and am waiting for next review round. I think we can catch up with the freeze day :)

gracenng commented 3 years ago

Hi @xiaoxubeii , looks like your PRR was approved and the requested changes are all here. I have updated the status of this enhancement to tracked Thank you for keeping me updated!

xiaoxubeii commented 3 years ago

Thanks for your help. Also thanks very much to a lot of valuable review suggestions and helps from @derekwaynecarr @mrunalp @bobbypage @giuseppe @odinuge @johnbelamaric @ehashman Really appreciate that :)

ritpanjw commented 3 years ago

Hello @xiaoxubeii 👋 , 1.22 Docs Shadow here.

This enhancement is marked as Needs Docs for 1.22 release. Please follow the steps detailed in the documentation to open a PR against dev-1.22 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Fri July 9, 11:59 PM PDT. Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release.

Thank you!

xiaoxubeii commented 3 years ago

@ritpanjw OK, thanks for reminding.

gracenng commented 3 years ago

Hi @xiaoxubeii 🌞 1.22 enhancements shadow here.

In light of Code Freeze on July 8th, this enhancement current status is tracked, and we're currently tracking kubernetes/kubernetes#102578 kubernetes/kubernetes/pull/102970

Please let me know if there is other code PR associated with this enhancement.

Thanks

xiaoxubeii commented 3 years ago

Hi @xiaoxubeii 🌞 1.22 enhancements shadow here.

In light of Code Freeze on July 8th, this enhancement current status is tracked, and we're currently tracking kubernetes/kubernetes#102578 kubernetes/kubernetes/pull/102970

Please let me know if there is other code PR associated with this enhancement.

Thanks

@gracenng It is all here, thanks.

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

xiaoxubeii commented 2 years ago

This kep has been released.

pacoxu commented 1 year ago

This feature is alpha in v1.22. Any plan to make it beta?

BTW, I have some questions about the KEP.

about the memoryThrottlingFactor factor, this is tricky. The kernel ability is not fully exposed to user API(I mean memoryThrottlingFactor is a node level setting; the app owner cannot set it in pod level.). memory.high is like a soft limit and memory.max is a hard limit. Currently, there is only resource.limit. No resource.softlimit. https://github.com/kubernetes/enhancements/pull/2571#discussion_r631195702 @derekwaynecarr 's comment on it.

Meanwhile, the OOM is controlled by the kernel. If kubelet handles a pod that uses memory that exceeds the limit, it can easily add an OOMKilled event. For kernel killing, kubelet cannot get that directly. If memory.high==resource.lmit, kubelet can kill the pod instead of kernel oom killing.

Is there a performance issue if the throttle factor is too small? For example, some pods like Java will always use ~85% memory and memory.high will take effect continuously. The processes of the cgroup are throttled and put under heavy reclaim pressure. Will this be a risk of this feature?

memory.low vs memory.min: The kernel ability is not fully exposed to user API here as well. memory.low is like a soft limit and memory.request is a hard request. There is a comment in https://github.com/kubernetes/enhancements/pull/2571#pullrequestreview-658390067 @mrunalp. Currently, there is only resource.request. No resource.softrequest.

ehashman commented 1 year ago

This should not have been closed as the feature is merely alpha. It either needs to continue graduating or would be deprecated. There is more work to do in either case.

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

sftim commented 1 year ago

/remove-lifecycle rotten

What are the barriers to moving this forward to, for example, beta and off by default?

sftim commented 1 year ago

Should we freeze this issue?

(I'd assume than all alpha and beta features that ship in Kubernetes should have their KEP issues frozen, so that we continue to track the associated work)

xiaoxubeii commented 1 year ago

This feature is alpha in v1.22. Any plan to make it beta?

BTW, I have some questions about the KEP.

about the memoryThrottlingFactor factor, this is tricky. The kernel ability is not fully exposed to user API(I mean memoryThrottlingFactor is a node level setting; the app owner cannot set it in pod level.). memory.high is like a soft limit and memory.max is a hard limit. Currently, there is only resource.limit. No resource.softlimit. KEP-2570: Support memory qos with cgroups v2 #2571 (comment) @derekwaynecarr 's comment on it.

Yes. memory.high is more like a soft limit of memory, so we simply set (limits.memory or node allocatable memory)*memorThrottlingFactor in alpha version, which memorThrottlingFactor is 0.8 by default at this moment. In other words, we made (limits.memory or node allocatable memory) as resource.softlimit.

Meanwhile, the OOM is controlled by the kernel. If kubelet handles a pod that uses memory that exceeds the limit, it can easily add an OOMKilled event. For kernel killing, kubelet cannot get that directly. If memory.high==resource.lmit, kubelet can kill the pod instead of kernel oom killing.

kubelet will never KILL THE POD in this case, yet kernel does. Kernel will kill the container which memory usage is over limits.memory, set to memory.max in cgv2. There are no additional implementations of the kubelet in this KEP, other than setting correct values to cgv2.

Is there a performance issue if the throttle factor is too small? For example, some pods like Java will always use ~85% memory and memory.high will take effect continuously. The processes of the cgroup are throttled and put under heavy reclaim pressure. Will this be a risk of this feature? Yes, maybe. The default throttle factor which is .8 here is an experimental value. We will expose it to kubelet startup parameters when appropriate, maybe at beta stage.

memory.low vs memory.min: The kernel ability is not fully exposed to user API here as well. memory.low is like a soft limit and memory.request is a hard request. There is a comment in KEP-2570: Support memory qos with cgroups v2 #2571 (review) @mrunalp. Currently, there is only resource.request. No resource.softrequest.

Actually memory.low is not yet used in the KEP.

pacoxu commented 1 year ago

Just an idea on the factor.

code example(not tested yet): https://github.com/kubernetes/kubernetes/compare/master...pacoxu:cgroupv2-memory?expand=1

Current proposal is memory.high = memory.limit * kubelet memoryThrottlingFactor

However, this would be a problem when the requested memory is close to the memory limit, and the throttling is too easy to reach. When users want to set the factor smaller, it may make no throttling due to high<request not being accepted.

For instance, if the request memory is 500Mi and the limit is 1Gi and the factor is 0.6.
- Then the memory.high would be 600Mi. Just 100Mi higher than the requested memory.
Another instance, if the request memory is 800Mi, and the limit is 1Gi, for factor 0.6, no memory.high will be set on this case.
If request memory is not set, the high.memory can be assigned to 600Mi

A new proposal would be to make the factor based on the requested memory. memory.high = memory.request + (memory.limit - memory.request) * kubelet memoryThrottlingFactor With the first example above,

The memory.high would be 500Mi + 300Mi = 800Mi, which would be a better factor working still.
For the second instance, if request memory is 800Mi, and the limit is 1Gi, for factor 0.6, memory.high will be 920Mi.
If request memory is not set, the high.memory can be set to 600Mi. No change for this case.

Is this a better factor design? I'm not sure if there are other scenarios for throttling the memory.

Limit 1000Mi\ Request \ factor	current design: memory.high	my proposal: memory.high
request 0Mi factor 0.6	600Mi	600Mi
request 500Mi factor 0.6	600Mi	800Mi
request 800Mi factor 0.6	max	920Mi
request 1Gi factor 0.6	max	max
request 0Mi factor 0.8	800Mi	800Mi
request 500Mi factor 0.8	800Mi	900Mi
request 800Mi factor 0.8	max	960Mi
request 500Gi factor 0.4	max	700Mi
calculation memory.high method	`memory.limit * kubelet memoryThrottlingFactor`	`memory.request + (memory.limit - memory.request) * kubelet memoryThrottlingFactor`

giuseppe commented 1 year ago

Current proposal is memory.high = memory.limit * kubelet memoryThrottlingFactor

how to pick the magic number memoryThrottlingFactor? I don't think a single value would work fine for all workloads.

I think it is better if these settings are exposed to the user.

IMHO, memory.high should be mapped to something like softrequest.

bobbypage commented 1 year ago

I shared a bit more context about this feature in my and @mrunalp kubecon talk regarding cgroupv2.

One the interesting pieces feedback I received about the feature is that some folks may have applications which are quite latency / performance sensitive and are always using very close to the memory limit. In those cases, customers would prefer to not have a soft memory limit set so their application does not get impacted by kernel memory reclaim when hitting memory.high. So one thing to consider is if we need some way to opt-out on a per pod basis, or some similar mechanism.

pacoxu commented 1 year ago

One the interesting pieces feedback I received about the feature is that some folks may have applications which are quite latency / performance sensitive and are always using very close to the memory limit. In those cases, customers would prefer to not have a soft memory limit set so their application does not get impacted by kernel memory reclaim when hitting memory.high. So one thing to consider is if we need some way to opt-out on a per pod basis, or some similar mechanism.

Yes, especially for some Java applications, they may not need it and it is necessary to have an option per pod for this.

BTW, as you mentioned, the memory.high may have side effects on some applications, the default 0.8 memoryThrottlingFactor may be not a good default value. I would like to set the default to 0 which means no memory.high should be set by default or set to a higher value like 0.99 or 0.9 to avoid performance issues.

xiaoxubeii commented 1 year ago

One the interesting pieces feedback I received about the feature is that some folks may have applications which are quite latency / performance sensitive and are always using very close to the memory limit. In those cases, customers would prefer to not have a soft memory limit set so their application does not get impacted by kernel memory reclaim when hitting memory.high. So one thing to consider is if we need some way to opt-out on a per pod basis, or some similar mechanism.

Yes, especially for some Java applications, they may not need it and it is necessary to have an option per pod for this.

BTW, as you mentioned, the memory.high may have side effects on some applications, the default 0.8 memoryThrottlingFactor may be not a good default value. I would like to set the default to 0 which means no memory.high should be set by default or set to a higher value like 0.99 or 0.9 to avoid performance issues.

Like bobby said, we need find a way to opt-out it on a per pod level, which means users can enable or disable it by setting something like soft request in pod.

pacoxu commented 1 year ago

I tried to summarize things in https://docs.google.com/document/d/1p9awiXhu5f4mWsOqNpCX1W-bLLlICiABKU55XpeOgoA/edit?usp=sharing for discussions.

SergeyKanzhelev commented 1 year ago

I tried to summarize things in https://docs.google.com/document/d/1p9awiXhu5f4mWsOqNpCX1W-bLLlICiABKU55XpeOgoA/edit?usp=sharing for discussions.

We discussed it at SIG Node meeting on Dec 6th: https://docs.google.com/document/d/1Ne57gvidMEWXR70OxxnRkYquAoMpt56o75oZtg-OeBg/edit#bookmark=id.ph8vc9vhlcxy with the summary: “Collect feedback in the doc and then open a KEP update with alternatives”, discussion with timestamp: https://youtu.be/t3PcHj62f0c?t=686

pacoxu commented 1 year ago

This is now discussed in https://docs.google.com/document/d/1r-e_jWL5EllBgxtaNb9qI2eLOKf9wSuuuglTySWmfmQ/edit#heading=h.6nyswlzgrqq0.

I opened a draft PR https://github.com/kubernetes/kubernetes/pull/115371 accordingly.

mrunalp commented 1 year ago

/label lead-opted-in

mrunalp commented 1 year ago

/stage beta

mrunalp commented 1 year ago

/milestone v1.27

fsmunoz commented 1 year ago

Hello @xiaoxubeii 👋, v1.27 Enhancements team here.

Just checking in as we approach enhancements freeze on 18:00 PDT Thursday 9th February 2023.

This enhancement is targeting for stage beta for 1.27 (please correct me, if otherwise)

Here's where this enhancement currently stands:

[ ] KEP readme using the latest template has been merged into the k/enhancements repo.
[ ] KEP status is marked as implementable for latest-milestone: 1.27
[ ] KEP readme has a updated detailed test plan section filled out
[X] KEP readme has up to date graduation criteria
[ ] KEP has a production readiness review that has been completed and merged into k/enhancements.

For this KEP, we would need to update the following:

Update the KEP to use the latest template.
Update kep.yaml to reflect current stage information. This assumes that the description in the issue that states that the v1.27 is for beta is correct, and that the kep.yaml information that v1.27 is for alpha is not - if the opposite is true then no change would be needed in the file.
Update the test plan content (this will be implicitly done when updating to the latest KEP template)
Fix formatting in the Graduation Criteria section.
The PRR content doesn't include mandatory sections for beta, present in the latest template.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

johnbelamaric commented 1 year ago

@pacoxu @mrunalp @xiaoxubeii any updates here? time is running very short, please get a KEP PR up soon resolving the issues described above, if you still want to hit beta in 1.27.

pacoxu commented 1 year ago

@johnbelamaric We improved the design but think the feature should still be alpha in "v1.27"

Does this still need a PRR review?

bobbypage commented 1 year ago

Thanks @johnbelamaric. We do have a PR with the content you mentioned - https://github.com/kubernetes/enhancements/pull/3818.

Is there anything missing from that? And as @pacoxu, we aim to keep this in alpha for 1.27.

johnbelamaric commented 1 year ago

I'll take a look, I didn't see it before, sorry.

johnbelamaric commented 1 year ago

Ok, since you are staying in alpha and there are no material changes to the PRR content, no new review or approval is needed.

fsmunoz commented 1 year ago

Hi @xiaoxubeii @pacoxu , any updates on the above? I'm doing my best to move things into tracked. Thanks!

SergeyKanzhelev commented 1 year ago

PRR approved:

https://github.com/kubernetes/enhancements/blob/472a381fb32767bf65cbd28b62c04ee655cc2408/keps/prod-readiness/sig-node/2570.yaml#L1-L3

Implementable, stage, and latest milestone are set:

https://github.com/kubernetes/enhancements/blob/472a381fb32767bf65cbd28b62c04ee655cc2408/keps/sig-node/2570-memory-qos/kep.yaml#L14-L21

KEP readme needs to be updated to the latest template:

old test plan section is used: https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2570-memory-qos/README.md#test-plan
What it should look like: https://github.com/kubernetes/enhancements/blob/master/keps/NNNN-kep-template/README.md#test-plan

SergeyKanzhelev commented 1 year ago

@fsmunoz this one is good to go now

fsmunoz commented 1 year ago

Thank you @SergeyKanzhelev.

[X] KEP readme using the latest template has been merged into the k/enhancements repo.
[X] KEP status is marked as implementable for latest-milestone: 1.27
[X] KEP readme has a updated detailed test plan section filled out
[X] KEP readme has up to date graduation criteria
[X] KEP has a production readiness review that has been completed and merged into k/enhancements.

This enhancement is ready to be traced for graduation to ~~beta~~ alpha in v1.27.

/label tracked/yes

xiaoxubeii commented 1 year ago

@SergeyKanzhelev @fsmunoz It is decided to keep alpha in v1.27 but with several updates. Please note that :)

SergeyKanzhelev commented 1 year ago

/stage alpha

pacoxu commented 1 year ago

https://github.com/kubernetes/kubernetes/pull/115371 is submitted to sync the code with the KEP update in v1.27.

pacoxu commented 1 year ago

/assign

pacoxu commented 1 year ago

https://github.com/kubernetes/enhancements/pull/3818#issuecomment-1421729105 Another thought based on the kernel feature Introduce an memcg interface to trigger memory reclaim on a memory cgroup..

for evictions that are based on memory usage, it would be nice to trigger a reclaim before the eviction manager calculates the current pod memory usage with its request/limit.

fsmunoz commented 1 year ago

Hi @xiaoxubeii 👋,

Checking in as we approach 1.27 code freeze at 17:00 PDT on Tuesday 14th March 2023.

Please ensure the following items are completed:

[X] All PRs to the Kubernetes repo that are related to your enhancement are linked in the above issue description (for tracking purposes).
[X] All PRs are fully merged by the code freeze deadline.
https://github.com/kubernetes/kubernetes/pull/102578
https://github.com/kubernetes/kubernetes/pull/102970

For this enhancement, it looks like all linked PRs are merged.

Please let me know what other PRs in k/k I should be tracking for this KEP.

As always, we are here to help should questions come up. Thanks!

mickeyboxell commented 1 year ago

@xiaoxubeii the PR in the description looks like an old Docs PR from 1.22: https://github.com/kubernetes/website/pull/28566. Is there a newer PR for 1.27? Yesterday was the PRs ready for review deadline. If there's not a PR for 1.27, please create and populate one as soon as possible.

kubernetes / enhancements

Support memory qos with cgroups v2 #2570

Enhancement Description

2571