kubernetes / enhancements

Enhancements tracking repo for Kubernetes
Apache License 2.0
3.44k stars 1.48k forks source link

In-Place Update of Pod Resources #1287

Open vinaykul opened 5 years ago

vinaykul commented 5 years ago

Enhancement Description

Please to keep this description up to date. This will help the Enhancement Team track efficiently the evolution of the enhancement

  1. Identify CRI changes needed for UpdateContainerResources API, define response message for UpdateContainerResources

    • Extend UpdateContainerResources API to return info such as ‘not supported’, ‘not enough memory’, ‘successful’, ‘pending page evictions’ etc.
    • Define expected behavior for runtime when UpdateContainerResources is invoked. Define timeout duration of the CRI call.
      • Resolution: Separate KEP for CRI changes.
      • Discussed draft CRI changes with SIG-Node on Oct 22, and we agreed to do this as an incremental change outside the scope of this KEP, in a new mini-KEP. It does not block implementation of this KEP.
  2. Define behavior when multiple containers are being resized, and UpdateContainerResources fails for one or more containers.

    • One Possible solution:
      • Do not update Status.Resources.Limits if UpdateContainerResources API fails, and keep retrying until it succeeds.
  3. Check with API reviewers if we can keep maps instead list of named sub-objects for ResizePolicy.

    • After discussion with @liggitt , we are going to use list of named subobjects for extensibility.
  4. Can we find a more intuitive name for ResizePolicy?

  5. Can we use ResourceVersion to figure out the ordering of Pod resize requests?

  6. Do we need to add back the ‘RestartPod’ resize policy? Is there a strong use-case for it?

    • Resolution: No.
      • Discussed with SIG-Node on Oct 15th, not adding RestartPod policy for simplicity, will revisit if we encounter problems.

Alpha Feature Code Issues: These are Items and issues discovered during code review that need further discussion and need to be addressed before Beta.

  1. Can we figure out GetPodQOS differently once it is determined on pod create? See https://github.com/kubernetes/kubernetes/pull/102884#discussion_r663280487
  2. How do we deal with a pod that requests 1m/1m cpu requests/limits. See https://github.com/kubernetes/kubernetes/pull/102884#discussion_r662552642
  3. Add internal representation of ContainerStatus.Resources in kubeContainer. Convert it to ContainerStatus.Resources in kubelet_pods generate functions. See https://github.com/kubernetes/kubernetes/pull/102884#discussion_r662534632 and https://github.com/kubernetes/kubernetes/pull/102884#discussion_r663151422 and https://github.com/kubernetes/kubernetes/pull/102884#discussion_r663300123
  4. Can we get rid of resize mutex? Is there a better way to handle resize retries? See https://github.com/kubernetes/kubernetes/pull/102884#discussion_r663160060
  5. Can we recover from resize checkpoint store failures? See https://github.com/kubernetes/kubernetes/pull/102884#discussion_r663245975
  6. CRI clarification for ContainerStatus.Resources and how to handle runtimes that don't support it. See https://github.com/kubernetes/kubernetes/pull/102884#discussion_r663300347
  7. Add real values to dockershim test for ContainerStatus.Resources https://github.com/kubernetes/kubernetes/pull/102884#discussion_r662521121
    • Resolution: Not required due to dockershim deprecation.
  8. Change PodStatus.Resources from v1.ResourceRequirements to *v1.ResourceRequirements
    • Resolution: Fixed
  9. Address all places in the code that has 'TODO(vinaykul)'
  10. Current implementation does not work with node toploogy manager enabled. This limitation is not capturedi in the KEP. Add this to the release documentation for alpha, we will address this in beta. See https://github.com/kubernetes/kubernetes/pull/102884#discussion_r676806049
vinaykul commented 4 years ago

Hi @vinaykul !

Enhancements Lead here, do you still intend to target this for alpha in 1.20?

Thanks! Kirsten

Yes. Sorry I missed responding to this sooner, I've been working on another priority project.

The proposed API changes are not too disruptive and I believe I can pull it into 1.20 looking at the schedule. Please keep it tracked 1.20-alpha. Thanks

MorrisLaw commented 4 years ago

Hey @vinaykul -- 1.20 Enhancements Shadow here 👋

Time permitting, can you please update the KEP to the new format detailed here: https://github.com/kubernetes/enhancements/tree/master/keps/NNNN-kep-template?

This can be completed before the Enhancements Freeze deadline on October 6th.

Thank you for all of your work on this!!

Regards, Jeremy

MorrisLaw commented 4 years ago

Hi @vinaykul ,

Since your Enhancement is scheduled to be in 1.20, please keep in mind the important upcoming dates: Friday, Nov 6th: Week 8 - Docs Placeholder PR deadline Thursday, Nov 12th: Week 9 - Code Freeze

As a reminder, please link all of your k/k PR as well as docs PR to this issue so we can track them.

Best, Jeremy

somtochiama commented 4 years ago

Hello @vinaykul , 1.20 Docs shadow here 👋🏽. Does this enhancement work planned for 1.20 require any new docs or modification to existing docs?

If so, please follows the steps here to open a PR against dev-1.20 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Nov 6th

Also take a look at Documenting for a release to get yourself familiarize with the docs requirement for the release. Thank you!

somtochiama commented 4 years ago

Hi @vinaykul The docs placeholder deadline is almost here. Please make sure to create a placeholder PR against the dev-1.20 branch in the k/website before the deadline

Also, please keep in mind the important upcoming dates:

Thank you!

vinaykul commented 4 years ago

@kikisdeliveryservice @MorrisLaw @SomtochiAma Could you please move this enhancement out to the next release (track it towards 1.21). I've just not had the cycles to work on the updated design, which is still in a bit of flux, but I hope to work out the final details with Tim Hockin in the coming few weeks - he has some cycles now to work on finalizing it. Once finalized, I will implement it around late Nov or early Dec when the work pressure lets up a bit.

Thanks.

kikisdeliveryservice commented 4 years ago

Thanks for the update @vinaykul

vinaykul commented 3 years ago

Hi @kikisdeliveryservice @palnabarun

@thockin has been helping me re-tune the some of the API changes for this feature, and I believe we are quite close to finalizing the KEP update and will make the Feb 9th enhancement freeze. Although the March 9th code-freeze date might be a little tight at this point, I'm also still hopeful I can find the time to implement it.

I just want to confirm we are still tracking this for 1.21 release, and I've not missed crossing any proverbial t's regarding the process. Please let me know.

Thanks, Vinay

annajung commented 3 years ago

Hi @vinaykul 👋 1.21 Enhancements Lead here.

This enhancement is currently not being tracked for the 1.21 release. You will need to work with SIG node, SIG autoscaling, and SIG scheduling who are either sponsoring or participating in this enhancement to agree to opt-in for the release.

Once there is an agreement, please work with one of the leads from SIG node to enter the enhancement information into the 1.21 tracking sheet.

You still have a few more days before the enhancements freeze, Feb 9th.

Please ensure that the KEP is ready for the release by making sure that it has met the following criteria:

From the quick look at the KEP and open PR, it looks like it's missing the PRR requirements.

annajung commented 3 years ago

Hi @vinaykul, 1.21 Enhancements Lead here.

I'm clearing the milestone to reflect that this enhancement was not entered into the tracking sheet before the enhancements freeze, therefore not opting into the 1.21 release.

/milestone clear

ehashman commented 3 years ago

/milestone v1.22

reylejano commented 3 years ago

Hi @vinaykul,

1.22 Enhancement shadow checking in. The Enhancement freeze is coming up at 23:59:59 PST on Thursday 13th May. In reviewing your KEP, several things needs to be addressed:

If you have any questions, please reach out.

vinaykul commented 3 years ago

@reylejano @ehashman I just finished updating the KEPs to bring them in-sync with the latest templates you pointed me to. Please review.

The TODO is for beta. The test-cases list is very comprehensive. But once we implement, we will very likely find some cases that we did not think of, and we will be documenting it to keep in-sync with actual tests that are implemented.

reylejano commented 3 years ago

@vinaykul I see that sig-autoscaling and sig-scheduling are participating sigs for this enhancement. Do either of the sigs need to do any work for this enhancement for the 1.22 release?

ahg-g commented 3 years ago

Ack from sig scheduling to review related PRs.

vinaykul commented 3 years ago

@vinaykul I see that sig-autoscaling and sig-scheduling are participating sigs for this enhancement. Do either of the sigs need to do any work for this enhancement for the 1.22 release?

sig-scheduling will have review the code and sign-off on it. sig-autoscaling will consume it but that will come after this goes in. They will likely do it in next release.

reylejano commented 3 years ago

/sig scheduling

gjtempleton commented 3 years ago

Ack from sig autoscaling.

As @vinaykul said, this shouldn't require any work from our side other than consuming it once it's in place.

sftim commented 3 years ago

Folks, I've got a question about this even though the feature isn't yet in a release. I'm going to ask it anyway. Is there a way for a node to label itself as being capable of adjusting resources for a running Pod?

I'm asking because some Pods don't offer this. They might be running as bare-metal instances behind virtual kubelet, or use a container runtime backed by a partitioning hypervisor like Bao. How would the VPA (etc) know which Pods or Nodes have the relevant capability?

Maybe we can define a well-known label to indicate that a Node is likely to offer a RestartNotRequired resize policy?

reylejano commented 3 years ago

Hi @vinaykul , PR 2474 needs to merge by 23:59:59 pst today, May 13 for this enhancement to make the 1.22 Enhancements Freeze

reylejano commented 3 years ago

Hi @vinaykul, with the merge of #2474 , this enhancement is all set for the 1.22 Enhancements Freeze 🎉

vinaykul commented 3 years ago

Folks, I've got a question about this even though the feature isn't yet in a release. I'm going to ask it anyway. Is there a way for a node to label itself as being capable of adjusting resources for a running Pod?

I'm asking because some Pods don't offer this. They might be running as bare-metal instances behind virtual kubelet, or use a container runtime backed by a partitioning hypervisor like Bao. How would the VPA (etc) know which Pods or Nodes have the relevant capability?

Maybe we can define a well-known label to indicate that a Node is likely to offer a RestartNotRequired resize policy?

IIRC, unless things changed, VPA is opt-in. Please look into whether updateMode = "Off" or no VPA object works for you.

sftim commented 3 years ago

To follow-up the question, let's say I want to make sure that I can use the VPA for in-place updates. Can I specify in my pod template (eg for a Deployment) that the Pods from that template prefer to run on nodes that support this new feature? Will that be a future enhancement?

PI-Victor commented 3 years ago

Hello @vinaykul 👋, 1.22 Docs release lead here. This enhancement is marked as ‘Needs Docs’ for 1.22 release.

Please follow the steps detailed in the documentation to open a PR against dev-1.22 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Fri July 9, 11:59 PM PDT.
 Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release. 
Thank you!

vinaykul commented 3 years ago

Hello @vinaykul 👋, 1.22 Docs release lead here. This enhancement is marked as ‘Needs Docs’ for 1.22 release.

Please follow the steps detailed in the documentation to open a PR against dev-1.22 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Fri July 9, 11:59 PM PDT.
 Also, take a look at Documenting for a release to familiarize yourself with the docs requirement for the release. 
Thank you!

@PI-Victor I'll get to the documentation PR by next week.

The implementation PR (work in progress) for this KEP is https://github.com/kubernetes/kubernetes/pull/102884 and is out for review.

reylejano commented 3 years ago

Hi @vinaykul ,

With about 2 weeks until code freeze (July 8, 2021), Can you confirm if the following k/k PR is all that is needed for the implementation of this enhancement for the 1.22 release?

vinaykul commented 3 years ago

Hi @vinaykul ,

With about 2 weeks until code freeze (July 8, 2021), Can you confirm if the following k/k PR is all that is needed for the implementation of this enhancement for the 1.22 release?

@reylejano Not including doc PR (I'm yet to create a placeholder PR for it), yes this PR would be it for the feature.

There is another code contributor @wangchen615 who is working on E2E tests. Ideally, she would create a separate dependent PR directly to k/k but I don't think github allows it. So we have to funnel it through my PR.

reylejano commented 3 years ago

Hi @vinaykul ,

Just a reminder that the 1.22 code freeze starts next week on July 8, 2021. We're tracking the following PR for code freeze:

Also a reminder that the Docs Placeholder PR deadline is on July 9, 2021. Here are instructions on opening a PR against the dev-1.22 branch in the k/website repo

reylejano commented 3 years ago

Hi @vinaykul ,

The 1.22 code freeze starts in a few days on Thursday, July 8 at 18:00 PDT. The following PR and any other PR required for this enhancement for 1.22 has to merge by then:

The Docs Placeholder PR deadline is the day after code freeze on July 9, 2021. Here are instructions on opening a PR against the dev-1.22 branch in thek/website repo

This enhancement is currently at-risk of falling out of 1.22 with kubernetes/kubernetes#102884 open

JamesLaverack commented 3 years ago

Hi, v1.22 Enhancements Lead here. Unfortunately this enhancement has not met the requirements for code freeze as https://github.com/kubernetes/kubernetes/pull/102884 is unmerged and unapproved.

If you still wish to progress this enhancement in v1.22, then please file an exception request.

/milestone clear

salaxander commented 3 years ago

/milestone v1.23

Priyankasaggu11929 commented 3 years ago

Hi @vinaykul! 1.23 Enhancements team here. Just checking in as we approach enhancements freeze on Thursday 09/09. Here's where this enhancement currently stands:

Looks like for this one, we would just need to update the following:

Thanks!

vinaykul commented 3 years ago

@Priyankasaggu11929 Thanks for identifying the exact changes needed. I have update the issue description and the milestones in kep.yaml for this and the related KEPs. (PR https://github.com/kubernetes/enhancements/pull/2948 )

Please let me know if anything else is needed..

Priyankasaggu11929 commented 3 years ago

Looks all good now. Thanks so much for quickly making the changes, @vinaykul. :)

Priyankasaggu11929 commented 3 years ago

Thanks so much @vinaykul. With the KEP PR merged now, this enhancements is ready for the enhancements freeze. :)

jlbutler commented 3 years ago

Hi @vinaykul :wave: 1.23 Docs lead here.

This enhancement is marked as 'Needs Docs' for the 1.23 release.

Please follow the steps detailed in the documentation to open a PR against the dev-1.23 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thu November 18, 11:59 PM PDT.

Also, if needed take a look at Documenting for a release to familiarize yourself with the docs requirement for the release.

Thanks!

Priyankasaggu11929 commented 3 years ago

Hello @vinaykul 👋

Checking in once more as we approach 1.23 code freeze at 6:00 pm PST on Tuesday, November 16.

Please ensure the following items are completed:

As always, we are here to help should questions come up.

Thank you so much! 🙂

salaxander commented 2 years ago

Hi, 1.23 Enhancements Lead here 👋. With code freeze now in effect, this enhancement has not met the criteria for the freeze and has been removed from the milestone.

As a reminder, the criteria for code freeze is:

Feel free to file an exception to add this back to the release. If you plan to do so, please file this as early as possible.

Thanks! /milestone clear

Priyankasaggu11929 commented 2 years ago

Hello @vinaykul 👋, 1.24 Enhancements team here.

Just checking in as we approach enhancements freeze on 18:00pm PT on Thursday Feb 3rd, 2022. This enhancement is targeting alpha for 1.24, is this correct?

Here’s where this enhancement currently stands:

Looks like for this one, we would just need to update the following:

The status of this enhancement is track as at risk. Please update this issue description with appropriate stages as well. Thank you!

vinaykul commented 2 years ago

Hello @vinaykul 👋, 1.24 Enhancements team here. Looks like for this one, we would just need to update the following:

The status of this enhancement is track as at risk. Please update this issue description with appropriate stages as well. Thank you!

Done. PR https://github.com/kubernetes/enhancements/pull/3153

Priyankasaggu11929 commented 2 years ago

Thanks so much for the update @vinaykul!

Priyankasaggu11929 commented 2 years ago

All of the above criteria are satisfied now that the KEP PR https://github.com/kubernetes/enhancements/pull/3153 has been merged. This enhancement is ready for the upcoming 1.24 enhancements freeze. 🚀

For note, the status of this enhancement is now tracked now. Thank you so much!

nate-double-u commented 2 years ago

Hi @vinaykul :wave: 1.24 Docs lead here.

This enhancement is marked as Needs Docs for the 1.24 release.

Please follow the steps detailed in the documentation to open a PR against the dev-1.24 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday, March 31st, 2022 @ 18:00 PDT.

Also, if needed take a look at Documenting for a release to familiarize yourself with the docs requirement for the release.

Thanks!

Priyankasaggu11929 commented 2 years ago

Hello @vinaykul 👋

I'm just checking in once more as we approach the 1.24 Code Freeze on 18:00 PDT, Tuesday, March 29th 2022

Please ensure the following items are completed:

For note, the status of this enhancement is currently marked as at risk.

Kindly please let me know if I'm missing any open PRs other than the ones I linked above. Thank you so much!

valaparthvi commented 2 years ago

Hi @vinaykul :wave: 1.24 Release Comms team here.

We have an opt-in process for the feature blog delivery. If you would like to publish a feature blog for this issue in this cycle, then please opt in on this tracking sheet.

The deadline for submissions and the feature blog freeze is scheduled for 01:00 UTC Wednesday 23rd March 2022 / 18:00 PDT Tuesday 22nd March 2022. Other important dates for delivery and review are listed here: https://github.com/kubernetes/sig-release/tree/master/releases/release-1.24#timeline.

For reference, here is the blog for 1.23.

Please feel free to reach out any time to me or on the #release-comms channel with questions or comments.

Thanks!

gracenng commented 2 years ago

1.24 Code Freeze is in effect and this enhancement is removed from the release. Please feel free to file an exception

/milestone clear

parul5sahoo commented 2 years ago

Hello @vinaykul 👋, 1.25 Enhancements team here.

Just checking in as we approach enhancements freeze on 18:00 PST on Thursday June 16, 2022.

For note, This enhancement is targeting for stage alpha for 1.25 (correct me, if otherwise)

Here's where this enhancement currently stands:

Looks like for this one, we would need to update the following:

For note, the status of this enhancement is marked as at-risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

vinaykul commented 2 years ago

@parul5sahoo @Priyankasaggu11929 Please review https://github.com/kubernetes/enhancements/pull/3390

parul5sahoo commented 2 years ago

Thanks @vinaykul for updating the test plan will be reviewing it and updating the status accordingly

vinaykul commented 2 years ago

@parul5sahoo @Priyankasaggu11929 Do we have a tool to generate the list of "pkg: date - code-coverage" for the unit test section - either based on selection of packages and files a KEP is expected to touch or based on the files modified in the PR.

I tried using the dashboard page ( https://testgrid.k8s.io/sig-testing-canaries#ci-kubernetes-coverage-unit ) and that page is resource intensive - it blew up the memory on the chrome browser and crashed it, and was very slow with safari. I tried disabling auto refresh etc. Frankly I found that page unusable (and this is on a M1 MacBook Pro with 32G mem which is a beast that can run 5 VMs and deliver 3GB network traffic throughput on virtio-pci for a conference demo next week) . Besides, this seems like a lot of manual work for everyone which IMO wastes everyone's time.

Can this be automated to snapshot the before-and-after coverage on a PR instead of making it part of the KEP? The process of putting this info in the KEP seems disconnected and in a place where it can get stale.