/sig node
I reproduced a similar issue locally. I will investigate. /assign
I reproduced this issue as follows:

- Enable the `InPlacePodVerticalScaling` feature gate.
- Create a pod which has a container whose `resizePolicy` is `NotRequired` and a container whose `resizePolicy` for memory is `RestartContainer`.
- After the pod gets started, post a patch to resize containers. Then, post a patch to roll back the resources before the first resize completes:
kubectl patch pod resize-pod --patch '{"spec":{"containers":[{"name":"c1", "resources":{"requests":{"cpu":"50m","memory":"50Mi"},"limits":{"cpu":"150m","memory":"150Mi"}}},{"name":"c2", "resources":{"requests":{"cpu":"350m","memory":"350Mi"},"limits":{"cpu":"450m","memory":"450Mi"}}}]}}'
sleep 1
kubectl patch pod resize-pod --patch '{"spec":{"containers":[{"name":"c1", "resources":{"requests":{"cpu":"100m","memory":"100Mi"},"limits":{"cpu":"200m","memory":"200Mi"}}},{"name":"c2", "resources":{"requests":{"cpu":"200m","memory":"200Mi"},"limits":{"cpu":"300m","memory":"300Mi"}}}]}}'
Then, it takes about three minutes to complete the rollback.
This issue happened as follows:

- The first resize request was applied: c1 was resized and c2 was restarted with the new resources.
- When the second (rollback) request was processed, only c1, whose `resizePolicy` is `NotRequired`, was resized. … Problem-1
- In the reported pod status, the resources of c1 looked rolled back. In c2, its container ID was still the older one and the older container appeared to be running; however, the resize status stayed `InProgress`. … Problem-2
- In kubelet, the second resize of c1 was done: in c1, the second resize was completed, while c2 still looked resized by the first resize request.
- Eventually, kubelet resized c2 by restarting the container.

This looks like the same result as the test output in the issue description, where it took 2-3 minutes until c2 passed the cgroup values verification while c1 passed the verification soon.
Problem-1
I guess the second resize for c2 was skipped here:
https://github.com/kubernetes/kubernetes/blob/f386b4cd4a879e8e7c4c255900a755bd0a61f8f0/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L565
`apiContainerStatus` was based on the API pod status at the time the second resize was requested; at that point, the old container was still running as c2. On the other hand, `kubeContainerStatus` was the latest status from the runtime, where the new container was already running as c2. So the container IDs were different.
I don't think we should refer to the API pod status, which may be outdated.
Problem-2
I guess the pod status degradation was similarly caused. The pod status was overwritten with the old status here:
https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/kubelet/kubelet.go#L2863
`updatedPod` was created from an API pod whose container statuses were from when the second resize request was posted, while the first resize was still running. Even though the first resize had already completed in kubelet, the pod status was overwritten with the older status.
I wonder if it would be better to pass `apiPodStatus`, which is created with the latest runtime status, to `handlePodResourcesResize()` as an additional argument:
https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/kubelet/kubelet.go#L1763
https://github.com/kubernetes/kubernetes/blob/790dfdbe386e4a115f41d38058c127d2dd0e6f44/pkg/kubelet/kubelet.go#L1966
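As a rough illustration of that proposal, the sketch below uses hypothetical, simplified types (`PodStatus`, `Pod`, and this `handlePodResourcesResize` signature are not the real kubelet API): the handler receives the freshly generated status as an explicit argument instead of relying on the possibly stale status stored on the API pod object.

```go
package main

import "fmt"

// PodStatus is a simplified stand-in for v1.PodStatus.
type PodStatus struct {
	// ContainerResources maps a container name to the resources kubelet
	// believes are currently applied (simplified to a single string).
	ContainerResources map[string]string
}

// Pod is a simplified stand-in for an API pod object. Its Status field can
// lag behind what kubelet has already actuated on the node.
type Pod struct {
	Name   string
	Status PodStatus
}

// handlePodResourcesResize sketches the proposed shape: the freshly
// generated apiPodStatus (built from the latest runtime status) is passed
// in explicitly and is preferred over the possibly stale pod.Status.
func handlePodResourcesResize(pod *Pod, apiPodStatus PodStatus) PodStatus {
	fmt.Printf("stale status from API pod %s: %v\n", pod.Name, pod.Status.ContainerResources)
	fmt.Printf("status derived from runtime:  %v\n", apiPodStatus.ContainerResources)
	// Base the resize decision on apiPodStatus rather than pod.Status.
	return apiPodStatus
}

func main() {
	pod := &Pod{
		Name:   "resize-pod",
		Status: PodStatus{ContainerResources: map[string]string{"c2": "300Mi (recorded at request time)"}},
	}
	latest := PodStatus{ContainerResources: map[string]string{"c2": "450Mi (already applied by kubelet)"}}

	used := handlePodResourcesResize(pod, latest)
	fmt.Println("status used for the resize decision:", used.ContainerResources["c2"])
}
```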
/assign @esotsal
/triage accepted
Hi,
I believe we can define "quickly reverting" as the case where the second kubectl patch is sent before the c2 container has started. @hjet do you agree?
I'm investigating the issue more deeply now and looking at your comments @hshiina; I will comment back after I have done more tests in my lab.
Below are the results using the pod specs shared by @hshiina:
Delayed reconcile ( >1.5 minutes )
Timestamp | SyncLoop |
---|---|
11:19:35.417646 | UPDATE (1st patch) |
11:19:36.332750 | c2 ContainerDied |
11:19:36.471802 | UPDATE (2nd patch) |
11:19:37.337900 | c2 ContainerStarted |
11:22:02.588342 | c2 ContainerDied |
11:22:03.591893 | c2 ContainerStarted |
Normal reconcile ( <2 seconds )
Timestamp | SyncLoop |
---|---|
14:17:00.630121 | UPDATE (1st patch) |
14:17:01.558517 | c2 ContainerDied |
14:17:02.561038 | c2 ContainerStarted |
14:17:08.881515 | UPDATE (2nd patch) |
14:17:09.575167 | c2 ContainerDied |
14:17:10.579782 | c2 ContainerStarted |
Thanks for a very good analysis @hshiina,
I believe the root cause of the very long time, varying from 1 to 3 minutes, is the backoff. I've pushed #125757 as an attempt to fix this; please note it will still wait for kubelet's next reconcile loop (the interval is 1 minute), but that can be linked to the known issue #123940. What do you think?
@hjet can you please check if #125757 fixes your issue? Please note you will still need to wait for kubelet's next reconcile loop in your test code.
@hshiina reading your proposal, I am not sure whether it will still be needed once #123940 is fixed, or whether it should instead be discussed as part of #123940?
I don't think the problems I raised in this issue are related to #123940. Even if #123940 is solved, I guess the delay at step 10 in https://github.com/kubernetes/kubernetes/issues/125205#issuecomment-2143530104 will remain.
I don't think this case should be affected by #123940 because the `resizePolicy` of the container c2 is `RestartContainer`. Restarting a container causes PLEG events, so resources in container statuses should be updated immediately along with the restart.
I guess my problems are similar to #116970: which information should be referred to, the API or the runtime?
Thanks for sharing @hshiina. After reading the description of #116970, I added it to the SIG Node: In Place Pod Vertical Scaling project backlog, since it is stated to be a blocker for InPlacePodVerticalScaling moving to beta. /cc @tallclair @vinaykul
Until #116970 is resolved, what do you think about #125757, @hshiina? Is it worth it as a remedy to overcome delays due to backoff for this corner case? (i.e. not solving the 1 minute delay, but solving the case of waiting longer than kubelet's next reconcile loop because the backoff is not reset)
I'm not sure yet whether #125757 is related to this issue. However, I think it is worth working on #125757 because restarting along with a pod resize should not cause backoff. It might be better to create another issue for the backoff problem. This can be reproduced with the pod in https://github.com/kubernetes/kubernetes/issues/125205#issuecomment-2143527472:
for l in `seq 301 310`; do kubectl patch pod resize-pod --patch "{\"spec\":{\"containers\":[{\"name\":\"c2\", \"resources\":{\"limits\":{\"memory\":\"${l}Mi\"}}}]}}"; sleep 3; done
$ kubectl get pod resize-pod -w
NAME READY STATUS RESTARTS AGE
resize-pod 3/3 Running 0 14s
resize-pod 3/3 Running 0 21s
resize-pod 3/3 Running 1 (1s ago) 22s
resize-pod 3/3 Running 1 (3s ago) 24s
resize-pod 3/3 Running 2 (2s ago) 26s
resize-pod 3/3 Running 2 (3s ago) 27s
resize-pod 2/3 CrashLoopBackOff 2 (1s ago) 29s
resize-pod 2/3 CrashLoopBackOff 2 (2s ago) 30s
resize-pod 2/3 CrashLoopBackOff 2 (5s ago) 33s
resize-pod 2/3 CrashLoopBackOff 2 (6s ago) 34s
resize-pod 2/3 CrashLoopBackOff 2 (8s ago) 36s
resize-pod 2/3 CrashLoopBackOff 2 (11s ago) 39s
resize-pod 3/3 Running 3 (13s ago) 41s
<snip>
By the way, the backoff issue above does not happen on v1.30 or older. Before #124220 was merged, the stable key used for the backoff changed whenever the pod resource spec changed, so the backoff was effectively reset by a resize.
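As a rough sketch of that behavior (not the actual kubelet implementation; `backoffKey` and its hash inputs are illustrative), a backoff key derived from data that includes the container's resources changes on every resize, effectively resetting the backoff, while a key that excludes resources stays stable and lets the backoff accumulate across resize-triggered restarts:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// container is a simplified container spec.
type container struct {
	Name        string
	Image       string
	MemoryLimit string
}

// backoffKey derives a restart-backoff key for a container. When
// includeResources is true (mirroring the pre-#124220 behavior described
// above), resizing the container changes the key, so previous backoff state
// is no longer found and the backoff is effectively reset.
func backoffKey(podUID string, c container, includeResources bool) string {
	data := podUID + "/" + c.Name + "/" + c.Image
	if includeResources {
		data += "/" + c.MemoryLimit
	}
	sum := sha256.Sum256([]byte(data))
	return fmt.Sprintf("%x", sum[:8])
}

func main() {
	c := container{Name: "c2", Image: "busybox", MemoryLimit: "301Mi"}

	before := backoffKey("pod-uid-123", c, true)
	c.MemoryLimit = "302Mi" // in-place resize changes the spec
	after := backoffKey("pod-uid-123", c, true)
	fmt.Println("key includes resources, changes on resize:", before != after)

	// With a stable key (resources excluded), the same backoff entry is
	// reused across resizes, so CrashLoopBackOff-style delays accumulate.
	stableBefore := backoffKey("pod-uid-123", container{Name: "c2", Image: "busybox", MemoryLimit: "301Mi"}, false)
	stableAfter := backoffKey("pod-uid-123", container{Name: "c2", Image: "busybox", MemoryLimit: "302Mi"}, false)
	fmt.Println("stable key, unchanged on resize:", stableBefore == stableAfter)
}
```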
Thanks @hshiina , created https://github.com/kubernetes/kubernetes/issues/125843
The problems I raised can be reproduced without resizing c1 (whose `resizePolicy` is `NotRequired`). In addition, the problems can be caused by just resizing a container twice (not only by rolling back).
pod.yaml:
Operation:
kubectl patch pod resize-pod --patch '{"spec":{"containers":[{"name":"resize-container", "resources":{"limits":{"memory":"450Mi"}}}]}}'
sleep 1
kubectl patch pod resize-pod --patch '{"spec":{"containers":[{"name":"resize-container", "resources":{"limits":{"memory":"550Mi"}}}]}}'
Excluding the known issue #123940, the problem in this issue is: if a container whose `resizePolicy` is `RestartContainer` is requested to be resized twice quickly (within about one second), it takes one more minute until the second request is done.
I have posted #125958 for problems I mentioned.
I reproduced this issue as follows:

- Enable the `InPlacePodVerticalScaling` feature gate.
- Create a pod which has a container whose `resizePolicy` is `NotRequired` and a container whose `resizePolicy` for memory is `RestartContainer`:

```yaml
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  labels:
    run: resize-pod
  name: resize-pod
spec:
  containers:
  - image: busybox
    name: c1
    command:
    - sh
    - -c
    - trap "exit 0" SIGTERM; while true; do sleep 1; done
    resources:
      requests:
        cpu: 100m
        memory: 100Mi
      limits:
        cpu: 200m
        memory: 200Mi
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: NotRequired
  - image: busybox
    name: c2
    command:
    - sh
    - -c
    - trap "exit 0" SIGTERM; while true; do sleep 1; done
    resources:
      requests:
        cpu: 200m
        memory: 200Mi
      limits:
        cpu: 300m
        memory: 300Mi
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: RestartContainer
  - image: busybox
    name: c3
    command:
    - sh
    - -c
    - trap "exit 0" SIGTERM; while true; do sleep 1; done
    resources:
      requests:
        cpu: 300m
        memory: 300Mi
      limits:
        cpu: 400m
        memory: 400Mi
    resizePolicy:
    - resourceName: cpu
      restartPolicy: NotRequired
    - resourceName: memory
      restartPolicy: NotRequired
  restartPolicy: Always
```
- After the pod gets started, post a patch to resize containers. Then, post a patch to roll back the resources before the first resize completes:
kubectl patch pod resize-pod --patch '{"spec":{"containers":[{"name":"c1", "resources":{"requests":{"cpu":"50m","memory":"50Mi"},"limits":{"cpu":"150m","memory":"150Mi"}}},{"name":"c2", "resources":{"requests":{"cpu":"350m","memory":"350Mi"},"limits":{"cpu":"450m","memory":"450Mi"}}}]}}'
sleep 1
kubectl patch pod resize-pod --patch '{"spec":{"containers":[{"name":"c1", "resources":{"requests":{"cpu":"100m","memory":"100Mi"},"limits":{"cpu":"200m","memory":"200Mi"}}},{"name":"c2", "resources":{"requests":{"cpu":"200m","memory":"200Mi"},"limits":{"cpu":"300m","memory":"300Mi"}}}]}}'
Then, it takes about three minutes to complete the rollback.
@esotsal : With the resize subresource, we no longer allow resize via a normal pod patch. We need to pass the `--subresource=resize` flag to `kubectl patch`. #128296 added support for the resize subresource in kubectl. You will need to create a new kubectl binary that contains #128296.
New Steps:

1. Rebase the k8s repo to fetch commits from [[FG:InPlacePodVerticalScaling] Remove restrictions on subresource flag in kubectl commands #128296](https://github.com/kubernetes/kubernetes/pull/128296).
2. `make WHAT="cmd/kubectl"` to create a new kubectl binary.
3. Use `<path to the new kubectl binary> patch pod .... --subresource=resize` to resize pods.
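For reference, the same resize can also be issued programmatically. Below is a minimal client-go sketch, assuming a cluster whose API server supports the pod resize subresource, a default kubeconfig, and the `default` namespace; the patch body is just an example:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig from the default location (adjust the path as needed).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	patch := []byte(`{"spec":{"containers":[{"name":"c2","resources":{"limits":{"memory":"450Mi"}}}]}}`)

	// The trailing "resize" argument targets the pod's resize subresource,
	// mirroring kubectl's --subresource=resize flag.
	pod, err := clientset.CoreV1().Pods("default").Patch(
		context.TODO(), "resize-pod",
		types.StrategicMergePatchType, patch,
		metav1.PatchOptions{}, "resize",
	)
	if err != nil {
		panic(err)
	}
	fmt.Println("patched:", pod.Name)
}
```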
Thanks @AnishShah , will take a look on it.
Worked fine, using the following commands.
./kubectl patch pod resize-pod --subresource='resize' -p '{"spec":{"containers":[{"name":"c1", "resources":{"requests":{"cpu":"50m","memory":"50Mi"},"limits":{"cpu":"150m","memory":"150Mi"}}},{"name":"c2", "resources":{"requests":{"cpu":"350m","memory":"350Mi"},"limits":{"cpu":"450m","memory":"450Mi"}}}]}}'
sleep 1
./kubectl patch pod resize-pod --subresource='resize' -p '{"spec":{"containers":[{"name":"c1", "resources":{"requests":{"cpu":"100m","memory":"100Mi"},"limits":{"cpu":"200m","memory":"200Mi"}}},{"name":"c2", "resources":{"requests":{"cpu":"200m","memory":"200Mi"},"limits":{"cpu":"300m","memory":"300Mi"}}}]}}'
/close
@hjet a screencast demonstrating that the issue is fixed now (following the steps to reproduce) can be found here.
@esotsal: Closing this issue.
What happened?
While working on https://github.com/kubernetes/kubernetes/pull/125202 and testing the second case (patching a pod to perform an in-place resize and then quickly reverting the patch before the resize has been actuated), I've discovered some unexpected behavior. In this case the pod eventually reconciles but it takes about 3 minutes, with the following test output:
With the last couple of lines continuing indefinitely until it reads the "rolled back" value from `memory.max` in c2 (3 mins in my case). This also happens inconsistently; you may have to rerun the test case a couple of times to hit it. You can also run the full suite (`-ginkgo.focus="Feature:InPlacePodVerticalScaling"`) to hit it instead of the specific case provided below.

What did you expect to happen?
After patching forwards and patching backwards, the pod should reach its initial state, and `memory.max` in c2 should be set to its initial value.

How can we reproduce it (as minimally and precisely as possible)?
Using the test changes from https://github.com/kubernetes/kubernetes/pull/125202, keep the `patchAndVerifyAborted` call and comment out the `patchAndVerify` lines (so we only run the "aborted" case), and comment this line. Then run `e2e.test` with `-ginkgo.focus="Burstable QoS pod, three containers - decrease c1 resources" -num-nodes 2` (you can also run the full suite with `-ginkgo.focus="Feature:InPlacePodVerticalScaling"` - you may have to run it a couple of times).

Anything else we need to know?
This could also be an issue with my test changes, so please flag anything that seems suspicious. If you leave `patchAndVerify` uncommented (so patch -> wait for resize -> patch back -> wait for resize) and then run `patchAndVerifyAborted` (on the same case), you will hit a different bug, which I will file separately.