kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

Exclusive CPUs not removed from deleted Pod and put back in the defaultCPUSet. #107074

Open klueska opened 2 years ago

klueska commented 2 years ago

What happened?

The CPUManager has logic to periodically clean up stale state and reclaim exclusive CPUs from pods that have recently terminated. It does this by querying the system for a list of activePods() and reclaiming CPUs from any pods it is tracking that are not in this list.

This works fine for most pods, but special care needs to be taken to ensure that CPUs are not accidentally reclaimed from pods that have not started yet. Allocation of CPUs to the containers of a pod happens during pod admission (i.e. before the pod is added to the activePods() list), so a simple state variable (pendingAdmissionPod) is used to indicate which pod is currently being admitted and exclude it from cleanup. Since pod admission is serialized, only one pod will ever be pending admission at any given time, and only a single variable is necessary to track this (i.e. whenever a new pod enters the admission loop to have exclusive CPUs granted to it, pendingAdmissionPod is overwritten to point to the new pod, clearing the way for the previous one to have its state cleaned up when appropriate).

Unfortunately, this simple procedure can cause problems because pendingAdmissionPod is never reset to nil after the last pod is admitted. This is usually fine, because the next time a pod comes in for admission, it will be overwritten to point to the new pod. But if no new pods come in, it continues to point to the last pod we attempted to admit (essentially forever), making it impossible to clean up that pod's state if it gets deleted at some point in the future (because it is always treated as an "active" pod so long as it is pointed to by pendingAdmissionPod).
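To make this concrete, here is a small, self-contained toy model of the behavior described above (plain Go, not the actual CPUManager code; all names, types, and helpers are simplified for illustration):

```go
package main

import "fmt"

// manager is a toy stand-in for the CPUManager; it is not kubelet code.
type manager struct {
	defaultCPUs         map[int]bool     // shared pool of non-exclusive CPUs
	assignments         map[string][]int // pod name -> exclusively assigned CPUs
	activePods          func() []string  // pods the kubelet currently considers active
	pendingAdmissionPod string           // last pod seen by admission; never reset
}

// admit mimics CPU allocation during pod admission, i.e. before the pod
// shows up in activePods().
func (m *manager) admit(pod string, cpus []int) {
	m.pendingAdmissionPod = pod
	m.assignments[pod] = cpus
	for _, c := range cpus {
		delete(m.defaultCPUs, c)
	}
}

// removeStaleState reclaims CPUs from tracked pods that are neither active
// nor currently pending admission.
func (m *manager) removeStaleState() {
	keep := map[string]bool{m.pendingAdmissionPod: true}
	for _, p := range m.activePods() {
		keep[p] = true
	}
	for pod, cpus := range m.assignments {
		if !keep[pod] {
			for _, c := range cpus {
				m.defaultCPUs[c] = true // return the CPU to the shared pool
			}
			delete(m.assignments, pod)
		}
	}
}

func main() {
	active := []string{}
	m := &manager{
		defaultCPUs: map[int]bool{0: true, 1: true, 2: true, 3: true},
		assignments: map[string][]int{},
		activePods:  func() []string { return active },
	}

	m.admit("test-pod", []int{1, 2}) // admitted with exclusive CPUs 1 and 2
	active = []string{"test-pod"}    // ...and then becomes active

	active = []string{} // test-pod is deleted, but no new pod is ever admitted
	m.removeStaleState()

	// CPUs 1 and 2 are never returned, because test-pod is still the
	// pendingAdmissionPod and is therefore treated as active forever.
	fmt.Println(m.defaultCPUs) // map[0:true 3:true]
}
```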

I don't think this issue is critical, since in any practical setting pods will be started and stopped all the time, clearing the way for the state of previously admitted pods to be cleaned up. But we should consider a better method of tracking pods that are notYetActiveButPendingAdmission() so that we can eliminate this weird edge case.
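For illustration only (not necessarily the fix the project will adopt), one direction would be to stop shielding the pending pod once it actually shows up in activePods(). In the toy removeStaleState above, that would look roughly like:

```go
// Illustrative tweak to the toy removeStaleState above, not actual kubelet code:
// once the pending pod has become active it no longer needs special protection,
// so clear the marker and let normal cleanup handle it later.
for _, p := range m.activePods() {
	if p == m.pendingAdmissionPod {
		m.pendingAdmissionPod = ""
		break
	}
}
```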

What did you expect to happen?

With --cpu-manager-policy=static enabled on a node.

Look at the CPU set assigned to one of the system pods running on that node in a non-Guaranteed QoS class:

$ kubectl exec -it <pod> -- taskset -cp 1
pid 1's current affinity list: 0-255

Create a pod requesting exclusive CPUs (a Guaranteed QoS pod with an integer CPU request, so the static policy assigns it dedicated CPUs):

$ cat pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  restartPolicy: Never
  containers:
  - image: ubuntu:20.04
    name: test-ctr
    command: ["bash", "-c"]
    args: ["sleep 99999"]
    resources:
      limits:
        cpu: 4000m
        memory: 1Gi
$ kubectl apply -f pod.yaml
pod/test-pod created

Look at the set of exclusive CPUs granted to it (we should see 4):

$ kubectl exec -it test-pod -- taskset -cp 1
pid 1's current affinity list: 1,2,129,130

Look again at the CPU set assigned to the pod in the non-Guaranteed QoS class (the 4 CPUs from above are gone):

$ kubectl exec -it <pod> -- taskset -cp 1
pid 1's current affinity list: 0,3-128,131-255

Delete the test pod:

$ kubectl delete pod test-pod
pod "test-pod" deleted

Look again at the CPU set assigned to the pod in the non-Guaranteed QoS class (we are back to the original set):

$ kubectl exec -it <pod> -- taskset -cp 1
pid 1's current affinity list: 0-255

How can we reproduce it (as minimally and precisely as possible)?

Go through the steps above. Because test-pod is the last pod to go through admission (it remains the pendingAdmissionPod), the last step does not show the original CPU set restored:

$ kubectl exec -it <pod> -- taskset -cp 1
pid 1's current affinity list: 0,3-128,131-255
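One way to confirm the stale state directly on the node (assuming the kubelet's default state directory; the path may differ on your system) is to inspect the CPUManager checkpoint file, which should still list an entry for the deleted pod and show its CPUs missing from defaultCpuSet:

$ cat /var/lib/kubelet/cpu_manager_state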

Anything else we need to know?

No response

Kubernetes version

```console
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:32:41Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"linux/amd64"}
```

Cloud provider

Baremetal

OS version

No response

Install tools

No response

Container runtime (CRI) and version (if applicable)

No response

Related plugins (CNI, CSI, ...) and versions (if applicable)

No response

klueska commented 2 years ago

/sig node

cynepco3hahue commented 2 years ago

/cc @cynepco3hahue

SergeyKanzhelev commented 2 years ago

/triage accepted
/priority important-longterm

hj-johannes-lee commented 2 years ago

Hello, I am trying to find a sig/node-labeled issue I can contribute to. Though it doesn't have a "help wanted" or "good first issue" label, can I try to contribute something for this issue?

@bart0sh suggested this issue to me (and I believe he will help me a lot ;) ), and it seems interesting to me!

I understand the issue itself fully, as well as some related parts of cpu_manager.go. But since I am not really experienced with the Kubernetes project yet, I may need help!

cynepco3hahue commented 2 years ago

Feel free to assign it to yourself, but take into consideration that we are dependent on @smarterclayton's work that he mentioned here: https://github.com/kubernetes/kubernetes/pull/103979#issuecomment-904836223

hj-johannes-lee commented 2 years ago

Ah, I see! Thanks!

/assign

hj-johannes-lee commented 2 years ago

I would like to know which code in which file makes a pod appear in the list returned by activePods(). I have tried many things, but I don't think I can make progress without knowing that. Can anyone help me?

Also, @smarterclayton, can you explain once more what would be included in activePods()? The comment @cynepco3hahue mentioned above is somewhat unclear to me (since I do not know what the current activePods() lists).

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

vaibhav2107 commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

vaibhav2107 commented 2 years ago

/remove-lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle stale
- Mark this issue or PR as rotten with /lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 1 year ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 1 year ago

The issue has been marked as an important bug and triaged. Such issues are automatically marked as frozen when hitting the rotten state to avoid missing important bugs.

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle frozen

likakuli commented 1 year ago

I think it's a bug, and this problem also exists in 1.28.

k8s-triage-robot commented 1 month ago

This issue has not been updated in over 1 year, and should be re-triaged.

You can:

- Confirm that this issue is still relevant with /triage accepted (org members only)
- Close this issue with /close

For more details on the triage process, see https://www.kubernetes.dev/docs/guide/issue-triage/

/remove-triage accepted