kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

kubelet prober: infinite Readiness checks, no Liveness probe restart defeating self-heal #123778

Closed: AbeOwlu closed this issue 2 weeks ago

AbeOwlu commented 6 months ago

What happened?

Pod (container) readiness and liveness probes are non-blocking routines: even while the readiness probe is failing, the liveness probe keeps running and can trigger a restart and possibly self-heal.

However, I encountered a case where the readiness probe keeps failing indefinitely while the liveness probe never triggers a restart, so the pod never self-heals.
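
This stuck state can be confirmed from the pod's container status: the container stays not-ready while its restart count never increases. A minimal sketch of how to check (the pod name `liveness-http` comes from the repro manifest below; everything else is standard kubectl):

```console
# Watch the pod: READY stays 0/1 and RESTARTS stays 0 if the liveness probe never fires
$ kubectl get pod liveness-http -w

# Inspect the relevant container status fields directly
$ kubectl get pod liveness-http \
    -o jsonpath='ready={.status.containerStatuses[0].ready} restarts={.status.containerStatuses[0].restartCount}{"\n"}'
```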

What did you expect to happen?

How can we reproduce it (as minimally and precisely as possible)?

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: liveness
  name: liveness-http
spec:
  containers:
  - name: liveness
    image: registry.k8s.io/liveness
    args:
    - /server
    livenessProbe:
      failureThreshold: 5
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome
      initialDelaySeconds: 300   # reproduced with both 60 and 300
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5
    readinessProbe:
      failureThreshold: 3
      httpGet:
        path: /healthz
        port: 8080
        httpHeaders:
        - name: Custom-Header
          value: Awesome
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
  restartPolicy: Always
```
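
Under the behaviour described in this issue, only readiness probe failures would show up in the pod's events and no liveness-triggered restart would ever appear. A rough sketch of watching for that (the file name `liveness-http.yaml` is simply whatever the manifest above is saved as):

```console
$ kubectl apply -f liveness-http.yaml
$ kubectl get events --field-selector involvedObject.name=liveness-http -w
# Expected while only readiness is failing (approximate messages):
#   Warning  Unhealthy  Readiness probe failed: ...
# A liveness-driven self-heal would additionally show something like:
#   Warning  Unhealthy  Liveness probe failed: ...
#   Normal   Killing    Container liveness failed liveness probe, will be restarted
```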

Anything else we need to know?

No response

Kubernetes version

```console
$ kubectl version
Client Version: v1.29.1
Kustomize Version: v5.0.4...
Server Version: v1.29.1...-eks-...
```

Cloud provider

EKS

OS version

```console
# On Linux:
$ cat /etc/os-release
NAME="Amazon Linux"
VERSION="2"
ID="amzn"
ID_LIKE="centos rhel fedora"
VERSION_ID="2"
PRETTY_NAME="Amazon Linux 2"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2"
HOME_URL="https://amazonlinux.com/"

$ uname -a
Linux ....compute.internal 5.10.198-187.748.amzn2.x86_64 #1 SMP Tue Oct 24 19:49:54 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

Container runtime (CRI) and version (if applicable)

Related plugins (CNI, CSI, ...) and versions (if applicable)

k8s-ci-robot commented 6 months ago

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
AbeOwlu commented 6 months ago

/sig Node

AnishShah commented 6 months ago

@AbeOwlu, to reproduce this issue, how do you remove the IP assigned to a pod externally and force node IPAM to re-sync? Is this an issue with the AWS VPC CNI?

/triage needs-information

AbeOwlu commented 6 months ago

Hi @AnishShah, thanks for looking into this...
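
The reply above is truncated. Purely as an illustration of what "removing the IP assigned to a pod externally" could look like, one hypothetical way to simulate it from the node (not confirmed as the actual trigger in this report; assumes shell access to the EKS node, containerd with crictl installed, and the `liveness` container from the repro manifest):

```console
# Find the container and its PID on the node
$ sudo crictl ps --name liveness -q
$ sudo crictl inspect <container-id> | grep -i '"pid"'

# Drop the pod IP from the pod's network namespace, leaving kubelet/CNI state stale
$ sudo nsenter -t <pid> -n ip addr show eth0
$ sudo nsenter -t <pid> -n ip addr del <pod-ip>/<prefix> dev eth0
```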

k8s-triage-robot commented 3 months ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle stale`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 months ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue as fresh with `/remove-lifecycle rotten`
- Close this issue with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

AnishShah commented 1 month ago

sig-node triage meeting:

@AbeOwlu, what state is the pod in? Can you share the output of `kubectl describe pod`? Also, can you share the kubelet and containerd logs so we can debug further?
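
For reference, a minimal sketch of collecting the requested diagnostics (pod name taken from the repro manifest; the journalctl unit names assume a standard EKS Amazon Linux 2 node running kubelet and containerd as systemd services):

```console
# Pod state, probe events, and container status
$ kubectl describe pod liveness-http
$ kubectl get pod liveness-http -o yaml

# On the node hosting the pod
$ journalctl -u kubelet --since "1 hour ago" | grep -iE 'prober|readiness|liveness'
$ journalctl -u containerd --since "1 hour ago"
```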

k8s-triage-robot commented 2 weeks ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue with `/reopen`
- Mark this issue as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

k8s-ci-robot commented 2 weeks ago

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to [this](https://github.com/kubernetes/kubernetes/issues/123778#issuecomment-2322091209):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue with `/reopen`
> - Mark this issue as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage][1]
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close not-planned
>
> [1]: https://www.kubernetes.dev/docs/guide/issue-triage/

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.