
readiness probe is not scheduled as configured periodSeconds #118815

Open vkatabat opened 1 year ago

vkatabat commented 1 year ago

What happened?

The pod manifest below was deployed. From the probe logs we can see that the probe is scheduled more than once within the 30-second periodSeconds configured for the readinessProbe; at times it is scheduled twice within 5 seconds even though periodSeconds is set to 30.

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: readiness
  name: ready1
spec:
  containers:
```

```console
/ # cat /tmp/probe.sh
#!/bin/sh
echo "Probe at $(date)" >> /tmp/probe.log
exit 1
/ # tail -f /tmp/probe.log
Probe at Fri Jun 16 07:16:16 UTC 2023
Probe at Fri Jun 16 07:16:35 UTC 2023
Probe at Fri Jun 16 07:17:05 UTC 2023
Probe at Fri Jun 16 07:17:19 UTC 2023
Probe at Fri Jun 16 07:17:35 UTC 2023
```

What did you expect to happen?

Expect the readinessProbe to be scheduled accurately according to the configured periodSeconds.

How can we reproduce it (as minimally and precisely as possible)?

Create the pod from the YAML above in Kubernetes 1.27, with the probe script logging the output of the date command and exiting 1.

Anything else we need to know?

When the probe was successful, scheduling was accurate.

Kubernetes version

```console
$ kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.1", GitCommit:"c978334577baba04da0671abbcd788e510503453", GitTreeState:"clean", BuildDate:"2023-06-18T06:28:15Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.1", GitCommit:"c978334577baba04da0671abbcd788e510503453", GitTreeState:"clean", BuildDate:"2023-06-18T06:17:43Z", GoVersion:"go1.20.3", Compiler:"gc", Platform:"linux/amd64"}
```

Cloud provider

```console
$ sudo dmidecode | grep -i -e manufacturer -e product -e vendor
	Vendor: SeaBIOS
	Manufacturer: OpenStack Foundation
	Product Name: OpenStack Nova
	Manufacturer: QEMU
	Manufacturer: QEMU
	Manufacturer: QEMU
	Manufacturer: QEMU
```

OS version

```console
$ cat /etc/os-release
NAME="SLES"
VERSION="15-SP4"
VERSION_ID="15.4"
PRETTY_NAME="SUSE Linux Enterprise Server 15 SP4"
ID="sles"
ID_LIKE="suse"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:15:sp4"
DOCUMENTATION_URL="https://documentation.suse.com/"
$ uname -a
Linux master-2-test10 5.14.21-150400.24.63-default #1 SMP PREEMPT_DYNAMIC Tue May 2 15:49:04 UTC 2023 (fd0cc4f) x86_64 x86_64 x86_64 GNU/Linux
```

Install tools

Container runtime (CRI) and version (if applicable)

```console
$ /usr/local/bin/containerd --version
containerd github.com/containerd/containerd v1.7.0 1fbd70374134b891f97ce19c70b6e50c7b9f4e0d
```

Related plugins (CNI, CSI, ...) and versions (if applicable)

vkatabat commented 1 year ago

/sig k8s-infra

SataQiu commented 1 year ago

FYI: https://github.com/kubernetes/kubernetes/blob/ba6d2674ca6b7b4992189bba1516062004f05ed3/pkg/kubelet/prober/prober_manager.go#L283-L288
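
For context, the linked lines are where the kubelet nudges the readiness worker outside of its normal periodSeconds schedule. Below is a paraphrased sketch of that spot from memory of that revision, so treat it as an approximation and see the permalink for the exact code:

```go
// Paraphrased sketch of the linked prober_manager.go lines (approximate): when
// the kubelet updates pod status and finds a running container whose readiness
// result is not yet Success, it asks the readiness worker for an immediate
// probe run via a non-blocking send on the worker's manual-trigger channel,
// rather than waiting for the next periodSeconds tick.
select {
case w.manualTriggerCh <- struct{}{}:
default: // Non-blocking; a busy worker is simply skipped.
	klog.InfoS("Failed to trigger a manual run", "probe", w.probeType.String())
}
```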

mochizuki875 commented 1 year ago

@vkatabat Hi, I tried to reproduce it, but it doesn't reproduce and the probe seems to work well. I used the manifest below; does it reproduce in your environment? If you see anything odd, please point it out :)

probe-pod.yaml

```yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: readiness
  name: ready1
spec:
  containers:
    - name: readiness
      image: busybox
      command: ["top"]
      readinessProbe:
        exec:
          command:
          - /mnt/probe.sh
        initialDelaySeconds: 3
        failureThreshold: 1
        successThreshold: 1
        periodSeconds: 30
      volumeMounts:
      - name: probe-sh
        mountPath: "/mnt"
        readOnly: false
  volumes:
  - name: probe-sh
    configMap:
      name: probe-sh
      defaultMode: 0777

---
apiVersion: v1
kind: ConfigMap
metadata:
  name: probe-sh
data:
  probe.sh: |
    #!/bin/sh
    echo Probe at $(date) > /tmp/probe.txt
    exit 1
```

Apply this manifest.

```console
$ kubectl apply -f probe-pod.yaml
pod/ready1 created
configmap/probe-sh created
```

After that, the probe has failed.

```console
$ kubectl describe po ready1
Name:             ready1
Namespace:        default
...
Events:
  Type     Reason     Age                From               Message
  ----     ------     ----               ----               -------
  Normal   Scheduled  77s                default-scheduler  Successfully assigned default/ready1 to kind-worker
  Normal   Pulling    76s                kubelet            Pulling image "busybox"
  Normal   Pulled     75s                kubelet            Successfully pulled image "busybox" in 1.401711187s (1.401724144s including waiting)
  Normal   Created    75s                kubelet            Created container readiness
  Normal   Started    75s                kubelet            Started container readiness
  Warning  Unhealthy  16s (x3 over 72s)  kubelet            Readiness probe failed:
```

Checking the file, only one record has been added.

```console
$ kubectl exec -it ready1 -- /bin/sh
/ # tail /tmp/probe.txt 
Probe at Mon Jun 26 08:40:01 UTC 2023
```

My environment is k8s v1.27.3.

```console
$ kubectl version --short
Flag --short has been deprecated, and will be removed in the future. The --short output will become the default.
Client Version: v1.27.3
Kustomize Version: v5.0.1
Server Version: v1.27.3
```

Thanks.

vkatabat commented 1 year ago

Hi @mochizuki875, I think you used ">" instead of append (">>") in /mnt/probe.sh, so only the last probe run is visible in the file. You can see in the logs below that it got scheduled twice within 5 seconds, at 12:48:21 and 12:48:26.

Log in my environment:

```console
/ # cat /mnt/probe.sh
#!/bin/sh
echo Probe at $(date) >> /tmp/probe.log
exit 1
/ # tail -f /tmp/probe.log
Probe at Mon Jun 26 12:47:51 UTC 2023
Probe at Mon Jun 26 12:48:21 UTC 2023   <--
Probe at Mon Jun 26 12:48:26 UTC 2023   <--
Probe at Mon Jun 26 12:48:51 UTC 2023
Probe at Mon Jun 26 12:49:21 UTC 2023
Probe at Mon Jun 26 12:49:45 UTC 2023
Probe at Mon Jun 26 12:49:51 UTC 2023
```

Below, it got executed twice within 1 second.

```console
Probe at Mon Jun 26 13:01:21 UTC 2023
Probe at Mon Jun 26 13:01:21 UTC 2023
```

mochizuki875 commented 1 year ago

@vkatabat Oops! Sorry, that was my mistake; thank you for pointing it out. I was able to reproduce it as you described and will do some research.

mochizuki875 commented 1 year ago

From reading the code, this appears to be intentional. The readinessProbe is executed by two triggers:

https://github.com/kubernetes/kubernetes/blob/0a82bdbfdb1b8dd644d242b39a257d14fa5bf6b0/pkg/kubelet/prober/worker.go#L179-L189

ref: https://github.com/kubernetes/kubernetes/issues/96731, https://github.com/kubernetes/kubernetes/pull/98376
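
To make the two triggers concrete, here is a minimal, self-contained Go sketch of the loop structure in the linked worker.go lines. It only illustrates the mechanism and is not the kubelet code itself: the channel and ticker names mirror the real ones, but the concrete period, the simulated manual trigger, and the stop timing are invented for the demo.

```go
// Minimal, self-contained sketch (not the actual kubelet code) of the
// two-trigger structure in pkg/kubelet/prober/worker.go: the worker runs the
// probe on every periodSeconds tick, but a separate "manual trigger" channel
// can also wake it up early, which is why two probe runs can land well inside
// a single period.
package main

import (
	"fmt"
	"time"
)

func main() {
	period := 3 * time.Second // stand-in for periodSeconds
	probeTicker := time.NewTicker(period)
	defer probeTicker.Stop()

	manualTriggerCh := make(chan struct{}, 1)
	stopCh := make(chan struct{})

	// Simulate the prober manager nudging the worker between ticks, the way
	// UpdatePodStatus does for running containers that are not yet ready.
	go func() {
		time.Sleep(4 * time.Second)
		select {
		case manualTriggerCh <- struct{}{}: // non-blocking send, as in the kubelet
		default:
		}
	}()

	// Stop the demo after ten seconds.
	go func() {
		time.Sleep(10 * time.Second)
		close(stopCh)
	}()

	doProbe := func() { fmt.Println("probe at", time.Now().Format("15:04:05")) }

	// Same shape as the linked worker loop: probe once, then wait for
	// whichever trigger fires first and probe again.
	doProbe()
	for {
		select {
		case <-stopCh:
			return
		case <-probeTicker.C:
		case <-manualTriggerCh:
		}
		doProbe()
	}
}
```

Run as-is, it prints probes at roughly t=0s, 3s, 4s, 6s, and 9s; the 3s/4s pair is the same "two probes well inside one period" pattern as in the logs above.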

gjkim42 commented 1 year ago

/remove-sig k8s-infra
/sig node
/kind bug

vkatabat commented 1 year ago

Thank you @mochizuki875 for your explanation.

mochizuki875 commented 1 year ago

@vkatabat Yeah :) I've created a PR related to it (#119089).

vkatabat commented 1 year ago

Hi @mochizuki875, could the probe be given an environment variable, or an API callable from the probe handler, that indicates which case, periodSeconds or manualTrigger, caused the probe to be invoked?

That way, pods could implement logic that they only want to execute when invoked by the periodSeconds timer.

mochizuki875 commented 1 year ago

@vkatabat Thank you for your comment.

From my investigation, such an environment variable or API does not seem to exist.

The probe is triggered by both periodSeconds and the manual trigger only while the Pod is not ready; once the Pod becomes ready, it is triggered only by periodSeconds. According to #96731, my understanding is that the manual trigger exists to make the Pod become ready more quickly.

Does this pose any problems for you?

vkatabat commented 1 year ago

Hi @mochizuki875, the manual trigger running the probe early may cause thrashing when applications want to deterministically control whether traffic is allowed to their pods. We see some applications that want to use readiness probe failures to take the application out of service. If there is a momentary recovery in the stats they check, the manual trigger can cause the application to be declared up again, putting more load on it.

Waiting a full periodSeconds for the probe to fail again and take the application out of service could then be too late.

If an API existed through which the probe could know the context in which it was called, I think we could circumvent this thrashing problem. Basically, based on the k8s documentation, I think users are of the opinion that readiness probes are triggered periodically based only on periodSeconds.

mochizuki875 commented 1 year ago

@vkatabat Could you resolve this problem using startupProbe and livenessProbe? The readinessProbe checks whether the container is ready to respond to requests and controls whether the Pod IP address is added to or removed from the endpoints of Services. So in my opinion, it is not meant for controlling when the application is shut down, nor for controlling that interval.

cf) https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#types-of-probe

FYI:

> Basically, based on the k8s documentation, I think users are of the opinion that readiness probes are triggered periodically based only on periodSeconds.

I was confused too before investigating this issue:-(

gjkim42 commented 1 year ago

https://github.com/kubernetes/kubernetes/pull/118409 ^ This looks related.

vkatabat commented 1 year ago

Hi @mochizuki875, I think the livenessProbe was evaluated, but since it restarts the container it may not suffice. Since the readinessProbe removes the pod endpoint from the service, it feels like a good fit for deterministically opening the pod up for traffic when resource usage is within limits and shutting it off when resource usage is above the threshold.

mochizuki875 commented 1 year ago

@vkatabat Our understanding of what the readinessProbe is seems to be the same. Sorry, I may not fully understand what you are saying, so could you please give me a more concrete use case for what you want to do with the readinessProbe?

vkatabat commented 1 year ago

Hi @mochizuki875, our use case is to stop the pod from exhausting resources. We have put a health check in the script to check resource usage: if CPU or memory usage rises to the configured threshold, the script returns exit status 1, otherwise 0.

So we are using the readiness probe for this check, and whenever the readiness probe fails, the pod is removed from the service endpoints. This prevents the service from putting load on the pod. Once the readiness probe succeeds, i.e. CPU and memory usage are below the threshold, the pod IP is added back to the service.

SergeyKanzhelev commented 1 year ago

/triage accepted
/priority backlog

mochizuki875 commented 1 year ago

@vkatabat Thank you for your explanation. However, I don't have any good ideas for your use case, so I've asked for advice in the Slack channel: https://kubernetes.slack.com/archives/C0BP8PW9G/p1689906432111149

tsloughter commented 1 year ago

I'm here because I noticed that if I have an initial delay of 0, the readinessProbe is checked twice on startup in rapid succession, not waiting the configured periodSeconds. I think this is caused by the same behaviour (the manual trigger?) that this issue is about. I actually like that this is how it works (it works out well in my case to get the Pod ready quickly), but I found it wasn't documented anywhere I could find.

I went digging because I mention it in this write-up: https://adoptingerlang.org/docs/production/kubernetes/#readiness-liveness-and-startup-probes

> Despite failing and periodSeconds being 10 it likely won’t take 10 seconds for the Pod to be ready. With an initialDelaySeconds of 0 the readinessProbe is checked immediately during startup and does not wait the configured periodSeconds to check a second time. Because the service does little on startup it will pass the second check. If it doesn’t it will then wait the periodSeconds until it is checked again and the Pod becomes ready.

But without it being officially documented, I was worried it could change and that I should add a note that this is undefined behaviour.

From this issue, it sounds like it is not necessarily considered a bug, so it may not change, but it is also not defined behaviour?

Also, just curious if someone could explain what the flow is; I couldn't understand how the code results in this happening, aside from the fact that there is this separate trigger in addition to the period.

machine424 commented 5 months ago

I agree that having that hidden manual trigger makes this unpredictable and difficult to reason about. Even if we add a log for that as in https://github.com/kubernetes/kubernetes/pull/119089/files, one will need to increase the log level to see it...

And as was noted, this makes the documentation inaccurate, and periodSeconds and especially failureThreshold no longer make sense.

Maybe the fix that was merged as part of https://github.com/kubernetes/kubernetes/pull/98376 regarding startup probes is sufficient for the use case that motivated the addition of this manual trigger for readiness probes.

Or, since the manual trigger is meant to bring the Pod to a ready state quickly, maybe manual-trigger probe failures should not count against failureThreshold, so the parameters remain meaningful (this is not the best solution).

SergeyKanzhelev commented 2 months ago

The manual trigger is meant to make the Pod ready faster. We may need to adjust the periodic timer after the manual trigger. The logic there can definitely use some polishing.
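
One possible shape for that adjustment, written against the simplified worker-loop sketch posted earlier in this thread (hypothetical, not a merged patch; the real change would live in pkg/kubelet/prober/worker.go and may need more care):

```go
// Hypothetical adjustment, not existing kubelet code: when the manual trigger
// fires, reset the periodic ticker so the next periodic probe is rescheduled
// a full period after the manual run instead of keeping the old phase.
select {
case <-stopCh:
	return
case <-probeTicker.C:
case <-manualTriggerCh:
	probeTicker.Reset(period) // time.Ticker.Reset is available since Go 1.15
}
doProbe()
```

With that change, a manual run would no longer produce two probes inside one period, and periodSeconds would keep its documented meaning relative to the most recent probe.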