kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0
110.12k stars 39.41k forks source link

http probes that do return 200 but hang afterwards do not fail if their size is above 10 Kb #122948

Open Loki-Afro opened 8 months ago

Loki-Afro commented 8 months ago

What happened?

http probes that do return 200 but hang afterwards do not fail

recently we had an issue that a service was getting stuck, it returned http code 200, but it never fully returned as it would be a streaming response, however it just wasn't finished, imagine a html document whose last hasn't been written yet but has

resulting in an request whose response hasn't been fully read by k8s, thus not running in a timeout, thus not a failing probe

if the 10 Kb limit wouldn't be in place we would have detected that issue the relevant code can be found here https://github.com/kubernetes/kubernetes/blob/master/pkg/probe/http/http.go#L34

i am aware that returning html or similar isn't suitable for a probe, however it was our last resort measurement and there are probably a lot of more services out there where this is the default situation

as a workaround we switched to an exec probe that calls a wget with a timeout ... really not the way to go

if you need further info let me know

What did you expect to happen?

failing probes

How can we reproduce it (as minimally and precisely as possible)?

have a service whose response exceeds 10Kb and that did not close the socket yet

        readinessProbe:
          httpGet:
            path: /login
            port: 3100
          timeoutSeconds: 4
          failureThreshold: 3
          periodSeconds: 10

Anything else we need to know?

the relevant code can be found here https://github.com/kubernetes/kubernetes/blob/master/pkg/probe/http/http.go#L34

Kubernetes version

```console Client Version: v1.28.2 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: v1.26.10 ```

Cloud provider

ionos

OS version

-

Install tools

-

Container runtime (CRI) and version (if applicable)

-

Related plugins (CNI, CSI, ...) and versions (if applicable)

-
HirazawaUi commented 8 months ago

/sig node

Loki-Afro commented 8 months ago

maybe consider adding a warning when the response is larger than 10kb? might also be an option, like its done for the redirects ..

aojea commented 8 months ago

maybe consider adding a warning when the response is larger than 10kb? might also be an option, like its done for the redirects ..

it will be a breaking change

have a service whose response exceeds 10Kb and that did not close the socket yet

if it returns a 200 but exceeds the size and do not close the socket, why the probe should fail?

Loki-Afro commented 8 months ago

@aojea i know its a strange very specific issue, but thats what you get when you are dealing with nodejs

could you elaborate on why adding a warning is breaking change?

aojea commented 8 months ago

could you elaborate on why adding a warning is breaking change?

oh, wait, let me check, I thought that warning will set the pod as unhealthy, will it keep be considered healthy?

Loki-Afro commented 8 months ago

previously i had a http probe which redirected, so it was still in the http code range of 200-399 resulting in a warning, but the probe did not fail, thus i thought its not a breaking change :D

aojea commented 8 months ago

previously i had a http probe which redirected, so it was still in the http code range of 200-399 resulting in a warning, but the probe did not fail, thus i thought its not a breaking change :D

yeah, haha, tried to correct my comment later and you were faster, my bad, I thought it was changing the state, I think is a fair request, do you want to submit a PR? please add an unit test

Loki-Afro commented 8 months ago

@aojea i'll try :) thanks for acknowledging

Loki-Afro commented 8 months ago

@aojea happy review https://github.com/kubernetes/kubernetes/pull/122970 i also added a test case exactly describing the situation i was facing

ndixita commented 7 months ago

/assign @Loki-Afro

ndixita commented 7 months ago

/triage accepted

AnishShah commented 6 months ago

/priority backlog