kubernetes / ingress-nginx

Ingress NGINX Controller for Kubernetes
https://kubernetes.github.io/ingress-nginx/
Apache License 2.0

`Warning Event: Readiness probe failed: HTTP probe failed with statuscode: 500` occurred during upgrade #12401

Open dbbDylan opened 3 days ago

dbbDylan commented 3 days ago

What happened:

The system operates smoothly if no requests are sent during the upgrade period. However, when Grafana k6 is used to send a continuous stream of HTTPS requests, an error occurs just as the new controller pod becomes fully initialized and the old pod begins to terminate. The issue lasts only a brief moment, yet it can be reproduced consistently.

Here is the warning event: readiness-probe-failed

And here is the K6 test log:

$ sh run.sh

         /\      Grafana   /‾‾/
    /\  /  \     |\  __   /  /
   /  \/    \    | |/ /  /   ‾‾\
  /          \   |   (  |  (‾)  |
 / __________ \  |_|\_\  \_____/

     execution: local
        script: script.js
        output: -

     scenarios: (100.00%) 1 scenario, 1024 max VUs, 2m30s max duration (incl. graceful stop):
              * default: 1024 looping VUs for 2m0s (gracefulStop: 30s)

WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": EOF"                                                                                                                          
WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": read tcp 10.59.89.82:59064->10.47.104.129:80: wsarecv: An existing connection was forcibly closed by the remote host."        
WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": EOF"                                                                                                                                    
WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": read tcp 10.59.89.82:59082->10.47.104.129:80: wsarecv: An existing connection was forcibly closed by the remote host."                                                                                                                          
WARN[0087] Request Failed                                error="Post \"http://my-hostname/v1/tests/post\": EOF"                                                                                                                          

     data_received..................: 37 MB 295 kB/s
     data_sent......................: 12 MB 93 kB/s
     http_req_blocked...............: avg=23.5ms   min=0s       med=0s    max=731.46ms p(90)=0s    p(95)=510.49µs
     http_req_connecting............: avg=14.79ms  min=0s       med=0s    max=343.54ms p(90)=0s    p(95)=0s
     http_req_duration..............: avg=2.81s    min=3.12ms   med=2.8s  max=10.18s   p(90)=4.82s p(95)=5.07s
       { expected_response:true }...: avg=2.81s    min=313.71ms med=2.81s max=10.18s   p(90)=4.83s p(95)=5.07s
     http_req_failed................: 0.26% 117 out of 43956
     http_req_receiving.............: avg=468.21µs min=0s       med=0s    max=14.93ms  p(90)=987µs p(95)=2.21ms
     http_req_sending...............: avg=21.26µs  min=0s       med=0s    max=8.52ms   p(90)=0s    p(95)=0s
     http_req_tls_handshaking.......: avg=0s       min=0s       med=0s    max=0s       p(90)=0s    p(95)=0s
     http_req_waiting...............: avg=2.81s    min=3.12ms   med=2.8s  max=10.18s   p(90)=4.82s p(95)=5.07s
     http_reqs......................: 43956 350.979203/s
     iteration_duration.............: avg=2.83s    min=13.56ms  med=2.82s max=10.18s   p(90)=4.85s p(95)=5.09s
     iterations.....................: 43956 350.979203/s
     vus............................: 10    min=10           max=1024
     vus_max........................: 1024  min=1024         max=1024

running (2m05.2s), 0000/1024 VUs, 43956 complete and 0 interrupted iterations                                                                                                                                                                         
default ✓ [======================================] 1024 VUs  2m0s

During this period I receive numerous empty responses, yet there are no error logs in the ingress-nginx-controller pod. However, any TCP connection that was established before this point remains uninterrupted (tested with `telnet ${my-tcp-service} ${port}`).

So I want to confirm whether the upgrade causes a short-lived service interruption of the ingress-nginx-controller.

What you expected to happen:

No warnings should occur throughout the upgrade process, and every request should be handled, regardless of whether the returned status code is 200.

NGINX Ingress controller version (exec into the pod and run /nginx-ingress-controller --version): v1.11.2 & v1.11.3

Kubernetes version (use kubectl version):

Client Version: v1.30.2 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: v1.30.5

Environment:

$ GUID=1
$ DATETIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
$ curl -X POST "http://my-host/v1/tests/post" -H "Content-Type: application/json" -d "{
    \"id\": \"$GUID\", 
    \"create_time\": \"$DATETIME\",
    \"sleep_time_ms\": 10
}"
{"id":"1","ingress_pod_name_form":"ingress-nginx-controller-67fbb67c7b-tpfpt","create_time":"2024-11-22T09:58:05Z","receive_time":"2024-11-22T09:58:52.425334756Z","finish_time":"2024-11-22T09:58:52.435498011Z","consume_sec":0.010163236}
$ curl -vX POST "http://my-host/v1/tests/post" -H "Content-Type: application/json" -d "{   
    \"id\": \"$GUID\",
    \"create_time\": \"$DATETIME\",
    \"sleep_time_ms\": 10
}"
Note: Unnecessary use of -X or --request, POST is already inferred.
* Host my-host:80 was resolved.
* IPv6: (none)
* IPv4: 10.47.104.129
*   Trying 10.47.104.129:80...
* Connected to dylan-test.gtlc.only.sap (10.47.104.129) port 80
* using HTTP/1.x
> POST /v1/tests/post HTTP/1.1
> Host: dylan-test.gtlc.only.sap
> User-Agent: curl/8.10.1
> Accept: */*
> Content-Type: application/json
> Content-Length: 87
>
* upload completely sent off: 87 bytes
< HTTP/1.1 200 OK
< Date: Fri, 22 Nov 2024 10:00:18 GMT
< Content-Type: application/json; charset=utf-8
< Content-Length: 236
< Connection: keep-alive
< Access-Control-Allow-Credentials: true
< Access-Control-Allow-Headers: Content-Type, Content-Length, Accept-Encoding, X-CSRF-Token, Authorization, accept, origin, Cache-Control, X-Requested-With
< Access-Control-Allow-Methods: POST, OPTIONS, GET, PUT, DELETE
< Access-Control-Allow-Origin: *
< X-Ingress-Pod-Name-From: ingress-nginx-controller-67fbb67c7b-tpfpt
< X-Ingress-Pod-Name: ingress-nginx-controller-67fbb67c7b-tpfpt
<
{"id":"1","ingress_pod_name_form":"ingress-nginx-controller-67fbb67c7b-tpfpt","create_time":"2024-11-22T09:58:05Z","receive_time":"2024-11-22T10:00:18.300485598Z","finish_time":"2024-11-22T10:00:18.310760665Z","consume_sec":0.010275065}* Connection #0 to host my-host left intact

How to reproduce this issue:

To reproduce it, you just need one web service (any pod that can receive HTTP requests is fine). Then you can use this k6 script:

import http from 'k6/http';
import { uuidv4 } from 'https://jslib.k6.io/k6-utils/1.4.0/index.js';

export const options = {
  vus: 1024,
  duration: '120s',
};

function getFormattedDateTimeNow() {
  const now = new Date();
  const isoString = now.toISOString();

  return isoString;
}

function formattedResponseOutput(res) {
  const status = res.status;
  const statusText = res.status_text;
  const to = res.headers['X-Ingress-Pod-Name'];
  const from = res.headers['X-Ingress-Pod-Name-From'];

  if (res.status != 200) {
    console.log(`[${from}] --> [${to}] : { Status: ${status}, Status Text: ${statusText} }`);
  } else {
    console.log(`[${from}] --> [${to}] : { Status: ${status}, ResponseBody: ${res.body} }`);
  }
}

export default function () {
  const url = 'http://my-host/v1/tests/post';
  const sleep_upper_limit_ms = 5000;

  const payload = JSON.stringify({
    "id": uuidv4(),
    "create_time": getFormattedDateTimeNow(),
    "sleep_time_ms": Math.floor(Math.random() * (sleep_upper_limit_ms + 1)),
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
    },
  };

  const res = http.post(url, payload, params);
  formattedResponseOutput(res);
}

Anything else we need to know:

You can use my test image, implemented in Go: image: doublebiao/web-service-gin:v1.0-beta
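
For reference, a minimal sketch of backend manifests that would work for this setup (the names, the host, and the container port 8080 are assumptions, not taken from the real environment):

# Hypothetical manifests for the test backend; names, host, and
# containerPort 8080 are assumptions and may need to be adjusted.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-service-gin
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web-service-gin
  template:
    metadata:
      labels:
        app: web-service-gin
    spec:
      containers:
        - name: web-service-gin
          image: doublebiao/web-service-gin:v1.0-beta
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: web-service-gin
spec:
  selector:
    app: web-service-gin
  ports:
    - port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-service-gin
spec:
  ingressClassName: nginx
  rules:
    - host: my-host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web-service-gin
                port:
                  number: 80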

k8s-ci-robot commented 3 days ago

This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
longwuyuan commented 3 days ago

/remove-kind bug
/kind support

You have only one pod of the controller, so yes, you will get a brief disruption during the upgrade.

You can experiment with more than one replica and with values such as minAvailable.
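
For example, something like this in the Helm chart values (a sketch; the exact numbers are assumptions that have to be tuned per use case):

controller:
  replicaCount: 3
  # minAvailable feeds the PodDisruptionBudget created by the chart,
  # so at least one controller pod keeps serving while the others roll.
  minAvailable: 1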

dbbDylan commented 20 hours ago

Thanks for your support! @longwuyuan

Following your suggestion, I tried changing my value-specific.yaml:

+    replicaCount: 2

But the same error still occurred when the old pod switched to Terminating:

image

I also tried adding `sleep 15` before executing `wait-shutdown`, but that did not work either (roughly as sketched below).
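
For reference, this is approximately how the delay was injected, via the controller's preStop hook in the values file (a sketch; the chart's default hook just runs /wait-shutdown, and 15 seconds is an arbitrary value chosen for this experiment):

controller:
  lifecycle:
    preStop:
      exec:
        # delay the shutdown so endpoints can be removed before nginx exits;
        # the 15s value is arbitrary and only used for this experiment
        command: ["/bin/sh", "-c", "sleep 15 && /wait-shutdown"]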

longwuyuan commented 20 hours ago

Those are not the only values. Please explore the others; each use case is specific. I made several suggestions, but your response says you tried only one of them. For example, increase replicas to maybe 3 and set minAvailable to 1 (https://kubernetes.io/docs/tasks/run-application/configure-pdb/). This is for having at least 1 pod available for new connections.

If it's about graceful draining of established connections, then please look at the other related config options, such as the various timers. There is no well-documented use case for this with the controller; each user finds their most suitable config by trial and error.
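
A sketch of the kind of knobs meant here, assuming a Helm-based install (the concrete numbers are examples, not recommendations):

controller:
  replicaCount: 3
  minAvailable: 1
  # time the pod is given to drain before it is killed
  terminationGracePeriodSeconds: 300
  config:
    # how long nginx workers keep serving in-flight requests after a reload/shutdown
    worker-shutdown-timeout: "240s"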

dbbDylan commented 18 hours ago

I've tried a number of approaches:

None of them worked.

However, I have found that all of the errors come from the old pod while it is executing the “wait-shutdown” script. The old pod still receives requests after the controller starts shutting down, before nginx terminates, which is not what I expected:

image

So I don't think it's a configuration issue, but rather a brief service interruption during graceful termination. In my opinion, the expected process would look like this:

  1. Graceful termination starts.
  2. Network traffic is switched away from the old pod.
  3. (The old pod no longer receives requests.) The nginx service stops.
  4. The old pod is deleted.

But the current behaviour can't guarantee that step 2 happens before step 3. Could you double-check this?

Thanks for your strong support again.

dbbDylan commented 18 hours ago

// From Go's net/http: once Shutdown has been called, ListenAndServe
// refuses to start serving again and immediately returns ErrServerClosed.
func (srv *Server) ListenAndServe() error {
    if srv.shuttingDown() {
        return ErrServerClosed // the fatal error
    }
    addr := srv.Addr
    if addr == "" {
        addr = ":http"
    }
    ln, err := net.Listen("tcp", addr)
    if err != nil {
        return err
    }
    return srv.Serve(ln)
}
dbbDylan commented 17 hours ago

More information:

Once a pod transitions from Running to Terminating, the Endpoints object associated with the ingress-nginx-controller Service should already have had that pod's IP removed. Therefore, I suspect that the issue might not be with the ingress-nginx-controller itself, but rather with the way the k6 load-testing tool handles connections. Could you help me confirm this hypothesis?