knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Requests sent to terminating pods #15211

Open dspeck1 opened 4 months ago

dspeck1 commented 4 months ago

What version of Knative?

1.14.0


Expected Behavior

Knative should be able to handle batches of 200 requests successfully.

Requests should not be routed to pods that are terminating.

Actual Behavior

We send requests to Knative in batches of 200. Each request takes about 5 minutes to process, and all pods in the first batch finish with a 200 return code. When a second batch of 200 requests is sent while pods from the first batch are terminating, many of the requests return 502 Bad Gateway errors: the requests are being routed to pods that are terminating.

Steps to Reproduce the Problem

Watch for pods to begin terminating and then send in a new batch of requests. Kourier is the ingress, and the Knative autoscaler is in use.
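
A minimal sketch of this kind of load test is below, assuming a plain HTTP ksvc; the service URL, request count, and timeout are illustrative and not taken from the actual test setup.

```python
# Hypothetical tester: sends a batch of concurrent requests to the ksvc URL
# and reports any non-200 responses (e.g. the 502s described above).
import concurrent.futures
import urllib.error
import urllib.request

KSVC_URL = "http://app.default.example.com"  # assumed ksvc URL, not the real one
NUM_REQUESTS = 200

def send_request(i: int) -> int:
    try:
        with urllib.request.urlopen(KSVC_URL, timeout=600) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 from the ingress / queue-proxy
    except urllib.error.URLError:
        return -1  # connection-level failure

if __name__ == "__main__":
    with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_REQUESTS) as pool:
        statuses = list(pool.map(send_request, range(NUM_REQUESTS)))
    failures = [s for s in statuses if s != 200]
    print(f"{len(statuses) - len(failures)} ok, {len(failures)} failed: {failures}")
```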

skonto commented 4 months ago

Hi @dspeck1, could you please provide more info on how to reproduce this, e.g. the ksvc definition and environment setup? There was a similar issue in the past, but it was not reproducible and its status was unclear.

/triage needs-user-input

dspeck1 commented 4 months ago

Hi @skonto. I posted testing code here. The app folder has the Knative service, the tester folder sends simultaneous requests, and the Knative operator config is here. To replicate the issue: send a job with 200 requests, watch for the pods to start terminating, then send job-2 and observe 502 Bad Gateway errors in the responses. It does not happen every time. I have also noticed it does not happen if the pod runs for only a short time (10 or 30 seconds); it occurs on long requests of around 5 minutes.
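
As a rough stand-in for the app side described above, here is a minimal long-running HTTP service, assuming roughly 5 minutes of processing per request; the port, timing, and handler logic are illustrative and not taken from the linked repo.

```python
# Hypothetical long-running service: each request sleeps ~5 minutes before
# responding, which matches the window in which the 502s are observed.
import os
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

PROCESSING_SECONDS = 300  # assumed stand-in for the real 5-minute workload

class SlowHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(PROCESSING_SECONDS)  # simulate the long-running work
        body = b"done\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    port = int(os.environ.get("PORT", "8080"))  # Knative sets PORT for the user container
    ThreadingHTTPServer(("", port), SlowHandler).serve_forever()
```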

The error below is from the queue-proxy when this happens. We see the same behavior on Google Cloud GKE and on an on-premises Kubernetes cluster.

logger: "queueproxy"
message: "error reverse proxying request; sockstat: sockets: used 8
TCP: inuse 3 orphan 11 tw 12 alloc 183 mem 63
UDP: inuse 0 mem 0
UDPLITE: inuse 0
RAW: inuse 0
FRAG: inuse 0 memory 0
"
stacktrace: "knative.dev/pkg/network.ErrorHandler.func1
    knative.dev/pkg@v0.0.0-20240416145024-0f34a8815650/network/error_handler.go:33
net/http/httputil.(*ReverseProxy).ServeHTTP
    net/http/httputil/reverseproxy.go:472
knative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP
    knative.dev/serving/pkg/queue/request_metric.go:199
knative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3.2
    knative.dev/serving/pkg/queue/handler.go:65
knative.dev/serving/pkg/queue.(*Breaker).Maybe
    knative.dev/serving/pkg/queue/breaker.go:155
knative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3
    knative.dev/serving/pkg/queue/handler.go:63
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2166
knative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4
    knative.dev/serving/pkg/queue/forwarded_shim.go:54
net/http.HandlerFunc.ServeHTTP
    net/http/server.go:2166
knative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4
    knative.dev/serving/pkg/http/handler/timeout.go:118"
timestamp: "2024-05-22T19:42:56.16014057Z"

Below are similar issues I have found: one mentions the lack of a graceful shutdown of the user-container for in-flight requests, and another covers a timeout issue on long requests.

Thanks for your help! Please let me know anything else you need.

dspeck1 commented 4 months ago

Here is another related issue. https://github.com/knative/serving/issues/9355

dspeck1 commented 4 months ago

@skonto - checking in.

skonto commented 4 months ago

Hi @dspeck1 would you mind checking if this is related to the graceful shutdown in Python https://github.com/knative/serving/issues/12865#issuecomment-1734823618?

dspeck1 commented 3 months ago

Thanks! I was away on vacation and am back now. I am testing the Python container with dumb-init for SIGTERM handling and will let you know.
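
For context, a minimal sketch of explicit SIGTERM handling in a Python HTTP container, in the spirit of the graceful-shutdown discussion in #12865; dumb-init (or an equivalent PID 1) is what lets the signal reach the process, and the port and handler below are illustrative only.

```python
# Sketch: on SIGTERM, stop accepting new connections and let in-flight
# requests finish before the process exits.
import signal
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

class App(ThreadingHTTPServer):
    daemon_threads = False  # track handler threads so server_close() can join them

server = App(("", 8080), Handler)

def on_sigterm(signum, frame):
    # shutdown() must run on another thread, otherwise it deadlocks with the
    # serve_forever() loop that the signal interrupted.
    threading.Thread(target=server.shutdown).start()

signal.signal(signal.SIGTERM, on_sigterm)
server.serve_forever()   # returns once shutdown() has been requested
server.server_close()    # joins in-flight handler threads before exiting
```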

dspeck1 commented 3 months ago

We are still receiving 502 Bad Gateway responses with dumb-init.

ashrafguitoni commented 2 months ago

> Hi @dspeck1 would you mind checking if this is related to the graceful shutdown in Python #12865 (comment)?

I think that's unrelated... Using dumb-init allows Knative pods to terminate gracefully (otherwise they stay stuck in Terminating for the duration of timeoutSeconds as specified in the Service spec).

Our team has been having the same issue as @dspeck1 for a long time. We have tried many things, including NGINX Unit (which seemed to reduce the frequency of the errors, for some reason), but the problem still occurs. We sometimes get the error when services are scaled to zero and we send a few requests.

Edit: Actually, it seems more similar issues have been popping up, and one of them mentions a simple potential fix (I don't know whether it would work for our case), so I'll try it, and maybe you could as well: https://github.com/knative/serving/issues/15352#issuecomment-2194761094

Edit 2: Setting responseStartTimeoutSeconds did not work; I'm still getting the same error logs in the queue-proxy...