knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

User container returns 200 but queue proxy returns 502 #14818

Open · ashrafguitoni opened this issue 8 months ago

ashrafguitoni commented 8 months ago

/area networking

What version of Knative?

1.12.0

Expected Behavior

Requests complete successfully.

Actual Behavior

Requests fail with a 502 error code even though the user container returns a 200 response. See the attached Jaeger tracing screenshot.

Looking at the queue proxy logs, I see the following error message:

{
  "severity": "ERROR",
  "timestamp": "2024-01-16T05:50:22.87372292Z",
  "logger": "queueproxy",
  "caller": "network/error_handler.go:33",
  "message": "error reverse proxying request; sockstat: sockets: used 36\nTCP: inuse 23 orphan 0 tw 17 alloc 55 mem 31\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n",
  "commit": "2659cc3",
  "knative.dev/key": "default/semi-structured-form-workflow-00002",
  "knative.dev/pod": "semi-structured-form-workflow-00002-deployment-6dd9676c96-d9f2z",
  "error": "context canceled",
  "stacktrace": "knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/pkg@v0.0.0-20231023151236-29775d7c9e5c/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3.2\n\tknative.dev/serving/pkg/queue/handler.go:65\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:155\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:63\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"
}

Steps to Reproduce the Problem

Since we're hitting this issue in a fairly complex system, I don't have a minimal reproducible example yet, but our team is working on one.

One thing we've noticed for sure: the issue only seems to happen when traffic exceeds the revision's current capacity (so autoscaling is triggered). We use target-burst-capacity: -1 for our services (sketched below).
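For context, a per-revision setting like that would look roughly like this. This is a minimal sketch, not our actual manifest: the service name is taken from the revision name in the logs above and the container image is a placeholder.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: semi-structured-form-workflow
spec:
  template:
    metadata:
      annotations:
        # -1 removes the burst-capacity threshold, so the activator
        # stays in the request path for this revision
        autoscaling.knative.dev/target-burst-capacity: "-1"
    spec:
      containers:
        - image: example.com/app:latest   # placeholder

The same key can also be set cluster-wide via target-burst-capacity in the config-autoscaler ConfigMap.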

github-actions[bot] commented 5 months ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

skonto commented 4 months ago

This probably means that requests are timing out and getting cancelled; max-revision-timeout-seconds is 10m by default.
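For reference, that default comes from the config-defaults ConfigMap in the knative-serving namespace. A minimal sketch with the stock values (as shipped, before any overrides):

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-defaults
  namespace: knative-serving
data:
  # default per-revision request timeout (5 minutes)
  revision-timeout-seconds: "300"
  # upper bound allowed for any per-revision timeoutSeconds (10 minutes)
  max-revision-timeout-seconds: "600"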

ashrafguitoni commented 2 months ago

@skonto Well, the defaults are set to this:

revision-timeout-seconds: "7200"
max-revision-timeout-seconds: "7201"
revision-response-start-timeout-seconds: "7200"

And the per-revision timeoutSeconds and responseStartTimeoutSeconds are also set to high values, much higher than 10 minutes (roughly as sketched below). Any other reasons why this could be happening?
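For completeness, a sketch of how those per-revision fields sit on the Service spec; names and values here are illustrative, not our exact manifest:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: semi-structured-form-workflow
spec:
  template:
    spec:
      # must not exceed max-revision-timeout-seconds, or the revision is rejected
      timeoutSeconds: 7200
      # time allowed until the first byte of the response is written
      responseStartTimeoutSeconds: 7200
      containers:
        - image: example.com/app:latest   # placeholder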