knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0

Unexpected retries with error message logged in activator #14370

Open anomam opened 1 year ago

anomam commented 1 year ago

/area networking

Ask your question here:

Hi @dprotaso , as requested in our slack discussion, I'm creating an issue here. Thank you for your help!

issue

We're using a KafkaSource + Knative Service to run long-running jobs (up to 20 minutes of runtime per request), and sometimes requests get retried even though the service shows no sign that the original request failed. In fact, when a request is retried (often processed by a different service pod), the original eventually finishes as well, so our system ends up processing requests twice. Our services are implemented with Python FastAPI, and the container entrypoint is managed with dumb-init to get proper termination behavior. From our observation, the service pods are not being restarted (e.g., due to OOM kills).
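For context, here's a minimal sketch of the kind of Service we run (names and values are placeholders, not our actual config). For a 20-minute request, the revision's `timeoutSeconds` has to cover the runtime, and the cluster-wide `max-revision-timeout-seconds` in `config-defaults` (600s by default) has to be raised to allow it:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: long-job-service                # placeholder name
spec:
  template:
    spec:
      # Must cover the ~20 min per-request runtime; the cluster's
      # max-revision-timeout-seconds (config-defaults) must allow this value.
      timeoutSeconds: 1800
      containers:
        - image: registry.example.com/long-job:latest  # placeholder image
```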

The temporary fix (which is not acceptable in production) is to pin the number of service pods. If no pods are being terminated, the issue does not occur and no requests are retried. This is causing us problems in production, and we can't figure out what could be behind it.
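For reference, one way to pin the replica count is the autoscaler's scale-bound annotations on the revision template (a sketch; the value is a placeholder, the point is equal lower and upper bounds so no pods get terminated):

```yaml
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "3"  # placeholder value; equal bounds
        autoscaling.knative.dev/max-scale: "3"  # keep the pod count fixed
```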

We're using version 1.7.1 of Knative Serving and Eventing.

Are there any potential causes that we should investigate?

reproducing the issue

We've been able to reproduce the issue fairly reliably when pods are being terminated (e.g., scaled down) while long-running requests are still in flight.
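As a sketch of the reproduction setup (values are hypothetical; the service name `dummy-workflow` matches the `knative.dev/key` in the activator log below), the autoscaler can be tuned to remove pods quickly while a long request is still running:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: dummy-workflow                  # matches knative.dev/key in the log below
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/window: "60s"           # short stable window -> quick scale-down
        autoscaling.knative.dev/scale-down-delay: "0s"  # remove pods as soon as traffic drops
    spec:
      timeoutSeconds: 1800              # cover the long-running request
      containers:
        - image: registry.example.com/long-job:latest   # placeholder image
```

This matches the condition above: pods being terminated while requests are in flight.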

When the retries occur, the activator logs the following error:

{"severity":"ERROR","timestamp":"2023-09-12T04:10:28.963720601Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 21\nTCP: inuse 9 orphan 0 tw 22 alloc 908 mem 80\nUDP: inuse 0 mem 23\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"9402a71-dirty","knative.dev/controller":"activator","knative.dev/pod":"activator-54cdf744fb-4hdcf","knative.dev/key":"default/dummy-workflow-00001","error":"EOF","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/pkg@v0.0.0-20220818004048-4a03844c0b15/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:140\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:301\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:143\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:96\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:235\nknative.dev/serving/pkg/queue.(*Breaker).Maybe\n\tknative.dev/serving/pkg/queue/breaker.go:155\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:225\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:523\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:89\nknative.dev/serving/pkg/activator/handler.(*ConcurrencyReporter).Handler.func1\n\tknative.dev/serving/pkg/activator/handler/concurrency_reporter.go:253\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\ngo.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP\n\tgo.opencensus.io@v0.23.0/plugin/ochttp/server.go:92\nknative.dev/serving/pkg/activator/handler.NewTracingHandler.func1\n\tknative.dev/serving/pkg/activator/handler/tracing_handler.go:38\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2084\nknative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP\n\tknative.dev/serving/pkg/http/request_log.go:131\nknative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/metric_handler.go:64\nknative.dev/serving/pkg/activator/handler.(*contextHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/context_handler.go:82\nknative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/probe_handler.go:46\nknative.dev/networking/pkg/http/probe.(*handler).ServeHTTP\n\tknative.dev/networking@v0.0.0-20220818010248-e51df7cdf571/pkg/http/probe/handler.go:39\nknative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/healthz_handler.go:44\ngolang.org/x/net/http2/h2c.h2cHandler.ServeHTTP\n\tgolang.org/x/net@v0.0.0-20220225172249-27dd8689420f/http2/h2c/h2c.go:104\nnet/http.serverHandler.ServeHTTP\n\tnet/http/server.go:2916\nnet/http.(*conn).serve\n\tnet/http/server.go:1966"}

Cleaning up the error message and stack trace from the log above:

error reverse proxying request; sockstat: sockets: used 21  
TCP: inuse 9 orphan 0 tw 22 alloc 908 mem 80  
UDP: inuse 0 mem 23  
UDPLITE: inuse 0  
RAW: inuse 0  
FRAG: inuse 0 memory 0
knative.dev/pkg/network.ErrorHandler.func1  
        knative.dev/pkg@v0.0.0-20220818004048-4a03844c0b15/network/error_handler.go:33  
knative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1  
        knative.dev/serving/pkg/activator/handler/handler.go:140  
net/http/httputil.(*ReverseProxy).ServeHTTP  
        net/http/httputil/reverseproxy.go:301  
knative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest  
        knative.dev/serving/pkg/activator/handler/handler.go:143  
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1  
        knative.dev/serving/pkg/activator/handler/handler.go:96  
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1  
        knative.dev/serving/pkg/activator/net/throttler.go:235  
knative.dev/serving/pkg/queue.(*Breaker).Maybe  
        knative.dev/serving/pkg/queue/breaker.go:155  
knative.dev/serving/pkg/activator/net.(*revisionThrottler).try  
        knative.dev/serving/pkg/activator/net/throttler.go:225  
knative.dev/serving/pkg/activator/net.(*Throttler).Try  
        knative.dev/serving/pkg/activator/net/throttler.go:523  
knative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/handler.go:89  
knative.dev/serving/pkg/activator/handler.(*ConcurrencyReporter).Handler.func1  
        knative.dev/serving/pkg/activator/handler/concurrency_reporter.go:253  
net/http.HandlerFunc.ServeHTTP  
        net/http/server.go:2084  
go.opencensus.io/plugin/ochttp.(*Handler).ServeHTTP  
        go.opencensus.io@v0.23.0/plugin/ochttp/server.go:92  
knative.dev/serving/pkg/activator/handler.NewTracingHandler.func1  
        knative.dev/serving/pkg/activator/handler/tracing_handler.go:38  
net/http.HandlerFunc.ServeHTTP  
        net/http/server.go:2084  
knative.dev/serving/pkg/http.(*RequestLogHandler).ServeHTTP  
        knative.dev/serving/pkg/http/request_log.go:131  
knative.dev/serving/pkg/activator/handler.(*MetricHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/metric_handler.go:64  
knative.dev/serving/pkg/activator/handler.(*contextHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/context_handler.go:82  
knative.dev/serving/pkg/activator/handler.(*ProbeHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/probe_handler.go:46  
knative.dev/networking/pkg/http/probe.(*handler).ServeHTTP  
        knative.dev/networking@v0.0.0-20220818010248-e51df7cdf571/pkg/http/probe/handler.go:39  
knative.dev/serving/pkg/activator/handler.(*HealthHandler).ServeHTTP  
        knative.dev/serving/pkg/activator/handler/healthz_handler.go:44  
golang.org/x/net/http2/h2c.h2cHandler.ServeHTTP  
        golang.org/x/net@v0.0.0-20220225172249-27dd8689420f/http2/h2c/h2c.go:104  
net/http.serverHandler.ServeHTTP  
        net/http/server.go:2916  
net/http.(*conn).serve  
        net/http/server.go:1966
github-actions[bot] commented 10 months ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.