knative / serving

Kubernetes-based, scale-to-zero, request-driven compute
https://knative.dev/docs/serving/
Apache License 2.0
5.57k stars 1.16k forks source link

[gRPC/http2 auto-detect] Flakiness and potential connection leak #10962

Open markusthoemmes opened 3 years ago

markusthoemmes commented 3 years ago

In what area(s)?

/area networking

TestHTTPProbeAutoHTTP2 is quite flaky currently and I almost went crazy trying to debug why. The test failures look as if two sequential HTTP requests race. Turns out: They do! Adding a 100ms sleep to the beginning of the handler makes the issue very reproducible.

So what happens (I believe) is that the upgrade probe request gets handled by the h2c handler and is then "stuck" in the upgrade functionality. The request will be dispatched to the handler asynchronously whereas the HTTP2 probe has already returned with a 101 Switch Protocols.

I think this also means we're leaking connections here as our HTTP2 probe actually establishes an HTTP2 connection. The returned response is not indicative of that being closed. I experimented with aborting the connection by cancelling a context on the initial request (didn't work) and by sending actual HTTP2 frames via the connection (didn't work yet either).

To properly fix this, we should make sure that the http2UpgradeProbe function returns in lockstep with the first request actually being done (and the HTTP2 connection closed). Only then can we guarantee the test to work as intended (though I'm not super sure it's "correct" currently) and guarantee that we're not leaking connections (which is my bigger concern currently).

/assign @rafaeltello

markusthoemmes commented 3 years ago

Ha! Writing the HTTP2 preface and then closing the response seems to show the desired result.

evankanderson commented 3 years ago

/assign @markusthoemmes

evankanderson commented 3 years ago

/triage accepted

github-actions[bot] commented 3 years ago

This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

markusthoemmes commented 2 years ago

Not working on this

/unassign

dprotaso commented 2 years ago

/unassign @rafaeltello