Open jgraettinger opened 4 years ago
An effective mitigation of this behavior is to set a non-zero SetReadTimeout. That timeout applies only while sniffing an appropriate mux, just after initial connection.
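To illustrate why a read timeout helps, here is a minimal standard-library sketch of the idea, not cmux's actual implementation. It assumes the sniffer needs the first 24 bytes (the length of the HTTP/2 client preface); the names sniff and silentClientTimesOut are hypothetical, and net.Pipe stands in for a real client that connects and then sends nothing.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net"
	"os"
	"time"
)

// sniff reads the first n bytes of conn, giving up after timeout.
// A zero timeout means "wait forever", which is the case that lets a
// silent client pin the sniffing goroutine indefinitely.
func sniff(conn net.Conn, n int, timeout time.Duration) ([]byte, error) {
	if timeout > 0 {
		if err := conn.SetReadDeadline(time.Now().Add(timeout)); err != nil {
			return nil, err
		}
		// Clear the deadline afterwards so it only bounds the sniffing
		// phase and not later reads by whoever handles the connection.
		defer conn.SetReadDeadline(time.Time{})
	}
	buf := make([]byte, n)
	if _, err := io.ReadFull(conn, buf); err != nil {
		return nil, err
	}
	return buf, nil
}

// silentClientTimesOut (hypothetical helper) connects a client that never
// writes and reports whether sniffing ended with a deadline error.
func silentClientTimesOut() bool {
	server, client := net.Pipe()
	defer server.Close()
	defer client.Close()

	_, err := sniff(server, 24, 50*time.Millisecond)
	return errors.Is(err, os.ErrDeadlineExceeded)
}

func main() {
	fmt.Println("sniff timed out:", silentClientTimesOut())
}
```

With a deadline in place, the sniffing goroutine returns on its own even if the client stays connected and silent, so anything waiting on it is held up for at most the timeout.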
Would it be reasonable to add a wg.Done() before returning the error on line https://github.com/soheilhy/cmux/blob/master/cmux.go#L165? For me this solution seems to work, but I do not know what other potential problems this might cause.
Adding the wg.Done() only helps me, I guess, because I know that the for loop inside serve ran at least once first.
Documenting my findings debugging a production issue:
tl;dr is that a client can mess with stopping of a server, because the sniffing mechanism has no notion of draining for connections that have yet to be matched to a sub-listener. The specific scenario I encountered is:
Net effect is that grpc.Server.Stop/GracefulStop() & cmux.Serve() can't return until the client connection is remotely closed.
Not entirely sure what the right behavior here is. My gut take is that cmux Accept() should preserve the exit semantics of the wrapped listener's Accept, and return its error even though there are outstanding, still-to-be-sniffed connections.
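One possible way to give un-sniffed connections a notion of draining is sketched below using only the standard library. This is an assumption about how it could be done, not how cmux works: the pending type and the drainReturns helper are hypothetical. The idea is to track connections that are still being sniffed and close them once the wrapped listener's Accept has failed, so the wait on the WaitGroup can complete.

```go
package main

import (
	"fmt"
	"io"
	"net"
	"sync"
	"time"
)

// pending (hypothetical) tracks connections that are still being sniffed.
type pending struct {
	mu    sync.Mutex
	conns map[net.Conn]struct{}
	wg    sync.WaitGroup
}

func newPending() *pending {
	return &pending{conns: make(map[net.Conn]struct{})}
}

// sniff registers c and starts a goroutine that blocks reading its header.
func (p *pending) sniff(c net.Conn) {
	p.mu.Lock()
	p.conns[c] = struct{}{}
	p.mu.Unlock()

	p.wg.Add(1)
	go func() {
		defer p.wg.Done()
		buf := make([]byte, 24)
		// Blocks until the client sends data or c is closed.
		_, _ = io.ReadFull(c, buf)
		p.mu.Lock()
		delete(p.conns, c)
		p.mu.Unlock()
	}()
}

// drain closes every connection still being sniffed, then waits for the
// sniffing goroutines to exit. It would be called once Accept on the
// wrapped listener has returned an error.
func (p *pending) drain() {
	p.mu.Lock()
	for c := range p.conns {
		c.Close()
	}
	p.mu.Unlock()
	p.wg.Wait()
}

// drainReturns (hypothetical helper) reports whether drain completes while
// a client is connected but silent.
func drainReturns() bool {
	server, client := net.Pipe()
	defer client.Close() // the client stays open and never writes

	p := newPending()
	p.sniff(server)

	done := make(chan struct{})
	go func() {
		p.drain()
		close(done)
	}()

	select {
	case <-done:
		return true
	case <-time.After(2 * time.Second):
		return false
	}
}

func main() {
	fmt.Println("drain returned:", drainReturns())
}
```

Closing the server side of a pending connection unblocks its read, so shutdown no longer depends on the remote client closing first. Whether dropping such connections is acceptable is a design decision for the library.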
Collected traces:
cmux.Serve has found that the wrapped listener's Accept has errored. It's trying to return, but is blocked on its own WG within a defer:
That WG can't finish because a connection thread is stuck waiting to sniff an HTTP/2 header:
Meanwhile, gRPC Serve() is blocked waiting for Accept to return. It must do so before it can notify the gRPC server’s own WG, which is a prerequisite for GracefulStop or Stop to return:
For completeness, here's where GracefulStop is wedged waiting on its WG, held hostage by grpc.Serve: