soheilhy / cmux

Connection multiplexer for GoLang: serve different services on the same port!
Apache License 2.0
2.53k stars 197 forks source link

badly-behaved client can deadlock server stopping #76

Open jgraettinger opened 4 years ago

jgraettinger commented 4 years ago

Documenting my findings debugging a production issue:

tl;dr is that a client can mess with stopping of a server, because the sniffing mechanism has no notion of draining for connections that have yet to be matched to a sub-listener. The specific scenario I encountered is:

Net effect is that grpc.Server.Stop/GracefulStop() & cmux.Serve() can't return until the client connection is remotely closed.

Not entirely sure what the right behavior here is. My gut take is that cmux Accept() should preserve the exit semantics of the wrapped listener Accept, and return its error even though there our outstanding, still-to-be-sniffed connections.

Collected traces:

crux.Serve has found that the wrapped listener Accept has error’d. It’s trying to return, but is blocked on it's own WG within a defer:

goroutine 1798 [semacquire, 5 minutes]:
sync.runtime_Semacquire(0xc00066c008)
        /usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc00066c000)
        /usr/local/go/src/sync/waitgroup.go:130 +0x64
github.com/soheilhy/cmux.(*cMux).Serve.func1(0xc00012a4b0, 0xc00066c000)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:150 +0x55
github.com/soheilhy/cmux.(*cMux).Serve(0xc00012a4b0, 0x1497b00, 0xc006278640)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:165 +0x120
go.gazette.dev/core/server.(*Server).QueueTasks.func1(0x20, 0x20)
        /gazette/server/server.go:124 +0x40
go.gazette.dev/core/task.(*Group).GoRun.func1(0x0, 0x0)
        /gazette/task/group.go:72 +0x43
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000530a80, 0xc001390080)
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:54 +0x66

That WG can't finish because a connection thread is stuck waiting to sniff an HTTP/2 header:

goroutine 491775429 [IO wait, 31 minutes]:
internal/poll.runtime_pollWait(0x7f7a20ed50f8, 0x72, 0xffffffffffffffff)
        /usr/local/go/src/runtime/netpoll.go:203 +0x55
internal/poll.(*pollDesc).wait(0xc0065afb98, 0x72, 0x0, 0x9, 0xffffffffffffffff)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        /usr/local/go/src/internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc0065afb80, 0xc001e66498, 0x9, 0x9, 0x0, 0x0, 0x0)
        /usr/local/go/src/internal/poll/fd_unix.go:169 +0x19b
net.(*netFD).Read(0xc0065afb80, 0xc001e66498, 0x9, 0x9, 0x865b8e, 0x10401, 0xc000000000)
        /usr/local/go/src/net/fd_unix.go:202 +0x4f
net.(*conn).Read(0xc0013b8020, 0xc001e66498, 0x9, 0x9, 0x0, 0x0, 0x0)
        /usr/local/go/src/net/net.go:184 +0x8e
github.com/soheilhy/cmux.(*bufferedReader).Read(0xc004b59820, 0xc001e66498, 0x9, 0x9, 0xc000184a80, 0x7f7a9f3327d0, 0x0)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/buffer.go:53 +0x12c
io.ReadAtLeast(0x1496d40, 0xc004b59820, 0xc001e66498, 0x9, 0x9, 0x9, 0x85db95, 0xc0091bf440, 0xc0091b0004)
        /usr/local/go/src/io/io.go:310 +0x87
io.ReadFull(...)
        /usr/local/go/src/io/io.go:329
golang.org/x/net/http2.readFrameHeader(0xc001e66498, 0x9, 0x9, 0x1496d40, 0xc004b59820, 0x0, 0x0, 0xc0091bf440, 0x0)
        /gazette/.build/go-path/pkg/mod/golang.org/x/net@v0.0.0-20200324143707-d3edc9973b7e/http2/frame.go:237 +0x87
golang.org/x/net/http2.(*Framer).ReadFrame(0xc001e66460, 0x14a4d60, 0xc0091bf440, 0x0, 0x0)
        /gazette/.build/go-path/pkg/mod/golang.org/x/net@v0.0.0-20200324143707-d3edc9973b7e/http2/frame.go:492 +0xa1
github.com/soheilhy/cmux.matchHTTP2Field(0x1497b80, 0xc0013b8020, 0x1496d40, 0xc004b59820, 0x121d9ea, 0xc, 0xc0054e9ec0, 0x415285)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/matchers.go:236 +0x148
github.com/soheilhy/cmux.HTTP2MatchHeaderFieldSendSettings.func1(0x1497b80, 0xc0013b8020, 0x1496d40, 0xc004b59820, 0xc0013b8020)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/matchers.go:173 +0xca
github.com/soheilhy/cmux.(*cMux).serve(0xc00012a4b0, 0x14c1ea0, 0xc0013b8020, 0xc00015c120, 0xc00066c000)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:184 +0x1d3
created by github.com/soheilhy/cmux.(*cMux).Serve
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:171 +0x19d

Meanwhile, gRPC Serve() is blocked waiting for Accept to return. It must do so before it can notify the gRPC server’s own WG, which is a prerequisite for GracefulStop or Stop to return:

goroutine 1800 [chan receive, 5 minutes]:
github.com/soheilhy/cmux.muxListener.Accept(...)
        /gazette/.build/go-path/pkg/mod/github.com/soheilhy/cmux@v0.1.4/cmux.go:229
google.golang.org/grpc.(*Server).Serve(0xc0001661a0, 0x14b2060, 0xc000113820, 0x0, 0x0)
        /gazette/.build/go-path/pkg/mod/google.golang.org/grpc@v1.28.0/server.go:621 +0x210
go.gazette.dev/core/server.(*Server).QueueTasks.func3(0x0, 0x1)
        /gazette/server/server.go:136 +0x44
go.gazette.dev/core/task.(*Group).GoRun.func1(0x0, 0x0)
        /gazette/task/group.go:72 +0x43
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000530a80, 0xc0013900c0)
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:54 +0x66

For completeness, here's where GracefulStop is wedged waiting on it's WG, held hostage by grpc.Serve:

goroutine 1802 [semacquire, 5 minutes]:
sync.runtime_Semacquire(0xc00016631c)
        /usr/local/go/src/runtime/sema.go:56 +0x42
sync.(*WaitGroup).Wait(0xc00016631c)
        /usr/local/go/src/sync/waitgroup.go:130 +0x64
google.golang.org/grpc.(*Server).GracefulStop(0xc0001661a0)
        /gazette/.build/go-path/pkg/mod/google.golang.org/grpc@v1.28.0/server.go:1551 +0x1b1
go.gazette.dev/core/broker.(*Service).QueueTasks.func2(0xc00030eb90, 0x50)
        /gazette/broker/service.go:71 +0xa7
go.gazette.dev/core/task.(*Group).GoRun.func1(0x14b6660, 0xc0004b03c0)
        /gazette/task/group.go:72 +0x43
golang.org/x/sync/errgroup.(*Group).Go.func1(0xc000530a80, 0xc001390100)
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:57 +0x59
created by golang.org/x/sync/errgroup.(*Group).Go
        /gazette/.build/go-path/pkg/mod/golang.org/x/sync@v0.0.0-20200317015054-43a5402ce75a/errgroup/errgroup.go:54 +0x66
jgraettinger commented 4 years ago

An effective mitigation of this behavior is to set a non-zero SetReadTimeout. That timeout applies only while sniffing an appropriate mux, just after initial connection

brentyates commented 3 years ago

Would it be reasonable to add a wg.Done() before returning the error on line https://github.com/soheilhy/cmux/blob/master/cmux.go#L165? For me this solution seems to work, but I do not know what other potential problems this might cause.

brentyates commented 3 years ago

Adding the wg.Done() only helps me I guess because I know that the for loop inside serve ran at least once first.