Closed GuptaManan100 closed 1 year ago
Trying to fix this issue isn't trivial. Here are my findings (and issues that have presented themselves).
So, we have a function call blocked on stream.SendMsg
.
The first solution we explored was to be able to close the stream. Unfortunately, their so no way to do that from the server side. From the client side, just cancelling the context works but on the server, we don't have access to the cancel func to be able to do that. I looked at the grpc repository to see how to do that. The only thing that I found was at https://github.com/grpc/grpc-go/issues/465#issuecomment-259944377 where it is said that the only way to unblock is to call stream.RecvMsg
, but that too isn't possible on the server side since this Recv has to happen on the client stream.
I then looked at the grpc docs. In one of the places https://grpc.io/docs/what-is-grpc/core-concepts/#cancelling-an-rpc, they say that a RPC can be cancelled from the client or the server, but they don't specify any way of doing that explicitly. https://grpc.io/docs/languages/go/basics/ has the docs on the different streaming rpcs and on how to call them. Here too, they don't mention how to stop a stream from the server. Maybe @vmg knows more about this. Me and @harshit-gangal couldn't figure this out.
Another possible fix is that we can just close the entire grpc server and restart it. According to the docs of grpc Server.Stop, https://pkg.go.dev/google.golang.org/grpc#Server.Stop this will close all active RPCs and then close the server. This could work for our use case. We can just restart the server in CheckMySQL.
Next possibiliy is to kill the vttablet process when CheckMySQL hangs. This can be done by an existing funciton setTimeBomb
. Currently the default is to not do this. We can parameterize the time-bomb value and just say that CheckMySQL being stuck will kill the vttablet after 2 minutes or whatever. (I personally don't think this is a good idea.)
Final possibility is that we can actually create a separate go-routine which sends the message and in the callback wait for the go-routine to finish or a context that we create to be cancelled. If the context is cancelled, we can exit out and error. What this accomplishes is that the we continue to be blocked on sendMsg
but now on a separate go-routine. So we can actually end the request and unblock CheckMySQL. Here are some code-snippets to show what I mean -
// In StreamExecute
ctx, cancel := context.WithCancel(ctx)
defer cancel()
tabletserver.ContextCancels = append(tabletserver.ContextCancels, cancel)
err = q.server.StreamExecute(ctx, request.Target, request.Query.Sql, request.Query.BindVariables, request.TransactionId, request.ReservedId, request.Options, func(reply *sqltypes.Result) error {
errChan := make(chan error)
log.Errorf("Sending a stream now with %d rows", len(reply.Rows))
go func() {
errChan <- stream.Send(&querypb.StreamExecuteResponse{
Result: sqltypes.ResultToProto3(reply),
})
}()
select {
case err2 := <-errChan:
log.Errorf("Finished sending a stream now, error = %v", err2)
return err2
case <-ctx.Done():
return ctx.Err()
}
})
// And in closeAll we also call cancelContexts before waiting for empty requests var ContextCancels []context.CancelFunc
func cancelContexts() { for _, cancel := range ContextCancels { cancel() } }
The code above is just a poc, it isn't production grade. I don't even cleanup the cancel funcs after use. But I have verified that this indeed does unblock CheckMySQL.
On further investigation, me and @harshit-gangal have gleaned more information. @harshit-gangal had an epiphany and remembered an old bug that we had a long time ago - https://github.com/vitessio/vitess/issues/5497. We verified that this is the underlying issue!
If a mysql client is connected to a vtgate, and it executes a streaming query and proceeds to close the entire connection, then we see vtgate and vttablets are stuck in StreamExecute
calls.
The steps for reproduction are as follows -
101_initial_cluster.sh
)insert into customer(email) select email from customer;
select * from customer
Notice that vtgate is indefinitely stuck on -
1 @ 0x10031e2f4 0x100317478 0x10034bde0 0x1003eeb08 0x1003f1990 0x1003f1981 0x100492ff8 0x1004a3cf4 0x10051da10 0x10051dccc 0x100c4efcc 0x100c4f410 0x100c68a00 0x100c54674 0x100c54685 0x101019ef4 0x100fe5c70 0x100fe4ca8 0x100fe5900 0x100f26db4 0x100e72084 0x101200018 0x100e71fc4 0x1010059e0 0x100e71e44 0x1010000fc 0x101002320 0x101001fdc 0x100fffe84 0x100fed860 0x10100bf8c 0x100f26a1c
# 0x10034bddf internal/poll.runtime_pollWait+0x9f runtime/netpoll.go:306
# 0x1003eeb07 internal/poll.(*pollDesc).wait+0x27 internal/poll/fd_poll_runtime.go:84
# 0x1003f198f internal/poll.(*pollDesc).waitWrite+0x24f internal/poll/fd_poll_runtime.go:93
# 0x1003f1980 internal/poll.(*FD).Write+0x240 internal/poll/fd_unix.go:391
# 0x100492ff7 net.(*netFD).Write+0x27 net/fd_posix.go:96
# 0x1004a3cf3 net.(*conn).Write+0x33 net/net.go:195
# 0x10051da0f bufio.(*Writer).Flush+0x5f bufio/bufio.go:628
# 0x10051dccb bufio.(*Writer).Write+0xcb bufio/bufio.go:672
# 0x100c4efcb vitess.io/vitess/go/mysql.(*Conn).writePacket+0x18b vitess.io/vitess/go/mysql/conn.go:599
# 0x100c4f40f vitess.io/vitess/go/mysql.(*Conn).writeEphemeralPacket+0x7f vitess.io/vitess/go/mysql/conn.go:651
# 0x100c689ff vitess.io/vitess/go/mysql.(*Conn).writeRow+0x1af vitess.io/vitess/go/mysql/query.go:965
# 0x100c54673 vitess.io/vitess/go/mysql.(*Conn).writeRows+0x1b3 vitess.io/vitess/go/mysql/query.go:996
# 0x100c54684 vitess.io/vitess/go/mysql.(*Conn).execQuery.func1+0x1c4 vitess.io/vitess/go/mysql/conn.go:1396
# 0x101019ef3 vitess.io/vitess/go/vt/vtgate.(*VTGate).StreamExecute.func1+0x83 vitess.io/vitess/go/vt/vtgate/vtgate.go:526
# 0x100fe5c6f vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecute.func1.1+0x34f vitess.io/vitess/go/vt/vtgate/executor.go:289
# 0x100fe4ca7 vitess.io/vitess/go/vt/vtgate.(*streaminResultReceiver).storeResultStats+0xf7 vitess.io/vitess/go/vt/vtgate/executor.go:241
# 0x100fe58ff vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecute.func1.2+0x2f vitess.io/vitess/go/vt/vtgate/executor.go:304
# 0x100f26db3 vitess.io/vitess/go/vt/vtgate/engine.(*Route).streamExecuteShards.func1+0x193 vitess.io/vitess/go/vt/vtgate/engine/route.go:343
# 0x100e72083 vitess.io/vitess/go/vt/vttablet/queryservice.(*wrappedService).StreamExecute.func1.1+0x33 vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:198
# 0x101200017 vitess.io/vitess/go/vt/vttablet/grpctabletconn.(*gRPCQueryClient).StreamExecute+0x167 vitess.io/vitess/go/vt/vttablet/grpctabletconn/conn.go:190
# 0x100e71fc3 vitess.io/vitess/go/vt/vttablet/queryservice.(*wrappedService).StreamExecute.func1+0x103 vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:196
# 0x1010059df vitess.io/vitess/go/vt/vtgate.(*TabletGateway).withRetry+0x3bf vitess.io/vitess/go/vt/vtgate/tabletgateway.go:350
# 0x100e71e43 vitess.io/vitess/go/vt/vttablet/queryservice.(*wrappedService).StreamExecute+0x133 vitess.io/vitess/go/vt/vttablet/queryservice/wrapped.go:194
# 0x1010000fb vitess.io/vitess/go/vt/vtgate.(*ScatterConn).StreamExecuteMulti.func1+0x1db vitess.io/vitess/go/vt/vtgate/scatter_conn.go:408
# 0x10100231f vitess.io/vitess/go/vt/vtgate.(*ScatterConn).multiGoTransaction.func1+0x13f vitess.io/vitess/go/vt/vtgate/scatter_conn.go:640
# 0x101001fdb vitess.io/vitess/go/vt/vtgate.(*ScatterConn).multiGoTransaction+0x30b vitess.io/vitess/go/vt/vtgate/scatter_conn.go:660
# 0x100fffe83 vitess.io/vitess/go/vt/vtgate.(*ScatterConn).StreamExecuteMulti+0x253 vitess.io/vitess/go/vt/vtgate/scatter_conn.go:360
# 0x100fed85f vitess.io/vitess/go/vt/vtgate.(*Executor).StreamExecuteMulti+0x3f vitess.io/vitess/go/vt/vtgate/executor.go:1329
# 0x10100bf8b vitess.io/vitess/go/vt/vtgate.(*vcursorImpl).StreamExecuteMulti+0x1bb vitess.io/vitess/go/vt/vtgate/vcursor_impl.go:550
# 0x100f26a1b vitess.io/vitess/go/vt/vtgate/engine.(*Route).streamExecuteShards+0x1ab vitess.io/vitess/go/vt/vtgate/engine/route.go:342
and so it the vttablet -
1 @ 0x1047a24a4 0x1047b32c0 0x104be5f1c 0x104be5dd1 0x104c60f6c 0x1058e1034 0x1058ee0f8 0x105790d44 0x10577f0f8 0x1057856f8 0x10566f144 0x10562812c 0x10566f3ec 0x10566ed68 0x1057854cc 0x10577f058 0x105790ac4 0x10577ea0c 0x105796f9c 0x10579af40 0x105796cc0 0x105796b80 0x1058edfa0 0x1058e0fb4 0x104c551fc 0x104c56874 0x104c4ffa4 0x1047d6bf4
# 0x104be5f1b google.golang.org/grpc/internal/transport.(*writeQuota).get+0x1eb google.golang.org/grpc@v1.52.3/internal/transport/flowcontrol.go:59
# 0x104be5dd0 google.golang.org/grpc/internal/transport.(*http2Server).Write+0xa0 google.golang.org/grpc@v1.52.3/internal/transport/http2_server.go:1086
# 0x104c60f6b google.golang.org/grpc.(*serverStream).SendMsg+0x12b google.golang.org/grpc@v1.52.3/stream.go:1594
# 0x1058e1033 vitess.io/vitess/go/vt/proto/queryservice.(*queryStreamExecuteServer).Send+0x33 vitess.io/vitess/go/vt/proto/queryservice/queryservice_grpc.pb.go:769
# 0x1058ee0f7 vitess.io/vitess/go/vt/vttablet/grpcqueryservice.(*query).StreamExecute.func1+0x77 vitess.io/vitess/go/vt/vttablet/grpcqueryservice/server.go:68
# 0x105790d43 vitess.io/vitess/go/vt/vttablet/tabletserver.(*StreamConsolidator).Consolidate.func3+0xb3 vitess.io/vitess/go/vt/vttablet/tabletserver/stream_consolidator.go:165
# 0x10577f0f7 vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).Stream.func2.1+0x47 vitess.io/vitess/go/vt/vttablet/tabletserver/query_executor.go:345
# 0x1057856f7 vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).execStreamSQL.func1+0x67 vitess.io/vitess/go/vt/vttablet/tabletserver/query_executor.go:1083
# 0x10566f143 vitess.io/vitess/go/vt/vttablet/tabletserver/connpool.(*DBConn).Stream.func1+0x143 vitess.io/vitess/go/vt/vttablet/tabletserver/connpool/dbconn.go:227
# 0x10562812b vitess.io/vitess/go/vt/dbconnpool.(*DBConnection).ExecuteStreamFetch+0x2db vitess.io/vitess/go/vt/dbconnpool/connection.go:94
# 0x10566f3eb vitess.io/vitess/go/vt/vttablet/tabletserver/connpool.(*DBConn).streamOnce+0x14b vitess.io/vitess/go/vt/vttablet/tabletserver/connpool/dbconn.go:272
# 0x10566ed67 vitess.io/vitess/go/vt/vttablet/tabletserver/connpool.(*DBConn).Stream+0x147 vitess.io/vitess/go/vt/vttablet/tabletserver/connpool/dbconn.go:219
# 0x1057854cb vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).execStreamSQL+0x2db vitess.io/vitess/go/vt/vttablet/tabletserver/query_executor.go:1102
# 0x10577f057 vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).Stream.func2+0xf7 vitess.io/vitess/go/vt/vttablet/tabletserver/query_executor.go:337
# 0x105790ac3 vitess.io/vitess/go/vt/vttablet/tabletserver.(*StreamConsolidator).Consolidate+0x313 vitess.io/vitess/go/vt/vttablet/tabletserver/stream_consolidator.go:151
# 0x10577ea0b vitess.io/vitess/go/vt/vttablet/tabletserver.(*QueryExecutor).Stream+0x3cb vitess.io/vitess/go/vt/vttablet/tabletserver/query_executor.go:330
# 0x105796f9b vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).streamExecute.func1+0x24b vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:908
# 0x10579af3f vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).execRequest+0x51f vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:1473
# 0x105796cbf vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).streamExecute+0xcf vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:865
# 0x105796b7f vitess.io/vitess/go/vt/vttablet/tabletserver.(*TabletServer).StreamExecute+0xbf vitess.io/vitess/go/vt/vttablet/tabletserver/tabletserver.go:852
# 0x1058edf9f vitess.io/vitess/go/vt/vttablet/grpcqueryservice.(*query).StreamExecute+0x14f vitess.io/vitess/go/vt/vttablet/grpcqueryservice/server.go:67
# 0x1058e0fb3 vitess.io/vitess/go/vt/proto/queryservice._Query_StreamExecute_Handler+0xd3 vitess.io/vitess/go/vt/proto/queryservice/queryservice_grpc.pb.go:756
# 0x104c551fb google.golang.org/grpc.(*Server).processStreamingRPC+0xfbb google.golang.org/grpc@v1.52.3/server.go:1620
# 0x104c56873 google.golang.org/grpc.(*Server).handleStream+0x803 google.golang.org/grpc@v1.52.3/server.go:1708
# 0x104c4ffa3 google.golang.org/grpc.(*Server).serveStreams.func1.2+0x83 google.golang.org/grpc@v1.52.3/server.go:965
We had wireshark running too when we ran this test and we see that when the terminal is closed, we get a KILL QUERY
command as described in https://github.com/vitessio/vitess/issues/5497#issue-532352659, but vtgate just sends a syntax error for it.
The problem therefore is immediately evident, vtgate is indefinitely waiting on trying to send a packet on a connection which no longer has a listener.
This is very concerning since I would have thought that if a connection is closed from one side, it would automatically get the writes to fail...
In any case, we can fix the problem by not ignoring the KILL QUERY
packet that we get and use that to actually stop the execution of the query in question.
Overview of the Issue
CheckMySQL can get indefinitely stuck if there is a long-running callback in StreamExecute.
The way that
StreamExecute
rpc from vtgate works is as follows -vtgate and vttablet communicate via grpc and vttablets call
stream.Send
while vtgates call the counter-partstream.Recv
.vtgates after receiving a result set, run the callback function on it. Let's say that the callback function is a long-running one and would take a lot of time.
vttablet on the other hand would keep getting results from MySQL and call
stream.Send
repeatedly. According to the docs ofstream.Send
-even though SendMsg doesn't block for the receiver to receive the message, it does block for flow control. So if the streaming query requires a lot of results to be sent, the first few calls don't block, but the next one does, because of flow control.
If MySQL crashes now, StreamExecute continues to stay blocked. Now since StreamExecute is blocked on this call, it doesn't finish its execution of
execRequest
. This causes the wait for requests to be blocked until the callback function from vtgate doesn't complete. When it does, one more message is received, clearing more space in the buffer for StreamExecute. This causes StreamExecute to attempt to read one more packet from MySQL and finally fail unblocking everything.Reproduction Steps
select * from table
as the query and give it a very long callback. (Add time.Sleep to it, etc)Binary Version
Operating System and Environment details
Log Fragments
No response