I think this is where the initial "healthy" state comes from: https://github.com/vitessio/vitess/blob/ece6501975604ef5d0be3ab256145047e13d9c00/go/vt/vttablet/tabletserver/health_streamer.go#L212-L213
👋 all, we are seeing the same issue in production. Happy to collaborate on a fix!
@latentflip / @arthurschreiber I did some testing on:

I tested coredump behaviour by sending a SIGSEGV (11) signal to mysqld, causing it to dump out a 40GB+ core file, which took about a minute in our case.

During this time I gathered goroutine pprof profiles once a second (a sketch of how this can be done is at the end of this comment). In essentially every 1-second sample, up until the time the tablet became `NOT_SERVING`, this stack is seen blocking:
1 @ 0x438856 0x430e17 0x463a85 0x4edd32 0x4ef09a 0x4ef088 0x5b4249 0x5c5585 0x571a34 0x4a919a 0xf2c34f 0xf2c32e 0xf2c5a8 0xf4585f 0xf44446 0xf4401d 0x1051d69 0x1051d5f 0x1096cc5 0x109ac13 0x133dc6b 0x133db1d 0x104cbe3 0x4694a1
# 0x463a84 internal/poll.runtime_pollWait+0x84 runtime/netpoll.go:303
# 0x4edd31 internal/poll.(*pollDesc).wait+0x31 internal/poll/fd_poll_runtime.go:84
# 0x4ef099 internal/poll.(*pollDesc).waitRead+0x259 internal/poll/fd_poll_runtime.go:89
# 0x4ef087 internal/poll.(*FD).Read+0x247 internal/poll/fd_unix.go:167
# 0x5b4248 net.(*netFD).Read+0x28 net/fd_posix.go:56
# 0x5c5584 net.(*conn).Read+0x44 net/net.go:183
# 0x571a33 bufio.(*Reader).Read+0x1b3 bufio/bufio.go:227
# 0x4a9199 io.ReadAtLeast+0x99 io/io.go:328
# 0xf2c34e io.ReadFull+0x4e io/io.go:347
# 0xf2c32d vitess.io/vitess/go/mysql.(*Conn).readHeaderFrom+0x2d vitess.io/vitess/go/mysql/conn.go:353
# 0xf2c5a7 vitess.io/vitess/go/mysql.(*Conn).readEphemeralPacket+0x87 vitess.io/vitess/go/mysql/conn.go:391
# 0xf4585e vitess.io/vitess/go/mysql.(*Conn).readComQueryResponse+0x5e vitess.io/vitess/go/mysql/query.go:499
# 0xf44445 vitess.io/vitess/go/mysql.(*Conn).ReadQueryResult+0x65 vitess.io/vitess/go/mysql/query.go:355
# 0xf4401c vitess.io/vitess/go/mysql.(*Conn).ExecuteFetchMulti+0xbc vitess.io/vitess/go/mysql/query.go:324
# 0x1051d68 vitess.io/vitess/go/mysql.(*Conn).ExecuteFetch+0x28 vitess.io/vitess/go/mysql/query.go:303
# 0x1051d5e vitess.io/vitess/go/vt/dbconnpool.(*DBConnection).ExecuteFetch+0x1e vitess.io/vitess/go/vt/dbconnpool/connection.go:47
# 0x1096cc4 vitess.io/vitess/go/vt/mysqlctl.getPoolReconnect+0x64 vitess.io/vitess/go/vt/mysqlctl/query.go:40
# 0x109ac12 vitess.io/vitess/go/vt/mysqlctl.(*Mysqld).ReplicationStatus+0x92 vitess.io/vitess/go/vt/mysqlctl/replication.go:274
# 0x133dc6a vitess.io/vitess/go/vt/vttablet/tabletmanager.(*replManager).checkActionLocked+0x8a vitess.io/vitess/go/vt/vttablet/tabletmanager/replmanager.go:106
# 0x133db1c vitess.io/vitess/go/vt/vttablet/tabletmanager.(*replManager).check+0x7c vitess.io/vitess/go/vt/vttablet/tabletmanager/replmanager.go:102
# 0x104cbe2 vitess.io/vitess/go/timer.(*Timer).run+0x122 vitess.io/vitess/go/timer/timer.go:112
TL;DR: I think your proposal to add timeouts should resolve this
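For anyone wanting to repeat the sampling, here is a minimal sketch of how the per-second goroutine profiles can be gathered. It assumes the standard `/debug/pprof/goroutine` endpoint is exposed on the vttablet's debug HTTP port; the address and file names are hypothetical:

```go
// goroutine_sampler.go: poll the goroutine profile once a second and write
// each sample to its own file. Address and output paths are hypothetical.
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

func main() {
	// Assumed vttablet debug HTTP address; adjust for your environment.
	const target = "http://127.0.0.1:15100/debug/pprof/goroutine?debug=1"

	for i := 0; ; i++ {
		resp, err := http.Get(target)
		if err != nil {
			fmt.Fprintln(os.Stderr, "fetch failed:", err)
		} else {
			out, err := os.Create(fmt.Sprintf("goroutines-%03d.txt", i))
			if err == nil {
				io.Copy(out, resp.Body)
				out.Close()
			}
			resp.Body.Close()
		}
		time.Sleep(time.Second)
	}
}
```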
I will make a PR to try this in the next few days 👍
On 2nd thought, there are two ways this could be resolved:

1. Pass a `context.WithTimeout(...)` or `context.WithDeadline(...)` into `ReplicationStatus`
   - Pros: easy change
   - Cons:
     - What timeout do we use to fit all scenarios?
     - Might need a new flag like `--mysqlctl-query-timeout` if a hardcoded timeout doesn't work
     - Dangerous to be wrong about - causes `NOT_SERVING` state

2. Assuming an exclusive connection, do a `.SetReadDeadline()` on the underlying MySQL `net.Conn`
   - Pros:
     - Applies to a single buffered `.Read()`, making it a bit easier to guess a timeout as it only applies to a single read vs an entire operation
     - This would time out higher in the stack at `net.(*conn).Read+0x44` (based on stack from last reply)
   - Cons:
     - The `net.Conn` is buried under a "db pool", then a "db connection", then a "mysql connection". It's not obvious how to set `.SetReadDeadline()` from so high above (see the sketch below)

Thoughts appreciated!
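To make option 2 concrete, here is a minimal, self-contained sketch (not Vitess code) showing how a read deadline turns an indefinitely blocked `Read()` into a timeout error. The address and the 5-second value are purely illustrative, and in vttablet the `net.Conn` would still have to be reached through the pool/connection layers mentioned above:

```go
// read_deadline_sketch.go: NOT Vitess code. Shows the mechanism behind
// option 2: a read deadline makes a blocked Read return a timeout error.
package main

import (
	"errors"
	"fmt"
	"net"
	"os"
	"time"
)

func main() {
	// Hypothetical MySQL address; in vttablet this net.Conn is buried
	// inside the dbconnpool / mysql.Conn layers.
	conn, err := net.Dial("tcp", "127.0.0.1:3306")
	if err != nil {
		fmt.Fprintln(os.Stderr, "dial:", err)
		return
	}
	defer conn.Close()

	// Arm a deadline that applies to subsequent Reads on this conn. If the
	// server never sends anything (e.g. mysqld is wedged writing a core
	// dump), the Read below returns a timeout error instead of blocking.
	_ = conn.SetReadDeadline(time.Now().Add(5 * time.Second))

	buf := make([]byte, 4) // a MySQL packet header is 4 bytes
	if _, err := conn.Read(buf); err != nil {
		var nerr net.Error
		if errors.As(err, &nerr) && nerr.Timeout() {
			fmt.Println("read timed out - mysqld looks unresponsive")
			return
		}
		fmt.Fprintln(os.Stderr, "read:", err)
	}
}
```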
Vitess has query-killer logic which could be reused here. It is an external goroutine that watches the connection to check whether it is still executing when the timeout fires. I do not think it is easily available to use here, but a similar thing can be done on a context timeout.

That query timeout is available on `dbconn`, but the connection here is a `DBConnection`, which is different from `dbconn`. `dbconn` has an implementation that takes the context deadline into account: go/vt/vttablet/tabletserver/connpool/dbconn.go:setDeadline. I am not sure why we have two different implementations of a MySQL connection. I believe the same logic can be borrowed here, and it would be even better if we could merge those two implementations.
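As a rough illustration of that idea (this is a sketch, not the actual `dbconn.setDeadline` code, and the types and names are hypothetical), a watcher can run the blocking fetch in one goroutine and force-close the connection if the context deadline expires first:

```go
// Sketch of a deadline watcher in the spirit of the query-killer /
// setDeadline approach described above. All names are hypothetical.
package replwatch

import "context"

// Conn stands in for whatever type wraps the MySQL socket; Close must
// unblock an in-flight read, as closing a net.Conn does.
type Conn interface {
	ExecuteFetch(query string) (result any, err error)
	Close() error
}

// fetchWithDeadline runs a blocking query and force-closes the connection
// if ctx expires first, so the caller never hangs indefinitely.
func fetchWithDeadline(ctx context.Context, conn Conn, query string) (any, error) {
	type outcome struct {
		res any
		err error
	}
	done := make(chan outcome, 1)

	go func() {
		res, err := conn.ExecuteFetch(query)
		done <- outcome{res, err}
	}()

	select {
	case out := <-done:
		return out.res, out.err
	case <-ctx.Done():
		// Closing unblocks the goroutine above; the connection must be
		// discarded rather than returned to the pool.
		conn.Close()
		return nil, ctx.Err()
	}
}
```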
Overview of the Issue
We had an issue today where a tablet's serving state would transition true => false, but then it would immediately be transitioned back from false => true.
Cause
I believe @dm-2 and I have tracked this down to being caused by a few things:
1. A `context.TODO()` is passed to `getPoolReconnect`: https://github.com/vitessio/vitess/blob/a36de2c036e9ec9055ec11220ffabe6b1ed13e5d/go/vt/mysqlctl/replication.go#L326 and https://github.com/vitessio/vitess/blob/a36de2c036e9ec9055ec11220ffabe6b1ed13e5d/go/vt/mysqlctl/query.go#L34-L40. `conn.ExecuteFetch` will then block indefinitely if the MySQL host is in a bad state after the crash while it's doing a core-dump.
2. The `tablet_health_check` code will detect that it's not seen an updated serving status come in from the vttablet for its timeout (in our case 1 minute) and so will transition the vtgate's own state to serving false.
3. It calls `closeConnection` to transition the tablet to serving false here: https://github.com/vitessio/vitess/blob/7fafc94671d746d4827a4b98e37f65036fd9bb29/go/vt/discovery/tablet_health_check.go#L334-L336
4. The vtgate then transitions the tablet back to serving true and starts routing traffic to it again.
5. The `Replication()` call from (1) holds the `sm.mu.Lock()` the whole time, which causes other code that relies on taking the lock to be blocked indefinitely, like `/debug/vars`, which cannot get the current `IsServing` state because of the lock (illustrated below): https://github.com/vitessio/vitess/blob/7fafc94671d746d4827a4b98e37f65036fd9bb29/go/vt/vttablet/tabletserver/state_manager.go#L678-L683
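On point 5, here is a tiny self-contained illustration (not Vitess code; all names are made up) of why holding a mutex across a blocking call starves every other user of that lock:

```go
// lock_starvation_sketch.go: NOT Vitess code; made-up names. Holding a
// mutex across a blocking call starves every other caller of that lock.
package main

import (
	"fmt"
	"sync"
	"time"
)

type stateManager struct {
	mu      sync.Mutex
	serving bool
}

// checkReplication mimics the health-check path: it takes the lock and then
// performs a "network call" that does not return while mysqld is wedged.
func (sm *stateManager) checkReplication() {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	time.Sleep(time.Hour) // stand-in for an ExecuteFetch with no deadline
}

// IsServing mimics the /debug/vars path: it also needs the lock, so it
// blocks for as long as checkReplication holds it.
func (sm *stateManager) IsServing() bool {
	sm.mu.Lock()
	defer sm.mu.Unlock()
	return sm.serving
}

func main() {
	sm := &stateManager{serving: true}
	go sm.checkReplication()
	time.Sleep(100 * time.Millisecond) // let the health check grab the lock
	fmt.Println("asking IsServing()...")
	fmt.Println(sm.IsServing()) // blocks until checkReplication releases the lock
}
```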
Possible fixes:

It seems like the connection setup and query execution to mysql in `getPoolReconnect` need a timeout to prevent them blocking indefinitely, and to set the vttablet's state to not serving if it cannot fetch replication state from mysql. I think that would resolve all the issues here?
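For illustration only, here is a simplified sketch of what that caller-side change could look like. The names, signatures, and 10-second value are hypothetical, not the real mysqlctl code, and the context would still need to be honoured all the way down at the socket read:

```go
// timeout_sketch.go: hypothetical, simplified shape of the proposed fix.
// Names and signatures are illustrative, not the real mysqlctl code.
package mysqlctlsketch

import (
	"context"
	"time"
)

// replicationStatusQueryTimeout could be wired to a flag such as the
// proposed --mysqlctl-query-timeout instead of being hardcoded.
const replicationStatusQueryTimeout = 10 * time.Second

// getPoolReconnect stands in for the real helper; the important part is
// that it receives a context that can actually expire, and that the
// expiry is honoured all the way down at the network read.
func getPoolReconnect(ctx context.Context, query string) (result any, err error) {
	// ... get a connection from the pool and execute `query`,
	// cancelling the work when ctx is done ...
	return nil, nil
}

// replicationStatus shows the caller-side change: a bounded context instead
// of context.TODO(), so a wedged mysqld cannot block this call forever.
func replicationStatus() (any, error) {
	ctx, cancel := context.WithTimeout(context.Background(), replicationStatusQueryTimeout)
	defer cancel()
	return getPoolReconnect(ctx, "SHOW REPLICA STATUS")
}
```

As discussed in the thread, the hard part is picking a timeout that fits all scenarios (possibly via a flag like the proposed `--mysqlctl-query-timeout`) and making sure an expired context actually unblocks the underlying read rather than being ignored.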
Reproduction Steps

I don't know how to reliably reproduce this with clear steps. I'm guessing the problem would be visible if we crashed mysql in a non-graceful way such that it didn't fully terminate connections (or whatever happens while it's doing a core dump). I think the analysis above covers it though.
Binary Version
Operating System and Environment details
Log Fragments
We then see the exact same pattern repeated one minute later, and every minute, until the core-dump completes and the host restarts.