Bug Report: VTTablet & VTGate routing traffic to hung MySQL server

latentflip commented 1 year ago

Overview of the Issue

We had an issue today where:

MySQL (5.7) crashed due to a corrupt memory error on the machine
This triggered a core dump, which due to the large amount of memory that needed to be dumped took around 19 minutes
During that time VTGate continued to route traffic to the VTTablet, and those queries hung indefinitely.
Every minute the VTGate would report that it was transitioning the VTTablet from serving true => false but then it would immediately transition it back from false => true
As soon as the core-dump was complete (19 minutes later), and MySQL restarted, the VTTablet spotted that replication lag was now high and took the host offline until it caught up.
Once replication lag caught up, the host was moved back to serving, and things were all :+1:

Cause

I believe @dm-2 and I have tracked this down to being caused by a few things:

The code that checks the replication status of the MySQL server passes a context.TODO() to getPoolReconnect: https://github.com/vitessio/vitess/blob/a36de2c036e9ec9055ec11220ffabe6b1ed13e5d/go/vt/mysqlctl/replication.go#L326 and https://github.com/vitessio/vitess/blob/a36de2c036e9ec9055ec11220ffabe6b1ed13e5d/go/vt/mysqlctl/query.go#L34-L40.
- As far as I can tell there are no cancellations or timeouts on this code. I believe that means the conn.ExecuteFetch will then block indefinitely if the MySQL host is in a bad state after the crash while it's doing a core-dump
- The fact that this code blocks, then prevents repltracker.Status() from completing, which prevents the health state of the vttablet updating: lag, err := sm.rt.Status(), and prevents the updates state being broadcast back to the vtgate streamHealth.
On its own, this bug shouldn't be a huge deal, as the vtgate tablet_health_check code will detect that it's not seen an updated serving status come in from the vttablet for it's timeout (in our case 1 minute) and so will transition the vtgate's own state to serving false.
- This does indeed happen, and we see in the logs that we called closeConnection to transition the tablet to serving false here: https://github.com/vitessio/vitess/blob/7fafc94671d746d4827a4b98e37f65036fd9bb29/go/vt/discovery/tablet_health_check.go#L334-L336
- However, after a short delay we will retry, and we will create a new StreamHealth connection to the vttablet to try again. When the initial connection is made, vttablet immediately sends back the "current" health state of the vttablet: https://github.com/vitessio/vitess/blob/a36de2c036e9ec9055ec11220ffabe6b1ed13e5d/go/vt/vttablet/tabletserver/health_streamer.go#L211-L212. This state will still say that the tablet is healthy, as what we learned from (1) is that we are blocked on setting the state to unhealthy due to the unresponsive mysql server.
- The vtgate then sees this "healthy" response, and transitions it's view of the vttablet back to serving true and starts routing traffic to it again.
Another side-effect of this we spotted is that the Replication() call from (1) holds the sm.mu.Lock() the whole time, which causes other code to be blocked indefinitely that relies on taking the lock, like /vars/debug which cannot get the current IsServing state because of the lock: https://github.com/vitessio/vitess/blob/7fafc94671d746d4827a4b98e37f65036fd9bb29/go/vt/vttablet/tabletserver/state_manager.go#L678-L683

Possible fixes:

It seems like the connection setup, and query execution to mysql in getPoolReconnect needs a timeout to prevent it blocking indefinitely, and to set the vttablet's state to not serving if it cannot fetch replication state from mysql. I think that would resolve all the issues here?

Reproduction Steps

I don't know how to reliably reproduce this with clear steps. I'm guessing the problem would be visible if we crashed mysql in a non-graceful way such that it didn't fully terminate connections (or whatever happens while it's doing a core dump).

I think the analysis above covers it though.

Binary Version

Server version: 5.7.32-vitess Version: 14.0.1 (Git revision 631084ae79181ba816ba2d98bee07c16d8b2f7b4 branch 'master') built on Mon Nov 21 16:30:24 UTC 2022 by root@bcaa51ae028b using go1.18.4 linux/amd64

Operating System and Environment details

I don't know this off-hand but I don't know how pertinent it is given the details above?

Log Fragments

W1205 10:17:52.962960       1 tablet_health_check.go:335] tablet alias:{cell:"ash1" uid:171471394} hostname:"<crashed host>" port_map:{key:"grpc" value:15991} port_map:{key:"vt" value:15101} keyspace:"x_ks" shard:"c0-" key_range:{start:"\xc0"} type:REPLICA db_name_override:"x" mysql_hostname:"<crashed host>" mysql_port:3306 db_server_version:"0.0.0" default_conn_collation:45 healthcheck stream error: Code: CANCELED

I1205 10:17:52.963050       1 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: ash1-171471394 (<crashed host>) serving true => false for x_ks/c0- (REPLICA) reason: vttablet: rpc error: code = Canceled desc = context canceled

I1205 10:17:53.051535       1 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: ash1-171471394 (<crashed host>) serving false => true for x_ks/c0- (REPLICA) reason: healthCheck update

we then see the exact same pattern repeated one minute later, and every minute, until the core-dump completes and the host restarts.

arthurschreiber commented 1 year ago

I think this is where the initial "healthy" state comes from: https://github.com/vitessio/vitess/blob/ece6501975604ef5d0be3ab256145047e13d9c00/go/vt/vttablet/tabletserver/health_streamer.go#L212-L213

timvaillancourt commented 1 year ago

👋 all, we are seeing the same issue in production. Happy to collaborate on a fix!

timvaillancourt commented 1 year ago

@latentflip / @arthurschreiber I did some testing on:

5.7 (Percona Server)
slackhq/vitess branch based on the vitessio/vitess v12.0.5 tag + patches unrelated to this codepath

And I tested coredump behaviour by sending a SIGSEGV (11) signal to mysqld, causing it to dump out a 40GB+ core file for about a minute in our case

During this time I gathered goroutine pprof profiles once a second. In essentially every 1 second samples, up until the time the tablet became NOT_SERVING, this stack is seen blocking:

1 @ 0x438856 0x430e17 0x463a85 0x4edd32 0x4ef09a 0x4ef088 0x5b4249 0x5c5585 0x571a34 0x4a919a 0xf2c34f 0xf2c32e 0xf2c5a8 0xf4585f 0xf44446 0xf4401d 0x1051d69 0x1051d5f 0x1096cc5 0x109ac13 0x133dc6b 0x133db1d 0x104cbe3 0x4694a1
#       0x463a84        internal/poll.runtime_pollWait+0x84                                                     runtime/netpoll.go:303
#       0x4edd31        internal/poll.(*pollDesc).wait+0x31                                                     internal/poll/fd_poll_runtime.go:84
#       0x4ef099        internal/poll.(*pollDesc).waitRead+0x259                                                internal/poll/fd_poll_runtime.go:89
#       0x4ef087        internal/poll.(*FD).Read+0x247                                                          internal/poll/fd_unix.go:167
#       0x5b4248        net.(*netFD).Read+0x28                                                                  net/fd_posix.go:56
#       0x5c5584        net.(*conn).Read+0x44                                                                   net/net.go:183
#       0x571a33        bufio.(*Reader).Read+0x1b3                                                              bufio/bufio.go:227
#       0x4a9199        io.ReadAtLeast+0x99                                                                     io/io.go:328
#       0xf2c34e        io.ReadFull+0x4e                                                                        io/io.go:347
#       0xf2c32d        vitess.io/vitess/go/mysql.(*Conn).readHeaderFrom+0x2d                                   vitess.io/vitess/go/mysql/conn.go:353
#       0xf2c5a7        vitess.io/vitess/go/mysql.(*Conn).readEphemeralPacket+0x87                              vitess.io/vitess/go/mysql/conn.go:391
#       0xf4585e        vitess.io/vitess/go/mysql.(*Conn).readComQueryResponse+0x5e                             vitess.io/vitess/go/mysql/query.go:499
#       0xf44445        vitess.io/vitess/go/mysql.(*Conn).ReadQueryResult+0x65                                  vitess.io/vitess/go/mysql/query.go:355
#       0xf4401c        vitess.io/vitess/go/mysql.(*Conn).ExecuteFetchMulti+0xbc                                vitess.io/vitess/go/mysql/query.go:324
#       0x1051d68       vitess.io/vitess/go/mysql.(*Conn).ExecuteFetch+0x28                                     vitess.io/vitess/go/mysql/query.go:303
#       0x1051d5e       vitess.io/vitess/go/vt/dbconnpool.(*DBConnection).ExecuteFetch+0x1e                     vitess.io/vitess/go/vt/dbconnpool/connection.go:47
#       0x1096cc4       vitess.io/vitess/go/vt/mysqlctl.getPoolReconnect+0x64                                   vitess.io/vitess/go/vt/mysqlctl/query.go:40
#       0x109ac12       vitess.io/vitess/go/vt/mysqlctl.(*Mysqld).ReplicationStatus+0x92                        vitess.io/vitess/go/vt/mysqlctl/replication.go:274
#       0x133dc6a       vitess.io/vitess/go/vt/vttablet/tabletmanager.(*replManager).checkActionLocked+0x8a     vitess.io/vitess/go/vt/vttablet/tabletmanager/replmanager.go:106
#       0x133db1c       vitess.io/vitess/go/vt/vttablet/tabletmanager.(*replManager).check+0x7c                 vitess.io/vitess/go/vt/vttablet/tabletmanager/replmanager.go:102
#       0x104cbe2       vitess.io/vitess/go/timer.(*Timer).run+0x122                                            vitess.io/vitess/go/timer/timer.go:112

TL;DR: I think your proposal to add timeouts should resolve this

I will make a PR to try this in the next few days 👍

timvaillancourt commented 1 year ago

I will make a PR to try this in the next few days 👍

On 2nd thought, there are two ways this could be resolved:

Pass a context.WithTimeout(...) or context.WithDeadline(...) into ReplicationStatus
- Pros: easy change
- Cons:
  - What timeout do we use to fit all scenarios?
  - Might need a new flag like --mysqlctl-query-timeout if hardcoded timeout doesn't work
  - Dangerous to be wrong about - causes NOT_SERVING state
Assuming an exclusive connection, do a .SetReadDeadline() on the underlying MySQL net.Conn
- Pros:
  - Applies to single buffered-.Read(), making it a bit easier to guess a timeout as it only applies to a single read vs an entire operation
  - This would timeout higher in the stack at: net.(*conn).Read+0x44 (based on stack from last reply)
- Cons: net.Conn is buried under a "db pool", then a "db connection", then a "mysql connection". It's not obvious how to set .SetReadDeadline() from so high above

Thoughts appreciated!

harshit-gangal commented 1 year ago

I will make a PR to try this in the next few days 👍

On 2nd thought, there are two ways this could be resolved:

Pass a context.WithTimeout(...) or context.WithDeadline(...) into ReplicationStatus

Pros: easy change

Cons:

What timeout do we use to fit all scenarios?

Might need a new flag like --mysqlctl-query-timeout if hardcoded timeout doesn't work

Dangerous to be wrong about - causes NOT_SERVING state

Assuming an exclusive connection, do a .SetReadDeadline() on the underlying MySQL net.Conn

Pros:

Applies to single buffered-.Read(), making it a bit easier to guess a timeout as it only applies to a single read vs an entire operation

This would timeout higher in the stack at: net.(*conn).Read+0x44 (based on stack from last reply)

Cons: net.Conn is buried under a "db pool", then a "db connection", then a "mysql connection". It's not obvious how to set .SetReadDeadline() from so high above

Thoughts appreciated!

Vitess has query killer logic which can be reused here. It is an external go routine that needs to watch on the conn if it is still executing on the timeout. I do not think it is easily available to use here. But, a similar thing can be done on context timeout.

This query timeout is available on dbconn but this connection over here is DBConnection different from dbconn. dbconn has implementation taking account for context deadline. go/vt/vttablet/tabletserver/connpool/dbconn.go:setDeadline. I am not sure why we have 2 different implementations of having a MySQL connection. I believe same logic can be borrowed here and even better if we can merge those two implementations.

vitessio / vitess