This issue was closed by vmg 2 years ago.
More thoughts on proactively detecting re-sharding operations and failovers:
As explained in the previous comment, right now there's a shardBuffer data structure that is lazily allocated for every keyspace+shard pair on first access, and whose buffering logic is only activated the first time there's an error in the destination shard. This design doesn't seem feasible once we lift this logic higher in the stack.
I think the solution we're looking for is a global map (one per vtgate) of "cluster events" that is updated in real time. Each event would contain information about which keyspaces and shards are affected, and auxiliary data structures would then let us, when a query plan comes in, efficiently detect whether any of the shards where the plan would execute are currently affected by an event, and if so, buffer the plan before attempting it. Likewise, the event itself would keep a goroutine actively polling the health of the cluster so that it can detect its own resolution; when that happens, it would flush all the currently buffered requests that were waiting on the event.
Does this seem reasonable? I think the hardest problem about this is the case where we don't proactively detect an event in the cluster, and instead we receive a failover error from one of the tablets. Converting this tablet (and hence shard-) specific error into a global event that could potentially affect more than one shard seems complicated. How would we handle this?
DiscoveryGateway is deprecated, replaced with TabletGateway. The logic in withRetry is pretty much identical in the two, so this doesn't make a material difference to the design. We probably don't want to implement the new buffering in DiscoveryGateway, which might require us to keep the current shard_buffer logic around (or maybe we go ahead and delete the deprecated gateway now).
The healthcheck module tracks tablets per Target - i.e. keyspace/shard/tablet_type. When a failover succeeds, the promotion event is sent to subscribers of this healthcheck module. This is probably the part of the code to look at next to see if it can be adapted easily to provide the information we need to detect the end of a resharding cutover: https://github.com/vitessio/vitess/blob/main/go/vt/discovery/healthcheck.go#L514-L523
There are currently two code paths for query execution (Execute and StreamExecute); both should be covered by buffering.
It is OK to create the event just for the shard whose tablet received the error. This way the event will accumulate on its own as errors are received from more tablets.
DiscoveryGateway is deprecated, replaced with TabletGateway. The logic in withRetry is pretty much identical in the two, so this doesn't make a material difference to the design. We probably don't want to implement the new buffering in DiscoveryGateway which might require us to keep the current shard_buffer logic around (or maybe we go ahead and delete the deprecated gateway now).
Thank you for pointing this out! Yes, the logic for retries looks copy-and-pasted between the two Gateways, so let's scope this down to TabletGateway. I'll update the issue accordingly.
OK, I've spent this afternoon looking at the health checking code, and I have questions about it. Using this comment from @rohit-nayak-ps as a reference, here's how he believes we could model the two events (the start and end of a resharding):
To detect the start of a resharding, we would check the SrvKeyspace in the topo.Server for the keyspace where the query's plan will execute (a query plan will only execute in a single keyspace -- is this correct?). So, to sketch this out:

srvKeyspace, err := ts.GetSrvKeyspace(ctx, cell, keyspace)
if err != nil {
	return err
}
for _, partition := range srvKeyspace.GetPartitions() {
	for _, shardTabletControl := range partition.GetShardTabletControls() {
		if shardTabletControl.QueryServiceDisabled {
			// this query needs to buffer
		}
	}
}
Questions about this: we clearly know the keyspace for a plan, but what's its cell? Do we need to use GetSrvKeyspaceAllCells? Is this even something that we could run once per query? This would be powered by ResilientServer's caching, because reaching into the topo.Server directly (i.e. etcd) for every single call would be unfeasible. Is the caching here going to lead to races, though?
The only place where WatchSrvKeyspace is being called right now is from the ResilientServer -- it's used for proactively fetching keyspace information so that GetSrvKeyspace returns up-to-date information. This usage of WatchSrvKeyspace, however, does not expose the actual "diff" against the previous version of the keyspace metadata, which is what we would need to detect changes in ShardReferences. Is this functionality you think we should bolt on top of ResilientServer, or do we need a separate implementation, even though that would double the number of Watch connections into the upstream topo server (etcd, ...) for each vtgate?
There might be a simpler way to detect cutover start and stop. In the original comment @rohit-nayak-ps says that when cutover starts we transition the primary of the old shard to NON_SERVING, and when cutover ends we transition the primaries of new shards to SERVING. Serving state is available in healthcheck and we could use that to notify the buffering code. This will side-step the topo/cell issue altogether.
when cutover ends we transition the primaries of new shards to SERVING.
I've been playing with this and it seems complicated, because the primaries of the new shards are transitioned asynchronously. The HealthCheck API lets us subscribe to all new tablet health events through a channel, but these events come through one by one.
Say that we're attempting to run a query that targets shard X: we see that the primary of shard X is currently not serving, so we start buffering the query. Since we receive shard health updates one by one, we'll eventually receive a health update for a shard we've never seen before (the result of the resharding), which won't be serving, and then another update that marks that shard as serving. The issue here is that resharding will usually (always?) result in more than one new shard, but we do not know how many new shards we are expecting to see in the HealthCheck stream.
The best way to work around this issue seems to be guessing when the shards we've seen cover the whole keyspace, by parsing the shard names and putting them together -- I am just not sure whether this is always safe to do for all sharding cases (and whether we have existing code in Vitess to parse shard names and check that they add up to the whole keyspace).
There are some KeyRange/Shard utilities here: https://github.com/vitessio/vitess/blob/main/go/vt/key/key.go and here: https://github.com/vitessio/vitess/blob/main/go/vt/topotools/split.go
The issue here is that re-sharding will usually (always?) result in more than one new shard, but we do not know how many new shards we are expecting to see in the HealthCheck stream.
Eventually we also need to worry about "Merging" shards where we may end up with fewer shards (as few as one).
Eventually we also need to worry about "Merging" shards where we may end up with fewer shards (as few as one).
Right, I thought about this use case, and from my understanding, the resulting shard names for the merge should also add up to the whole keyspace. Is there any other sharding operation that could result in a partial keyspace and hence couldn't be detected by this heuristic?
At the keyspace level it should be safe to decide when we have access to the whole keyspace by parsing and putting together the names of the shards.
There are people doing large reshards where they break it up into individual vreplication streams and manage the cutovers themselves. This won't work for those cases. We might have to explicitly warn that buffering should not be used in those cases.
This also means that we should introduce a new flag for buffering during resharding cutovers, and not trigger it using the existing buffering flag enable_buffer.
I think we should put this information up front in the topo server, so that we do not have to assume it. Whenever such an operation happens, it is stored in the topo server to be discovered by the buffering code.
Following up on the feedback on #8514, I've been researching other approaches to detect these sharding operations proactively that do not depend on the topology server.
@harshit-gangal suggested to run the example resharding workflows in the Vitess examples locally and analyze the resulting health check event patterns to see how much information we can gather from it. It's not looking good.
For shard initialization:
for i in 300 301 302; do
CELL=zone1 TABLET_UID=$i ./scripts/mysqlctl-up.sh
SHARD=-80 CELL=zone1 KEYSPACE=customer TABLET_UID=$i ./scripts/vttablet-up.sh
done
for i in 400 401 402; do
CELL=zone1 TABLET_UID=$i ./scripts/mysqlctl-up.sh
SHARD=80- CELL=zone1 KEYSPACE=customer TABLET_UID=$i ./scripts/vttablet-up.sh
done
vtctldclient InitShardPrimary --force customer/-80 zone1-300
vtctldclient InitShardPrimary --force customer/80- zone1-400
Here are the actual health status changes as seen on the vtgate:
I0726 13:15:32.928007 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-401]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":401},"hostname":"three-seagrass-linux","port_map":{"grpc":16401,"vt":15401},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17401},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928001 40734 healthcheck.go:328] Adding tablet to healthcheck: alias:{cell:"zone1" uid:400} hostname:"three-seagrass-linux" port_map:{key:"grpc" value:16400} port_map:{key:"vt" value:15400} keyspace:"customer" shard:"80-" key_range:{start:"\x80"} type:REPLICA mysql_hostname:"three-seagrass-linux" mysql_port:17400
I0726 13:15:32.928025 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-301]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":301},"hostname":"three-seagrass-linux","port_map":{"grpc":16301,"vt":15301},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17301},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928042 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-300]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":300},"hostname":"three-seagrass-linux","port_map":{"grpc":16300,"vt":15300},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17300},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928056 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-400]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":400},"hostname":"three-seagrass-linux","port_map":{"grpc":16400,"vt":15400},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17400},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928717 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-400 (three-seagrass-linux) serving false => false for customer/80- (REPLICA) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:32.928717 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-302 (three-seagrass-linux) serving false => false for customer/-80 (RDONLY) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:32.928739 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-301 (three-seagrass-linux) serving false => false for customer/-80 (REPLICA) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:32.928743 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-401 (three-seagrass-linux) serving false => false for customer/80- (REPLICA) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:32.928797 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-302]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":302},"hostname":"three-seagrass-linux","port_map":{"grpc":16302,"vt":15302},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17302},"Target":{"keyspace":"customer","shard":"-80","tablet_type":3},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928808 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-300 (three-seagrass-linux) serving false => false for customer/-80 (REPLICA) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:33.900173 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-300 (three-seagrass-linux) serving false => true for customer/-80 (MASTER) reason: healthCheck update
E0726 13:15:33.900217 40734 healthcheck.go:483] Adding 1 to MasterPromoted counter for target: keyspace:"customer" shard:"-80" tablet_type:REPLICA, tablet: zone1-0000000300, tabletType: MASTER
I0726 13:15:33.900700 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-300]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":300},"hostname":"three-seagrass-linux","port_map":{"grpc":16300,"vt":15300},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17300},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":300},"hostname":"three-seagrass-linux","port_map":{"grpc":16300,"vt":15300},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17300},"Target":{"keyspace":"customer","shard":"-80","tablet_type":1},"MasterTermStartTime":1627298133,"Serving":true}
I0726 13:15:34.508368 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-400 (three-seagrass-linux) serving false => true for customer/80- (MASTER) reason: healthCheck update
E0726 13:15:34.508395 40734 healthcheck.go:483] Adding 1 to MasterPromoted counter for target: keyspace:"customer" shard:"80-" tablet_type:REPLICA, tablet: zone1-0000000400, tabletType: MASTER
I0726 13:15:34.508503 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-400]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":400},"hostname":"three-seagrass-linux","port_map":{"grpc":16400,"vt":15400},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17400},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":400},"hostname":"three-seagrass-linux","port_map":{"grpc":16400,"vt":15400},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17400},"Target":{"keyspace":"customer","shard":"80-","tablet_type":1},"MasterTermStartTime":1627298134,"Serving":true}
I0726 13:15:34.594006 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-301 (three-seagrass-linux) serving false => true for customer/-80 (REPLICA) reason: healthCheck update
I0726 13:15:34.594076 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-301]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":301},"hostname":"three-seagrass-linux","port_map":{"grpc":16301,"vt":15301},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17301},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":301},"hostname":"three-seagrass-linux","port_map":{"grpc":16301,"vt":15301},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17301},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":true}
I0726 13:15:35.098106 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-302 (three-seagrass-linux) serving false => true for customer/-80 (RDONLY) reason: healthCheck update
I0726 13:15:35.098197 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-302]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":302},"hostname":"three-seagrass-linux","port_map":{"grpc":16302,"vt":15302},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17302},"Target":{"keyspace":"customer","shard":"-80","tablet_type":3},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":302},"hostname":"three-seagrass-linux","port_map":{"grpc":16302,"vt":15302},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17302},"Target":{"keyspace":"customer","shard":"-80","tablet_type":3},"MasterTermStartTime":0,"Serving":true}
I0726 13:15:35.509942 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-401 (three-seagrass-linux) serving false => true for customer/80- (REPLICA) reason: healthCheck update
I0726 13:15:35.510027 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-401]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":401},"hostname":"three-seagrass-linux","port_map":{"grpc":16401,"vt":15401},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17401},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":401},"hostname":"three-seagrass-linux","port_map":{"grpc":16401,"vt":15401},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17401},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":true}
I0726 13:16:32.927803 40734 healthcheck.go:328] Adding tablet to healthcheck: alias:{cell:"zone1" uid:402} hostname:"three-seagrass-linux" port_map:{key:"grpc" value:16402} port_map:{key:"vt" value:15402} keyspace:"customer" shard:"80-" key_range:{start:"\x80"} type:RDONLY mysql_hostname:"three-seagrass-linux" mysql_port:17402
I0726 13:16:32.927933 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-402]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":402},"hostname":"three-seagrass-linux","port_map":{"grpc":16402,"vt":15402},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17402},"Target":{"keyspace":"customer","shard":"80-","tablet_type":3},"MasterTermStartTime":0,"Serving":false}
I0726 13:16:32.928611 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-402 (three-seagrass-linux) serving false => true for customer/80- (RDONLY) reason: healthCheck update
I0726 13:16:32.928700 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-402]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":402},"hostname":"three-seagrass-linux","port_map":{"grpc":16402,"vt":15402},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17402},"Target":{"keyspace":"customer","shard":"80-","tablet_type":3},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":402},"hostname":"three-seagrass-linux","port_map":{"grpc":16402,"vt":15402},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17402},"Target":{"keyspace":"customer","shard":"80-","tablet_type":3},"MasterTermStartTime":0,"Serving":true}
For the resharding step:
vtctlclient Reshard customer.cust2cust '0' '-80,80-'
There are no health events in vtgate. This is expected because we're only setting up the resharding streams, not doing a cutover.
When switching reads:
vtctlclient SwitchReads -tablet_type=rdonly customer.cust2cust
vtctlclient SwitchReads -tablet_type=replica customer.cust2cust
There are no health events in vtgate either. This is already a bit fishy. None of the tablets change state (at least not visibly to the vtgate health check). Shouldn't this be part of the cutover?
When switching writes:
vtctlclient SwitchWrites customer.cust2cust
Here's the other not-cromulent thing I'm seeing: there's just a single event in the HealthCheck stream here, the old tablet 200 going from serving to not serving:
I0726 15:49:21.076913 55874 tabletgateway.go:138]
[HEALTH CHECK zone1-200]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":200},"hostname":"three-seagrass-linux","port_map":{"grpc":16200,"vt":15200},"keyspace":"customer","shard":"0","type":1,"mysql_hostname":"three-seagrass-linux","mysql_port":17200,"master_term_start_time":{"seconds":1627305552,"nanoseconds":29687806}},"Target":{"keyspace":"customer","shard":"0","tablet_type":1},"MasterTermStartTime":1627305552,"Serving":true}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":200},"hostname":"three-seagrass-linux","port_map":{"grpc":16200,"vt":15200},"keyspace":"customer","shard":"0","type":1,"mysql_hostname":"three-seagrass-linux","mysql_port":17200,"master_term_start_time":{"seconds":1627305552,"nanoseconds":29687806}},"Target":{"keyspace":"customer","shard":"0","tablet_type":1},"MasterTermStartTime":1627305552,"Serving":false}
This is not quite what I heard from @rohit-nayak-ps / @deepthi in previous comments:
when cutover starts we transition the primary of the old shard to NON_SERVING, and when cutover ends we transition the primaries of new shards to SERVING.
We can see how the primary of the old shard becomes non-serving, but the primaries of the new shards have been marked as SERVING since they were initially instantiated. Is this an issue with the way our examples are written (i.e. this behavior is not what we'd see in a production cluster)? Or is the previous description just not accurate at all?
Update on this:
We had a meeting last week where we made some decisions on how to approach this design without over-relying on the topology server. The two key points we agreed on are:
1. To detect the start of a failover/resharding event, we're going to rely on the errors that vttablet returns. We need to change the way vttablet returns these errors, so they're always properly tagged as failover errors, and we also need to refactor the way these are handled and scoped up in the vtgate, so they're propagated high enough to be handled by the buffering code -- which, as discussed previously in this issue, needs to hook into the call path higher than it does right now in order to allow for re-planning the whole query before retrying it.
2. To detect the end of a failover/resharding event, we're going to use the topology watcher like we've done in the past. This doesn't make the situation any worse. We expect to find two kinds of events after which buffering should be aborted, which can be detected by:
   - SrvKeyspace changes, which should let us know when a failover or a resharding operation is finished.
   - SrvVSchema changes, which should let us know when a MoveTables operation is finished (this is another failover-like event that requires buffering and cannot be detected via SrvKeyspace changes).

To support this use case, I've refactored the resilient server in https://github.com/vitessio/vitess/pull/8583 so we can watch both event types, and so watching coalesces concurrent requests without affecting performance.
I hope this is a good enough summary! Work on step 2 is over, I'm now tackling step 1 which will require changing both vttablet and vtgate.
Thank you for the update @vmg . This all makes sense to me.
I have started working on this functionality for Vitess. To elaborate on the problem, which is already listed in https://github.com/vitessio/vitess/issues/7061: we want to improve the way we perform buffering to be able to handle more complex resharding situations.
Currently, the main entrypoint for buffering is in tabletgateway.go, in the TabletGateway.withRetry helper that wraps all the method calls for our QueryService: https://github.com/vitessio/vitess/blob/4bebb8e5a6097772bcb8a4d1c187f14782158069/go/vt/vtgate/tabletgateway.go#L216-L234
Here are a few important things to note:
- There is an equivalent retry path in DiscoveryGateway, but this whole API has been deprecated in favor of TabletGateway, so we can ignore it. :ok_hand:
- Buffering only applies to queries targeting the master tablet.

Let's look at the two critical issues which we intend to fix:
The way buffering works right now is by keeping a table of "buffers" for every keyspace and shard pair. This "shard buffer" is created lazily the first time a query for a given keyspace+shard arrives at the vtgate, and by default it is disabled, meaning that requests pass through directly and are not buffered. The first time a request to a given keyspace+shard returns an error, the retry code in TabletGateway will retry the buffering before retrying the request, this time passing in the upstream error. If the buffering code detects that the upstream error is a failover error (this is explicitly checked at https://github.com/vitessio/vitess/blob/30250f51afd98dc1629b233959b60a916d88f7ff/go/vt/vtgate/buffer/buffer.go#L263-L290), then the buffer for the keyspace+shard will be marked as "enabled", and the current query will start buffering. Any subsequent queries that target the same keyspace+shard will also start buffering once they reach the buffering check, because their buffer is now marked as "enabled".

This is what we mean by "lazy buffering": the only way a request will start buffering is once it has failed at least once for its keyspace+shard. The retry for this request, and any follow-up requests, will then buffer. In an ideal design for this system, the buffering code would proactively health-check the different shards in the cluster so that we can buffer a request right away once we detect a failover, without having to use an error response as the detection mechanism for the failover.
The other problem with this design is caused by the scoping to keyspace+shard which we've already discussed: this works fine for a master failover in a given shard, but fails critically during re-sharding operations. When we're performing a cutover after re-sharding a Vitess cluster, we're not only handling a master-slave failover for a specific shard; we're potentially changing the number of shards in the cluster and their respective masters. There is no clear way to detect when the failover is finished in a specific shard, because the re-sharding operation necessarily affects more than one shard, and the resulting topology for our cluster can potentially force us to retry the buffered query in a different shard than the one we had initially planned before we started buffering.
The suggested fix
The fix that has been agreed on in our discussions so far implies lifting the buffering code so it happens way before the current checks in TabletGateway, and consequently, so it is not specifically scoped to a keyspace+shard pair.
A suggested location to move our buffering logic to would be in the Executor for the vtgates, roughly after the initial planning has succeeded, but before the plan is executed: https://github.com/vitessio/vitess/blob/30250f51afd98dc1629b233959b60a916d88f7ff/go/vt/vtgate/plan_execute.go#L52-L82
The idea would be to provide an API that takes the suggested plan and verifies whether the shards it affects need buffering (i.e. because they're being re-sharded or in the middle of a failover). If that's the case, we'd buffer the request until we detect that the whole re-sharding operation is complete, and then we would retry the planning to ensure the final plan we execute is updated based on the cluster's topology after resharding.
This design assumes that we can implement a pro-active API for the buffering code that can tell us whether a given plan needs buffering before execution and, ideally, also whether the given plan would need to be re-planned after the buffering because it's been affected by a topology change. This second part is not mandatory for correctness, but feels like an important optimization for V2 -- for V1 of this implementation, we can assume that any plan that has been buffered before executing always needs to be re-planned, which is safe to assume.
However: this raises an issue with actual upstream retries. Right now this code is implemented in the withRetry helper for TabletGateway, which we've already discussed, and as part of the retrying logic it includes the potential buffering when the error being retried is detected as a failover error. What happens with this retry code once we lift the buffering logic out of it?

My assumption is that TabletGateway needs to keep its retry logic (mostly because this logic includes the shuffling between the different tablets in a shard so they can be retried when one of them is down), but now we also need to add retrying logic higher in the stack (i.e. where we're now buffering), so that we can enable the buffering based on the returned error in cases where we didn't detect a re-sharding or failover proactively. Does this sound correct? Will this extra retry logic introduce any unforeseen consequences? For starters, it seems like any failover-related errors in TabletGateway should be changed to bubble up immediately and not be retried, but are there any other changes that need to be introduced?

@sougou @harshit-gangal @rohit-nayak-ps: this is everything I understand about the problem so far. Could you please read through this and correct me where I'm wrong? I'll keep updating the issue in follow-up comments as I research further.