This issue was closed by vmg 2 years ago.
More thoughts on proactively detecting re-sharding operations and failovers:
As explained in the previous comment, right now there's a shardBuffer data structure that is lazily allocated for every keyspace+shard pair on first access, and whose buffering logic is only activated the first time there's an error in the destination shard. This design doesn't seem feasible once we lift this logic higher in the stack.
I think the solution we're looking for is a global map (one per vtgate) of "cluster events" that is updated in real time. Each event would contain information about which keyspaces and shards are affected, and auxiliary data structures would then let us, when a query plan comes in, efficiently detect whether any of the shards where the plan would execute are currently affected by an event, and if so, buffer the plan before attempting it. Likewise, the event itself would keep a goroutine actively polling the health of the cluster so that it can detect its own resolution; when that happens, it would flush all the currently buffered requests that were waiting on the event.
Does this seem reasonable? I think the hardest problem about this is the case where we don't proactively detect an event in the cluster, and instead we receive a failover error from one of the tablets. Converting this tablet (and hence shard-) specific error into a global event that could potentially affect more than one shard seems complicated. How would we handle this?
DiscoveryGateway is deprecated, replaced with TabletGateway. The logic in withRetry is pretty much identical in the two, so this doesn't make a material difference to the design. We probably don't want to implement the new buffering in DiscoveryGateway, which might require us to keep the current shard_buffer logic around (or maybe we go ahead and delete the deprecated gateway now).
The healthcheck module tracks tablets per Target - i.e. keyspace/shard/tablet_type. When a failover succeeds, the promotion event is sent to subscribers of this healthcheck module. This is probably the part of the code to look at next to see if it can be adapted easily to provide the information we need to detect the end of a resharding cutover: https://github.com/vitessio/vitess/blob/main/go/vt/discovery/healthcheck.go#L514-L523
There are currently two code paths for query execution (Execute and StreamExecute); both should be covered by buffering.
It is OK to create the event just for the shard whose tablet received the error. This way the event will accumulate on its own as errors are received from more tablets.
DiscoveryGateway is deprecated, replaced with TabletGateway. The logic in withRetry is pretty much identical in the two, so this doesn't make a material difference to the design. We probably don't want to implement the new buffering in DiscoveryGateway which might require us to keep the current shard_buffer logic around (or maybe we go ahead and delete the deprecated gateway now).
Thank you for pointing this out! Yes, the logic for retries looks copy-and-pasted between the two Gateways, so let's scope this down to TabletGateway. I'll update the issue accordingly.
OK, I've spent this afternoon looking at the health checking code, and I have questions about it. Using this comment from @rohit-nayak-ps as a reference, here's how he believes we could model the two events (the start and end of a resharding):
To detect the start of a resharding, we would check the SrvKeyspace in the topo.Server for the keyspace where the query's plan will execute (a query plan will only execute in a single keyspace -- is this correct?). So, to sketch this out:

srvKeyspace, err := ts.GetSrvKeyspace(ctx, cell, keyspace)
if err != nil {
	return err
}
for _, partition := range srvKeyspace.GetPartitions() {
	for _, shardTabletControl := range partition.GetShardTabletControls() {
		if shardTabletControl.QueryServiceDisabled {
			// this query needs to buffer
		}
	}
}
Questions about this: we clearly know the keyspace for a plan, but what's its cell? Do we need to use GetSrvKeyspaceAllCells? Is this even something that we could run once per query? This would be powered by ResilientServer's caching, because reaching into the topo.Server directly (i.e. etcd) for every single call would be unfeasible. Is the caching here going to lead to races, though?
The only place where WatchSrvKeyspace is being called right now is from the ResilientServer -- it's used for proactively fetching keyspace information so that GetSrvKeyspace returns up-to-date information. This usage of WatchSrvKeyspace, however, does not expose the actual "diff" against the previous version of the keyspace metadata, which is what we would need to detect changes in ShardReferences. Is this functionality you think we should bolt on top of ResilientServer, or do we need a separate implementation, even though that would double the number of Watch connections into the upstream topo server (etcd, ...) for each vtgate?
There might be a simpler way to detect cutover start and stop. In the original comment @rohit-nayak-ps says that when cutover starts we transition the primary of the old shard to NON_SERVING, and when cutover ends we transition the primaries of new shards to SERVING. Serving state is available in healthcheck and we could use that to notify the buffering code. This will side-step the topo/cell issue altogether.
when cutover ends we transition the primaries of new shards to SERVING.
I've been playing with this and it seems complicated, because the primaries of the new shards are transitioned asynchronously. The HealthCheck API lets us subscribe to all new tablet health events through a channel, but these events come through one by one.
Say that we're attempting to run a query that targets shard X: we see that the primary of shard X is currently not serving, so we start buffering the query. Since we receive shard health updates one by one, we'll eventually receive a health update for a shard we've never seen before (the result of the resharding), which won't be serving, and then another update that marks that shard as serving. The issue here is that resharding will usually (always?) result in more than one new shard, but we do not know how many new shards we are expecting to see in the HealthCheck stream.
The best way to work around this issue seems to be guessing when the shards we've seen cover the whole keyspace, by parsing the shard names and putting them together -- I am just not sure whether this is always safe to do for all sharding cases (and whether we have existing code in Vitess to parse shard names and check that they add up to the whole keyspace).
There are some KeyRange/Shard utilities here: https://github.com/vitessio/vitess/blob/main/go/vt/key/key.go and here: https://github.com/vitessio/vitess/blob/main/go/vt/topotools/split.go
The issue here is that re-sharding will usually (always?) result in more than one new shard, but we do not know how many new shards we are expecting to see in the HealthCheck stream.
Eventually we also need to worry about "Merging" shards where we may end up with fewer shards (as few as one).
Eventually we also need to worry about "Merging" shards where we may end up with fewer shards (as few as one).
Right, I thought about this use case, and from my understanding, the resulting shard names for the merge should also add up to the whole keyspace. Is there any other sharding operation that could result in a partial keyspace and hence couldn't be detected by this heuristic?
At the keyspace level it should be safe to decide when we have access to the whole keyspace by parsing and putting together the names of the shards.
There are people doing large reshards where they break it up into individual vreplication streams and manage the cutovers themselves. This won't work for those cases. We might have to explicitly warn that buffering should not be used in those cases.
This also means that we should introduce a new flag for buffering during resharding cutovers, and not trigger it using the existing buffering flag enable_buffer.
I think we should put this information up front in the topo server, so that we do not have to assume it. Whenever such an operation happens, it is stored in the topo server to be discovered by the buffering code.
Following up on the feedback on #8514, I've been researching other approaches to detect these sharding operations proactively that do not depend on the topology server.
@harshit-gangal suggested to run the example resharding workflows in the Vitess examples locally and analyze the resulting health check event patterns to see how much information we can gather from it. It's not looking good.
For shard initialization:
for i in 300 301 302; do
CELL=zone1 TABLET_UID=$i ./scripts/mysqlctl-up.sh
SHARD=-80 CELL=zone1 KEYSPACE=customer TABLET_UID=$i ./scripts/vttablet-up.sh
done
for i in 400 401 402; do
CELL=zone1 TABLET_UID=$i ./scripts/mysqlctl-up.sh
SHARD=80- CELL=zone1 KEYSPACE=customer TABLET_UID=$i ./scripts/vttablet-up.sh
done
vtctldclient InitShardPrimary --force customer/-80 zone1-300
vtctldclient InitShardPrimary --force customer/80- zone1-400
Here are the actual health status changes as seen on the vtgate:
I0726 13:15:32.928007 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-401]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":401},"hostname":"three-seagrass-linux","port_map":{"grpc":16401,"vt":15401},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17401},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928001 40734 healthcheck.go:328] Adding tablet to healthcheck: alias:{cell:"zone1" uid:400} hostname:"three-seagrass-linux" port_map:{key:"grpc" value:16400} port_map:{key:"vt" value:15400} keyspace:"customer" shard:"80-" key_range:{start:"\x80"} type:REPLICA mysql_hostname:"three-seagrass-linux" mysql_port:17400
I0726 13:15:32.928025 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-301]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":301},"hostname":"three-seagrass-linux","port_map":{"grpc":16301,"vt":15301},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17301},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928042 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-300]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":300},"hostname":"three-seagrass-linux","port_map":{"grpc":16300,"vt":15300},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17300},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928056 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-400]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":400},"hostname":"three-seagrass-linux","port_map":{"grpc":16400,"vt":15400},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17400},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928717 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-400 (three-seagrass-linux) serving false => false for customer/80- (REPLICA) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:32.928717 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-302 (three-seagrass-linux) serving false => false for customer/-80 (RDONLY) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:32.928739 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-301 (three-seagrass-linux) serving false => false for customer/-80 (REPLICA) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:32.928743 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-401 (three-seagrass-linux) serving false => false for customer/80- (REPLICA) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:32.928797 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-302]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":302},"hostname":"three-seagrass-linux","port_map":{"grpc":16302,"vt":15302},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17302},"Target":{"keyspace":"customer","shard":"-80","tablet_type":3},"MasterTermStartTime":0,"Serving":false}
I0726 13:15:32.928808 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-300 (three-seagrass-linux) serving false => false for customer/-80 (REPLICA) reason: healthCheck update error: vttablet error: no replication status
I0726 13:15:33.900173 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-300 (three-seagrass-linux) serving false => true for customer/-80 (MASTER) reason: healthCheck update
E0726 13:15:33.900217 40734 healthcheck.go:483] Adding 1 to MasterPromoted counter for target: keyspace:"customer" shard:"-80" tablet_type:REPLICA, tablet: zone1-0000000300, tabletType: MASTER
I0726 13:15:33.900700 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-300]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":300},"hostname":"three-seagrass-linux","port_map":{"grpc":16300,"vt":15300},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17300},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":300},"hostname":"three-seagrass-linux","port_map":{"grpc":16300,"vt":15300},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17300},"Target":{"keyspace":"customer","shard":"-80","tablet_type":1},"MasterTermStartTime":1627298133,"Serving":true}
I0726 13:15:34.508368 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-400 (three-seagrass-linux) serving false => true for customer/80- (MASTER) reason: healthCheck update
E0726 13:15:34.508395 40734 healthcheck.go:483] Adding 1 to MasterPromoted counter for target: keyspace:"customer" shard:"80-" tablet_type:REPLICA, tablet: zone1-0000000400, tabletType: MASTER
I0726 13:15:34.508503 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-400]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":400},"hostname":"three-seagrass-linux","port_map":{"grpc":16400,"vt":15400},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17400},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":400},"hostname":"three-seagrass-linux","port_map":{"grpc":16400,"vt":15400},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17400},"Target":{"keyspace":"customer","shard":"80-","tablet_type":1},"MasterTermStartTime":1627298134,"Serving":true}
I0726 13:15:34.594006 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-301 (three-seagrass-linux) serving false => true for customer/-80 (REPLICA) reason: healthCheck update
I0726 13:15:34.594076 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-301]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":301},"hostname":"three-seagrass-linux","port_map":{"grpc":16301,"vt":15301},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17301},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":301},"hostname":"three-seagrass-linux","port_map":{"grpc":16301,"vt":15301},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17301},"Target":{"keyspace":"customer","shard":"-80","tablet_type":2},"MasterTermStartTime":0,"Serving":true}
I0726 13:15:35.098106 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-302 (three-seagrass-linux) serving false => true for customer/-80 (RDONLY) reason: healthCheck update
I0726 13:15:35.098197 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-302]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":302},"hostname":"three-seagrass-linux","port_map":{"grpc":16302,"vt":15302},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17302},"Target":{"keyspace":"customer","shard":"-80","tablet_type":3},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":302},"hostname":"three-seagrass-linux","port_map":{"grpc":16302,"vt":15302},"keyspace":"customer","shard":"-80","key_range":{"end":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17302},"Target":{"keyspace":"customer","shard":"-80","tablet_type":3},"MasterTermStartTime":0,"Serving":true}
I0726 13:15:35.509942 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-401 (three-seagrass-linux) serving false => true for customer/80- (REPLICA) reason: healthCheck update
I0726 13:15:35.510027 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-401]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":401},"hostname":"three-seagrass-linux","port_map":{"grpc":16401,"vt":15401},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17401},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":401},"hostname":"three-seagrass-linux","port_map":{"grpc":16401,"vt":15401},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":2,"mysql_hostname":"three-seagrass-linux","mysql_port":17401},"Target":{"keyspace":"customer","shard":"80-","tablet_type":2},"MasterTermStartTime":0,"Serving":true}
I0726 13:16:32.927803 40734 healthcheck.go:328] Adding tablet to healthcheck: alias:{cell:"zone1" uid:402} hostname:"three-seagrass-linux" port_map:{key:"grpc" value:16402} port_map:{key:"vt" value:15402} keyspace:"customer" shard:"80-" key_range:{start:"\x80"} type:RDONLY mysql_hostname:"three-seagrass-linux" mysql_port:17402
I0726 13:16:32.927933 40734 tabletgateway.go:134]
[HEALTH CHECK zone1-402]
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":402},"hostname":"three-seagrass-linux","port_map":{"grpc":16402,"vt":15402},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17402},"Target":{"keyspace":"customer","shard":"80-","tablet_type":3},"MasterTermStartTime":0,"Serving":false}
I0726 13:16:32.928611 40734 tablet_health_check.go:111] HealthCheckUpdate(Serving State): tablet: zone1-402 (three-seagrass-linux) serving false => true for customer/80- (RDONLY) reason: healthCheck update
I0726 13:16:32.928700 40734 tabletgateway.go:138]
[HEALTH CHECK zone1-402]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":402},"hostname":"three-seagrass-linux","port_map":{"grpc":16402,"vt":15402},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17402},"Target":{"keyspace":"customer","shard":"80-","tablet_type":3},"MasterTermStartTime":0,"Serving":false}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":402},"hostname":"three-seagrass-linux","port_map":{"grpc":16402,"vt":15402},"keyspace":"customer","shard":"80-","key_range":{"start":"gA=="},"type":3,"mysql_hostname":"three-seagrass-linux","mysql_port":17402},"Target":{"keyspace":"customer","shard":"80-","tablet_type":3},"MasterTermStartTime":0,"Serving":true}
For the resharding step:
vtctlclient Reshard customer.cust2cust '0' '-80,80-'
There are no health events in vtgate. This is expected because we're only setting up the resharding streams, not doing a cutover.
When switching reads:
vtctlclient SwitchReads -tablet_type=rdonly customer.cust2cust
vtctlclient SwitchReads -tablet_type=replica customer.cust2cust
There are no health events in vtgate either. This is already a bit fishy. None of the tablets change state (at least not visibly to the vtgate health check). Shouldn't this be part of the cutover?
When switching writes:
vtctlclient SwitchWrites customer.cust2cust
Here's the other not-cromulent thing I'm seeing: there's just a single event in the HealthCheck stream here, the old tablet 200 going from serving to not serving:
I0726 15:49:21.076913 55874 tabletgateway.go:138]
[HEALTH CHECK zone1-200]
OLD: {"Tablet":{"alias":{"cell":"zone1","uid":200},"hostname":"three-seagrass-linux","port_map":{"grpc":16200,"vt":15200},"keyspace":"customer","shard":"0","type":1,"mysql_hostname":"three-seagrass-linux","mysql_port":17200,"master_term_start_time":{"seconds":1627305552,"nanoseconds":29687806}},"Target":{"keyspace":"customer","shard":"0","tablet_type":1},"MasterTermStartTime":1627305552,"Serving":true}
NEW: {"Tablet":{"alias":{"cell":"zone1","uid":200},"hostname":"three-seagrass-linux","port_map":{"grpc":16200,"vt":15200},"keyspace":"customer","shard":"0","type":1,"mysql_hostname":"three-seagrass-linux","mysql_port":17200,"master_term_start_time":{"seconds":1627305552,"nanoseconds":29687806}},"Target":{"keyspace":"customer","shard":"0","tablet_type":1},"MasterTermStartTime":1627305552,"Serving":false}
This is not quite what I heard from @rohit-nayak-ps / @deepthi in previous comments:
when cutover starts we transition the primary of the old shard to NON_SERVING, and when cutover ends we transition the primaries of new shards to SERVING.
We can see how the primary of the old shard becomes non-serving, but the primaries of the new shards have been marked as SERVING since they were initially instantiated. Is this an issue with the way our examples are written (i.e. this behavior is not what we'd see in a production cluster)? Or is the previous description just not accurate at all?
Update on this:
We had a meeting last week where we made some decisions on how to approach this design without over-relying on the topology server. The two key points we agreed on are:
1. To detect the start of a failover/resharding event, we're going to rely on the errors that vttablet returns. We need to change the way vttablet returns these errors, so they're always properly tagged as failover errors, and we also need to refactor the way these are handled and scoped up in the vtgate, so they're propagated high enough to be handled by the buffering code -- which, as discussed previously in this issue, needs to hook into the call path higher than it does right now in order to allow for re-planning the whole query before retrying it.
2. To detect the end of a failover/resharding event, we're going to use the topology watcher like we've done in the past. This doesn't make the situation any worse. We expect to find two kinds of events after which buffering should be aborted, which can be detected by:
   - SrvKeyspace changes, which should let us know when a failover or a resharding operation is finished.
   - SrvVSchema changes, which should let us know when a MoveTables operation is finished (this is another failover-like event that requires buffering and cannot be detected via SrvKeyspace changes).

To support this use case, I've refactored the resilient server in https://github.com/vitessio/vitess/pull/8583 so we can watch both event types, and so watching coalesces concurrent requests without affecting performance.
I hope this is a good enough summary! Work on step 2 is over, I'm now tackling step 1 which will require changing both vttablet and vtgate.
Thank you for the update @vmg . This all makes sense to me.
I have started working on this functionality for Vitess. To elaborate on the problem, which is already listed in https://github.com/vitessio/vitess/issues/7061: we want to improve the way we perform buffering to be able to handle more complex resharding situations.
Currently, the main entrypoint for buffering is in tabletgateway.go, in the TabletGateway.withRetry helper that wraps all the method calls for our QueryService: https://github.com/vitessio/vitess/blob/4bebb8e5a6097772bcb8a4d1c187f14782158069/go/vt/vtgate/tabletgateway.go#L216-L234
Here are a few important things to note:
- There is an equivalent retry path in DiscoveryGateway, but this whole API has been deprecated in favor of TabletGateway, so we can ignore it. :ok_hand:
- Buffering only applies to queries targeting the master tablet.

Let's look at the two critical issues which we intend to fix:
The way buffering works right now is by keeping a table of "buffers" for every keyspace and shard pair. This "shard buffer" is created lazily the first time a query for a given keyspace+shard arrives at the vtgate, and by default it is disabled, meaning that requests pass through directly and are not buffered. The first time a request to a given keyspace+shard returns an error, the retry code in TabletGateway will retry the buffering before retrying the request, this time passing in the upstream error. If the buffering code detects that the upstream error is a failover error (this is explicitly checked at https://github.com/vitessio/vitess/blob/30250f51afd98dc1629b233959b60a916d88f7ff/go/vt/vtgate/buffer/buffer.go#L263-L290), then the buffer for the keyspace+shard will be marked as "enabled", and the current query will start buffering. Any subsequent queries that target the same keyspace+shard will also start buffering once they reach the buffering check, because their buffer is now marked as "enabled".

This is what we mean by "lazy buffering": the only way a request will start buffering is once it has failed at least once for its keyspace+shard. The retry for this request, and any follow-up requests, will then buffer. In an ideal design for this system, the buffering code would proactively health-check the different shards in the cluster so that we can buffer a request right away once we detect a failover, without having to use an error response as the detection mechanism for the failover.
The other problem with this design is caused by the scoping to keyspace+shard which we've already discussed: this works fine for a master failover in a given shard, but fails critically during re-sharding operations. When we're performing a cutover after re-sharding a Vitess cluster, we're not only handling a master-slave failover for a specific shard; we're potentially changing the number of shards in the cluster and their respective masters. There is no clear way to detect when the failover is finished in a specific shard, because the re-sharding operation necessarily affects more than one shard, and the resulting topology for our cluster can potentially force us to retry the buffered query in a different shard than the one we had initially planned before we started buffering.
The suggested fix
The fix that has been agreed on in our discussions so far implies lifting the buffering code so it happens way before the current checks in TabletGateway, and consequently, so it is not specifically scoped to a keyspace+shard pair.
A suggested location to move our buffering logic to would be in the Executor for the vtgates, roughly after the initial planning has succeeded, but before the plan is executed: https://github.com/vitessio/vitess/blob/30250f51afd98dc1629b233959b60a916d88f7ff/go/vt/vtgate/plan_execute.go#L52-L82
The idea would be to provide an API that takes the suggested plan and verifies whether the shards it affects need buffering (i.e. because they're being re-sharded or in the middle of a failover). If that's the case, we'd buffer the request until we detect that the whole re-sharding operation is complete, and then we would retry the planning to ensure the final plan we execute is updated based on the cluster's topology after resharding.
This design assumes that we can implement a pro-active API for the buffering code that can tell us whether a given plan needs buffering before execution and, ideally, also whether the given plan would need to be re-planned after the buffering because it's been affected by a topology change. This second part is not mandatory for correctness, but feels like an important optimization for V2 -- for V1 of this implementation, we can assume that any plan that has been buffered before executing always needs to be re-planned, which is safe to assume.
However: this raises an issue with actual upstream retries. Right now this code is implemented in the withRetry helper for TabletGateway, which we've already discussed, and as part of the retrying logic it includes the potential buffering when the error being retried is detected as a failover error. What happens with this retry code once we lift the buffering logic out of it?

My assumption is that TabletGateway needs to keep its retry logic (mostly because this logic includes the shuffling between the different tablets in a shard so they can be retried when one of them is down), but now we also need to add retrying logic higher in the stack (i.e. where we're now buffering), so that we can enable the buffering based on the returned error in cases where we didn't detect a re-sharding or failover proactively. Does this sound correct? Will this extra retry logic introduce any unforeseen consequences? For starters, it seems like any failover-related errors in TabletGateway should be changed to bubble up immediately and not be retried, but are there any other changes that need to be introduced?

@sougou @harshit-gangal @rohit-nayak-ps: this is everything I understand about the problem so far. Could you please read through this and correct me where I'm wrong? I'll keep updating the issue in follow-up comments as I research further.