Closed mdlayher closed 9 months ago
As a note, I've seen similar issues with vtorc and would encourage adding concurrency there as well:
2023-12-01 00:08:51.248
GetShard(test, 84-86) failed
2023-12-01 00:08:51.248
E1201 00:08:51.248059 1 keyspace_shard_discovery.go:129] deadline exceeded: /vitess/user-data/vitess-xxx/global/keyspaces/test/shards/84-86/Shard
@GuptaManan100, cc for visibility
Thank you @mdlayher for fixing it for VTOrc too ❤️
Overview of the Issue
When deploying a Vitess keyspace with 128 shards and employing a federated toposerver distributed amongst different geographical regions, we have found that calls to vtctld's RebuildKeyspaceGraph can take 30+ seconds to complete.
We have tracked this down to the topo server's FindAllShardsInKeyspace, where the error is related to context cancelation after 30+ seconds.
I will send a PR.
Reproduction Steps
Spin up a large sharded keyspace with a federated topo server and attempt to issue RebuildKeyspaceGraph via vtctld.
Binary Version
Operating System and Environment details
Log Fragments
Note that the action fails after approximately 30 seconds due to the caller's timeout.