VReplication workflow metadata is stored in each target shard that participates in the workflow. As a result, every workflow command that shows status or changes the state of a workflow needs to check every shard to see whether it is participating in the workflow.
This can get extremely inefficient when a large number of shards are involved and the workflow has only a few shards participating. Consider a 256-shard keyspace where we are doing a partial MoveTables for a single shard. For every `Workflow Show`, vtctld will contact 256 primaries even though only one needs to be contacted.
It gets worse for `SwitchTraffic`, which could end up timing out because of the initial overhead of wasted calls to non-participating primaries.
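The fan-out described above can be sketched as follows. This is a minimal illustration, not the actual Vitess code; the `shard` type and `findParticipants` helper are hypothetical stand-ins for the topo types and the per-primary RPCs.

```go
package main

import "fmt"

// shard is an illustrative stand-in for a target shard; "participates"
// models whether the shard's primary has rows for the workflow.
type shard struct {
	name         string
	participates bool
}

// findParticipants models the current behaviour: every target-shard
// primary is queried just to learn whether it takes part in the workflow.
func findParticipants(shards []shard) (participating []string, rpcs int) {
	for _, s := range shards {
		rpcs++ // one RPC to this shard's primary
		if s.participates {
			participating = append(participating, s.name)
		}
	}
	return participating, rpcs
}

func main() {
	// A 256-shard keyspace with a partial MoveTables touching one shard.
	shards := make([]shard, 256)
	for i := range shards {
		shards[i] = shard{name: fmt.Sprintf("shard-%d", i)}
	}
	shards[7].participates = true

	p, rpcs := findParticipants(shards)
	fmt.Printf("participating=%v rpcs=%d\n", p, rpcs)
}
```

One participant still costs 256 RPCs; for a mutating command each wasted call also adds latency that counts against the command's timeout.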
We originally chose not to store any workflow information in the topo: since workflows run distributed across shards, synchronizing state in the topo would be hard, with potential races.
A couple of options to resolve this:
1. Let each command take a "hint" of which shards are participating, so that we contact only those primaries.
However, this is neither reliable nor convenient: the user needs to keep track of which shards are involved. If only a subset is provided, read-only commands like `Show`/`Progress` would merely return partial information, but mutating commands like `SwitchTraffic` or `Complete` would end up with inconsistent processing, which is unacceptable.
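A sketch of the hint option and its failure mode, assuming a hypothetical `shardsToContact` helper (not a Vitess API): with an accurate hint only the named primaries are contacted, but a stale or partial hint silently drops participants.

```go
package main

import "fmt"

// shardsToContact models the "hint" option: only the shards the user
// names are contacted; with no hint we fall back to contacting everyone.
func shardsToContact(allShards, hint []string) []string {
	if len(hint) == 0 {
		return allShards
	}
	hinted := make(map[string]bool, len(hint))
	for _, s := range hint {
		hinted[s] = true
	}
	var out []string
	for _, s := range allShards {
		if hinted[s] {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	all := []string{"-40", "40-80", "80-c0", "c0-"}
	// Accurate hint: only the one participating primary is contacted.
	fmt.Println(shardsToContact(all, []string{"80-c0"}))
	// Stale hint that misses the real participant: a mutating command run
	// over this subset would change some shards and silently skip others.
	fmt.Println(shardsToContact(all, []string{"-40"}))
}
```

For `Show`/`Progress` the second case only means incomplete output; for `SwitchTraffic` or `Complete` it means divergent state across shards.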
2. Store "read-only" information about the workflow in the topo, in the target keyspace's record.
These entries would be written only when a workflow is created, and only non-mutating attributes (workflow name, workflow type, source shards, target shards) would be stored; none of these can change once a workflow is created. The VReplication commands can then use them to filter down to the participating primaries.
If a cache entry is not found (for pre-existing workflows), we could either continue with the current behaviour or attempt to populate the cache entry, with appropriate locking.
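The second option, including the fallback for workflows created before the cache existed, might look roughly like this. The `workflowInfo` and `keyspaceRecord` types are hypothetical shapes for illustration, not the actual topo schema.

```go
package main

import "fmt"

// workflowInfo holds only the non-mutating attributes that would be
// written once, at workflow creation time.
type workflowInfo struct {
	Name         string
	Type         string // e.g. "MoveTables"
	SourceShards []string
	TargetShards []string
}

// keyspaceRecord models the target keyspace's topo record carrying the
// read-only workflow cache.
type keyspaceRecord struct {
	workflows map[string]workflowInfo
}

// targetShards returns the cached participant list, or (nil, false) for a
// pre-existing workflow with no cache entry, in which case the caller
// falls back to contacting every shard (the current behaviour).
func (k *keyspaceRecord) targetShards(workflow string) ([]string, bool) {
	wf, ok := k.workflows[workflow]
	if !ok {
		return nil, false
	}
	return wf.TargetShards, true
}

func main() {
	ks := &keyspaceRecord{workflows: map[string]workflowInfo{
		"wf1": {
			Name: "wf1", Type: "MoveTables",
			SourceShards: []string{"80-c0"},
			TargetShards: []string{"80-c0"},
		},
	}}
	if shards, ok := ks.targetShards("wf1"); ok {
		fmt.Println("contact only:", shards)
	}
	if _, ok := ks.targetShards("old-wf"); !ok {
		fmt.Println("no cache entry: fall back to all shards")
	}
}
```

Because nothing here is mutated after creation, reads need no synchronization; only the optional lazy population of missing entries would need topo locking.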