vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

vtctl vreplication workflow commands fail when there are unused shards in the target keyspace #9328

Open mattlord opened 2 years ago

mattlord commented 2 years ago

Overview of the Issue

VReplication workflows are orchestrated by the primary tablets in the target keyspace. When you create a new VReplication workflow, a record is inserted into the _vt.vreplication table on each primary tablet in the target keyspace, and each target tablet then orchestrates things from there, first selecting a source tablet for its vstream. These records are queried when you monitor the workflow, and they are updated as the workflow progresses and its state changes.

A side effect of these implementation details is that when you issue, for example, a vtctlclient -server=<server> Workflow MyTargetKeyspace.MyWorkflowName Show command, vtctl first finds the PRIMARY tablet for each shard in MyTargetKeyspace and executes the following SQL query against each of them to get the status of any relevant vreplication streams (vt_<keyspace> is just the default DB name, which can be overridden with -init_db_name_override):

select id, source, message, cell, tablet_types from _vt.vreplication where workflow="MyWorkflowName" and db_name="vt_MyTargetKeyspace"

You can see this code here: https://github.com/vitessio/vitess/blob/release-12.0/go/vt/vtctl/workflow/traffic_switcher.go#L192-L216
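
To make the failure mode concrete, here is a minimal Go sketch of the per-shard primary lookup described above. It is not the Vitess implementation: the shardInfo type, the collectPrimaries function, and the tablet alias are hypothetical names chosen for illustration. It only shows how walking every shard in the keyspace turns a single shard without a primary into an error for the whole command.

```go
package main

import "fmt"

// shardInfo is a hypothetical, simplified stand-in for a shard record in the
// topology. An empty primaryAlias means the shard has no primary yet.
type shardInfo struct {
	name         string
	primaryAlias string
}

// collectPrimaries walks every shard in the target keyspace and fails as soon
// as one has no primary, even if that shard takes no part in the workflow.
// This mirrors the behavior reported in this issue, not the actual Vitess code.
func collectPrimaries(shards []shardInfo) ([]string, error) {
	primaries := make([]string, 0, len(shards))
	for _, s := range shards {
		if s.primaryAlias == "" {
			return nil, fmt.Errorf("no primary found for shard %s", s.name)
		}
		primaries = append(primaries, s.primaryAlias)
	}
	return primaries, nil
}

func main() {
	shards := []shardInfo{
		{name: "0", primaryAlias: "zone1-0000000200"}, // illustrative alias
		{name: "-80"},                                 // new shard, no primary yet
	}
	if _, err := collectPrimaries(shards); err != nil {
		fmt.Println("Workflow Error:", err)
	}
}
```

In this sketch the workflow only involves shard 0, yet the lookup still fails because shard -80 is part of the same keyspace.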

You can then run into problems when, for example, you have an active MoveTables workflow running and realize partway through that you will need to Reshard the target keyspace, so you begin preparing the new shards ahead of time. In this state you are forced to run InitShardPrimary on the new shards in the target keyspace, even though you may not want them serving or otherwise available yet. Without doing so, you cannot execute any further vtctl vreplication workflow commands to monitor, complete, revert, or delete the existing workflow(s) in the keyspace.

Reproduction Steps

Using the docker_local container:

make docker_local && ./docker/local/run.sh

./201_customer_tablets.sh ; ./202_move_tables.sh

vtctlclient Workflow customer.commerce2customer Show

CELL=zone1 TABLET_UID=203 ./scripts/mysqlctl-up.sh
CELL=zone1 KEYSPACE=customer SHARD="-80" TABLET_UID=203 ./scripts/vttablet-up.sh

vtctlclient Workflow customer.commerce2customer Show

You will see that the final command produces an error:

Workflow Error: rpc error: code = Unknown desc = no primary found for shard -80
E1206 19:46:25.467165    5009 main.go:76] remote error: rpc error: code = Unknown desc = no primary found for shard -80

Binary version

Example:

vitess@13719a78c837:/vt/local$ vtgate --version
ERROR: logging before flag.Parse: E1206 19:47:19.925607    5031 syslogger.go:149] can't connect to syslog
Version: 13.0.0-SNAPSHOT (Git revision 2e22f46bbb branch 'main') built on Mon Dec  6 19:24:42 UTC 2021 by vitess@d0ee0b853a9e using go1.17 linux/amd64

mattlord commented 3 weeks ago

I was thinking about how to determine what shards to actually use in these cases and it's not so clear how we could correctly do this in various scenarios.

BUT, you can now specify what shards to operate on in various workflow commands. So on main/v21, building on the same basic test case:

git checkout main && make build

cd examples/local
alias vtctldclient='command vtctldclient --server=localhost:15999'

./101_initial_cluster.sh; mysql < ../common/insert_commerce_data.sql; ./201_customer_tablets.sh; ./202_move_tables.sh

CELL=zone1 TABLET_UID=300 ../common/scripts/mysqlctl-up.sh
SHARD=-80 CELL=zone1 KEYSPACE=customer TABLET_UID=300 ../common/scripts/vttablet-up.sh

vtctldclient MoveTables --workflow commerce2customer --target-keyspace customer show

vtctldclient MoveTables --workflow commerce2customer --target-keyspace customer --shards 0 show

The first show command returns nothing. The second one returns the expected output.

I'm thinking that this is a good solution here, as the user can specify which of the serving shards they care about. What do you think, @timvaillancourt and @arthurschreiber?
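
As a rough illustration of why naming the shards sidesteps the problem, here is a minimal Go sketch of shard selection. It is not how vtctldclient implements --shards: selectShards is a hypothetical helper, and both the fallback to all shards on an empty request and the error for an unknown shard are assumptions made for this sketch only.

```go
package main

import "fmt"

// selectShards is a hypothetical helper: given every shard in the target
// keyspace and the shards named by the user, it returns only the named ones.
// An empty request keeps every shard (an assumption made for this sketch).
func selectShards(all, requested []string) ([]string, error) {
	if len(requested) == 0 {
		return all, nil
	}
	known := make(map[string]bool, len(all))
	for _, s := range all {
		known[s] = true
	}
	selected := make([]string, 0, len(requested))
	for _, s := range requested {
		if !known[s] {
			return nil, fmt.Errorf("shard %s not found in keyspace", s)
		}
		selected = append(selected, s)
	}
	return selected, nil
}

func main() {
	all := []string{"0", "-80"} // shards present in the customer keyspace
	selected, err := selectShards(all, []string{"0"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Only the selected shards would be asked for their primary tablet.
	fmt.Println(selected)
}
```

Because shard -80 is never selected here, its missing primary is never consulted.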

timvaillancourt commented 2 weeks ago

> The first show command returns nothing. The second one returns the expected output.

@mattlord I think the ability to specify shards is useful but if I understand everything correctly, the 1st show returning nothing feels potentially confusing to the user

mattlord commented 2 weeks ago

> > The first show command returns nothing. The second one returns the expected output.
>
> @mattlord I think the ability to specify shards is useful but if I understand everything correctly, the 1st show returning nothing feels potentially confusing to the user

Yeah, that's a general issue today in vtctldclient GetWorkflows and vtctldclient <wf_type> show etc. It has nothing specifically to do with this discussion -- and it's something that I'd like to improve (how we handle cases where there are no matching workflow(s) returned).