vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: Pre-existing Tablet Controls breaks MoveTables SwitchTraffic #13999

Closed FancyFane closed 5 months ago

FancyFane commented 1 year ago

Overview of the Issue

When the target keyspace has pre-existing tablet controls, MoveTables SwitchTraffic fails with an error that requires manual cleanup before reads and writes can resume. This occurs when the TabletControls contain a list of denied tables that doesn't match the tables in the currently running workflow. If the workflow's tables don't match the TabletControls one for one, an error results.

Any traffic sent after this point results in continued errors from the application until the TabletControls are removed and the shard state is refreshed.
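
The failure mode can be illustrated with a minimal Python sketch. This is not Vitess's actual code (Vitess is written in Go); the function name `remove_denied_tables` is hypothetical and only models the all-or-nothing behavior implied by the error message, where a single table missing from the denylist aborts the whole removal. The table lists are taken from the GetShard and Workflow show outputs in the reproduction steps.

```python
def remove_denied_tables(denylist, tables):
    """Strictly remove `tables` from `denylist`.

    Hypothetical model of the behavior described in this issue: if any
    requested table is absent from the denylist, nothing is removed and
    an error is raised instead.
    """
    missing = [t for t in tables if t not in denylist]
    if missing:
        raise ValueError(
            "cannot remove tables since one or more do not exist in the denylist"
        )
    return [t for t in denylist if t not in tables]


# Denylist left behind by the cancelled workflow: sbtest1-6 plus testing.
stale_denylist = [f"sbtest{i}" for i in range(1, 7)] + ["testing"]
# Tables in the new workflow: sbtest1-8 plus testing.
new_workflow_tables = [f"sbtest{i}" for i in range(1, 9)] + ["testing"]

try:
    remove_denied_tables(stale_denylist, new_workflow_tables)
except ValueError as err:
    print(f"SwitchTraffic would fail: {err}")
```

Under this model, sbtest7 and sbtest8 are the tables that trigger the error, because they were never added to the stale denylist.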

Related Issue: #13998

Reproduction Steps

  1. Do a MoveTables with 6 sbtest tables; run SwitchTraffic, then ReverseTraffic, then cancel the workflow. This results in an environment with Tablet Controls in place on the target and no running workflow.

See Issue: #13998

$ vtctlclient --server :15999 GetShard fane_import_sharded/-80
{
...
  "tablet_controls": [
    {
      "tablet_type": 1,
      "cells": [],
      "denied_tables": [
        "sbtest1",
        "sbtest2",
        "sbtest3",
        "sbtest4",
        "sbtest5",
        "sbtest6",
        "testing"
      ],
...
}
  2. Add two new sbtest tables on your source and start a new workflow. NOTE: the workflow's matching tables are sbtest1-8; however, the tablet controls only cover sbtest1-6.
$ vtctlclient --server :15999 Workflow fane_import_sharded.import-shard-80 show
{
    "Workflow": "import-shard-80",
    "SourceLocation": {
        "Keyspace": "fane_import_sharded_source",
        "Shards": [
            "-80"
        ]
    },
    "TargetLocation": {
        "Keyspace": "fane_import_sharded",
        "Shards": [
            "-80"
        ]
    },
    "MaxVReplicationLag": 1,
    "MaxVReplicationTransactionLag": 1,
    "Frozen": false,
    "ShardStatuses": {
        "-80/aws_useast1a_6-3337899395": {
            "PrimaryReplicationStatuses": [
                {
                    "Shard": "-80",
                    "Tablet": "aws_useast1a_6-3337899395",
                    "ID": 6,
                    "Bls": {
                        "keyspace": "fane_import_sharded_source",
                        "shard": "-80",
                        "filter": {
                            "rules": [
                                {
                                    "match": "sbtest1",
                                    "filter": "select * from sbtest1 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
                                },
                                {
                                    "match": "sbtest2",
                                    "filter": "select * from sbtest2 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
                                },
                                {
                                    "match": "sbtest3",
                                    "filter": "select * from sbtest3 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
                                },
                                {
                                    "match": "sbtest4",
                                    "filter": "select * from sbtest4 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
                                },
                                {
                                    "match": "sbtest5",
                                    "filter": "select * from sbtest5 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
                                },
                                {
                                    "match": "sbtest6",
                                    "filter": "select * from sbtest6 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
                                },
                                {
                                    "match": "sbtest7",
                                    "filter": "select * from sbtest7 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
                                },
                                {
                                    "match": "sbtest8",
                                    "filter": "select * from sbtest8 where in_keyrange(id, 'fane_import_sharded.hash', '-80')"
                                },
                                {
                                    "match": "testing",
                                    "filter": "select * from testing"
                                }
                            ]
                        }
                    },
                    "Pos": "7c3368f8-5412-11ee-8179-0a26551b1c25:1-1584,7c434390-5412-11ee-8c60-0a26551b1c25:1",
                    "StopPos": "",
                    "State": "Running",
                    "DBName": "fane_import_sharded",
                    "TransactionTimestamp": 0,
                    "TimeUpdated": 1694815788,
                    "TimeHeartbeat": 1694815788,
                    "TimeThrottled": 0,
                    "ComponentThrottled": "",
                    "Message": "",
                    "Tags": "",
                    "WorkflowType": "MoveTables",
                    "WorkflowSubType": "Partial",
                    "CopyState": null,
                    "RowsCopied": 0
                }
            ],
            "TabletControls": [
                {
                    "tablet_type": 1,
                    "denied_tables": [
                        "sbtest1",
                        "sbtest2",
                        "sbtest3",
                        "sbtest4",
                        "sbtest5",
                        "testing"
                    ]
                }
            ],
            "PrimaryIsServing": true
        }
    },
    "SourceTimeZone": "",
    "TargetTimeZone": ""
}
  3. Performing a SwitchTraffic fails:
$ vtctlclient --server :15999 MoveTables SwitchTraffic fane_import_sharded.import-shard-80     
E0915 22:10:10.097662     696 main.go:96] E0915 22:10:10.097104 traffic_switcher.go:625] allowTargetWrites failed: Code: INVALID_ARGUMENT
cannot remove tables since one or more do not exist in the denylist
E0915 22:10:10.114269     696 main.go:96] E0915 22:10:10.113676 vtctl.go:2215] 
cannot remove tables since one or more do not exist in the denylist

The following vreplication streams exist for workflow fane_import_sharded.import-shard-80:

id=6 on -80/aws_useast1a_6-3337899395: Status: Stopped. VStream Lag: 0s.

MoveTables Error: rpc error: code = Unknown desc = cannot remove tables since one or more do not exist in the denylist
E0915 22:10:10.216399     696 main.go:105] remote error: rpc error: code = Unknown desc = cannot remove tables since one or more do not exist in the denylist
  4. Any writes to the keyspace from the application during this time result in an error:
$ sysbench --db-driver=mysql --threads=1 --events=0 --time=0 --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=fane_import_sharded /usr/share/sysbench/oltp_insert.lua --tables=5 run
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Initializing worker threads...

Threads started!

FATAL: mysql_drv_query() returned error 1105 (target: fane_import_sharded_source.-80.primary: vttablet: rpc error: code = FailedPrecondition desc = disallowed due to rule: enforce denied tables (CallerID: admin)) for query 'INSERT INTO sbtest4 (id, k, c, pad) VALUES (0, 4098, '09169823527-14773847787-63328771402-43563606289-98835554319-17838113855-09276254645-46412092895-40264640011-92712584350', '67793249909-86081288100-12979568721-26815841297-77951231372')'
FATAL: `thread_run' function failed: /usr/share/sysbench/oltp_insert.lua:61: SQL error, errno = 1105, state = 'HY000': target: fane_import_sharded_source.-80.primary: vttablet: rpc error: code = FailedPrecondition desc = disallowed due to rule: enforce denied tables (CallerID: admin)
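
The mismatch set up in the first two reproduction steps could be detected before running SwitchTraffic by comparing the workflow's filter rules with the shard's denied tables. The sketch below is an assumption-laden illustration, not a Vitess tool: the helper names are hypothetical, and it assumes the JSON layouts match the excerpts shown above (`tablet_controls` at the top level of the GetShard output, and rules under `ShardStatuses` / `PrimaryReplicationStatuses` / `Bls` / `filter` in the Workflow show output). Real output would first be parsed with `json.loads`.

```python
def denied_tables(shard_json):
    """Collect denied tables from parsed GetShard output."""
    tables = set()
    for control in shard_json.get("tablet_controls", []):
        tables.update(control.get("denied_tables", []))
    return tables


def workflow_tables(workflow_json):
    """Collect the `match` table names from parsed Workflow show output."""
    tables = set()
    for status in workflow_json.get("ShardStatuses", {}).values():
        for stream in status.get("PrimaryReplicationStatuses", []):
            rules = stream.get("Bls", {}).get("filter", {}).get("rules", [])
            tables.update(rule["match"] for rule in rules)
    return tables


def tables_missing_from_denylist(shard_json, workflow_json):
    """Workflow tables that are absent from the shard's denylist."""
    return sorted(workflow_tables(workflow_json) - denied_tables(shard_json))


# Trimmed-down versions of the outputs shown in the reproduction steps.
shard = {
    "tablet_controls": [
        {
            "tablet_type": 1,
            "cells": [],
            "denied_tables": [f"sbtest{i}" for i in range(1, 7)] + ["testing"],
        }
    ]
}
workflow = {
    "ShardStatuses": {
        "-80/aws_useast1a_6-3337899395": {
            "PrimaryReplicationStatuses": [
                {
                    "Bls": {
                        "filter": {
                            "rules": [{"match": f"sbtest{i}"} for i in range(1, 9)]
                            + [{"match": "testing"}]
                        }
                    }
                }
            ]
        }
    }
}

print(tables_missing_from_denylist(shard, workflow))  # ['sbtest7', 'sbtest8']
```

A non-empty result indicates leftover tablet controls that do not line up with the new workflow, which is the condition this report describes.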

Recovery Steps

  1. (recovery step) To recover, remove the tablet controls and refresh the shard state on the SOURCE:
vtctldclient --server localhost:15999 SetShardTabletControl --remove fane_import_sharded_source/-80 primary; 
vtctldclient --server localhost:15999 RefreshStateByShard fane_import_sharded_source/-80;
  2. (recovery step) Writes from the application now proceed without errors:
$ sysbench --db-driver=mysql --threads=1 --events=0 --time=0 --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=fane_import_sharded /usr/share/sysbench/oltp_insert.lua --tables=5 run
WARNING: Both event and time limits are disabled, running an endless test
sysbench 1.0.20 (using system LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Initializing worker threads...

Threads started!
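
The two recovery commands can be wrapped in a small script so they are applied consistently per shard. This is only a sketch: the function names and the `dry_run` switch are hypothetical, while the `vtctldclient` subcommands and arguments are copied from the recovery step above. By default it only prints the commands and runs nothing.

```python
import subprocess


def recovery_commands(keyspace, shard, server="localhost:15999"):
    """Build the two recovery commands from this report as argument lists."""
    target = f"{keyspace}/{shard}"
    base = ["vtctldclient", "--server", server]
    return [
        base + ["SetShardTabletControl", "--remove", target, "primary"],
        base + ["RefreshStateByShard", target],
    ]


def recover(keyspace, shard, server="localhost:15999", dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in recovery_commands(keyspace, shard, server):
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)


# Dry run for the source shard used in this report.
recover("fane_import_sharded_source", "-80")
```

Passing `dry_run=False` would execute the commands against the cluster, so the printed output is worth reviewing first.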

Binary Version

Vitess 16.0.3

Operating System and Environment details

n/a

Log Fragments

n/a

frouioui commented 1 year ago

@vitessio/vreplication

rohit-nayak-ps commented 5 months ago

Fixed via https://github.com/vitessio/vitess/pull/14008