local_metadata table left in inconsistent state between commits

The Problem

The changes made to the _vt.local_metadata table between https://github.com/vitessio/vitess/pull/4727 and https://github.com/vitessio/vitess/pull/4830 make it impossible to move back and forth between versions of vttablet without leaving _vt.local_metadata in an inconsistent state.

We were trying to move from this release, which doesn't have any of the aforementioned changes, to one that does. After the upgrade, the _vt.local_metadata records on our tablets looked something like this:

mysql> select * from _vt.local_metadata;
+---------------+-----------------------+---------------+
| name          | value                 | db_name       |
+---------------+-----------------------+---------------+
| Alias         | alias_1               | vt_keyspace   |
| ClusterAlias  | keyspace.-            | vt_keyspace   |
| DataCenter    | datacenter_1          | vt_keyspace   |
| PromotionRule | neutral               | vt_keyspace   |
+---------------+-----------------------+---------------+
4 rows in set (0.00 sec)

with a schema like:

CREATE TABLE `local_metadata` (
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  `db_name` varbinary(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`name`,`db_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4

After we found a separate regression with tablet configuration, we had to roll this build back, which left the same table looking like this:

+---------------+-----------------------+---------------+
| name          | value                 | db_name       |
+---------------+-----------------------+---------------+
| Alias         | alias_1               | vt_keyspace   |
| Alias         | alias_1               |               |
| ClusterAlias  | keyspace.-            | vt_keyspace   |
| ClusterAlias  | keyspace.-            |               |
| DataCenter    | datacenter_1          | vt_keyspace   |
| DataCenter    | datacenter_1          |               |
| PromotionRule | neutral               | vt_keyspace   |
| PromotionRule | neutral               |               |
+---------------+-----------------------+---------------+
8 rows in set (0.00 sec)

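Note that the old build inserted new entries into this table without specifying db_name, since it doesn't know about that column. Those rows therefore received the default value ('') defined in the schema, and because the primary key is (name, db_name) they were stored alongside the existing rows instead of replacing them.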
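The duplication described above can be reproduced in isolation. The following is a minimal sketch, assuming in-memory SQLite as a stand-in for MySQL (column types are simplified) and assuming the old build issues a plain INSERT that omits db_name; the exact statements vttablet runs are not shown in this report.

```python
import sqlite3

# In-memory SQLite stands in for MySQL; only the key layout and default matter here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE local_metadata (
        name    TEXT NOT NULL,
        value   TEXT NOT NULL,
        db_name TEXT NOT NULL DEFAULT '',
        PRIMARY KEY (name, db_name)
    )
""")

rows = [
    ("Alias", "alias_1"),
    ("ClusterAlias", "keyspace.-"),
    ("DataCenter", "datacenter_1"),
    ("PromotionRule", "neutral"),
]

# New build: db_name is written explicitly.
conn.executemany(
    "INSERT INTO local_metadata (name, value, db_name) VALUES (?, ?, 'vt_keyspace')",
    rows,
)

# Old build (assumed statement shape): db_name is omitted, so the default '' applies
# and the composite primary key sees these as new rows.
conn.executemany(
    "INSERT INTO local_metadata (name, value) VALUES (?, ?)",
    rows,
)

row_count = conn.execute("SELECT COUNT(*) FROM local_metadata").fetchone()[0]
empty_db_rows = conn.execute(
    "SELECT COUNT(*) FROM local_metadata WHERE db_name = ''"
).fetchone()[0]
print(row_count, empty_db_rows)  # prints: 8 4
```

Every metadata name ends up with two rows, one per db_name value, which matches the table shown after the rollback.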

We patched the tablet configuration regression on our end and tried to roll forward, but the tablets would not start because of the following error:

F0523 15:16:27.442595    6541 vttablet.go:130] NewActionAgent() failed: Duplicate entry 'Alias-vt_keyspace' for key 'PRIMARY' (errno 1062) (sqlstate 23000) during query: UPDATE _vt.local_metadata SET db_name='vt_keyspace' WHERE db_name=''
failed to -init_populate_metadata
agent.InitTablet failed

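It turns out that the unconditional UPDATE of the rows in this table causes a primary key violation that the tablet is unable to recover from: each row with an empty db_name is rewritten to db_name='vt_keyspace', but a row with the same name and that db_name already exists.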
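To make the failure mode concrete, here is a minimal sketch that replays the UPDATE from the error message against a conflicted table, again assuming in-memory SQLite as a stand-in for MySQL. The `migrate_guarded` helper is hypothetical and is not the fix adopted in Vitess; it only illustrates one way such an update could be made conditional.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE local_metadata (
        name    TEXT NOT NULL,
        value   TEXT NOT NULL,
        db_name TEXT NOT NULL DEFAULT '',
        PRIMARY KEY (name, db_name)
    )
""")
conn.executemany(
    "INSERT INTO local_metadata (name, value, db_name) VALUES (?, ?, ?)",
    [
        ("Alias", "alias_1", "vt_keyspace"),  # written by the new build
        ("Alias", "alias_1", ""),             # written by the old build after rollback
        ("DataCenter", "datacenter_1", ""),   # a row that exists only with db_name=''
    ],
)

# The unconditional update from the error message: it collides on ('Alias', 'vt_keyspace').
try:
    conn.execute("UPDATE local_metadata SET db_name = 'vt_keyspace' WHERE db_name = ''")
    update_failed = False
except sqlite3.IntegrityError:
    update_failed = True


def migrate_guarded(conn, db_name):
    """Hypothetical guard: drop '' rows that would collide, then update the rest."""
    conn.execute(
        "DELETE FROM local_metadata WHERE db_name = '' AND name IN "
        "(SELECT name FROM local_metadata WHERE db_name = ?)",
        (db_name,),
    )
    conn.execute("UPDATE local_metadata SET db_name = ? WHERE db_name = ''", (db_name,))


migrate_guarded(conn, "vt_keyspace")
remaining = conn.execute(
    "SELECT name, db_name FROM local_metadata ORDER BY name"
).fetchall()
print(update_failed, remaining)
# prints: True [('Alias', 'vt_keyspace'), ('DataCenter', 'vt_keyspace')]
```

Note that MySQL rejects a DELETE whose subquery reads the table being deleted from (error 1093), so the same idea there would need a derived table or a self-join.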

Proposed Workaround

Since this is preventing us from upgrading our tablets to the latest version, we intend to do the following to get this table back into the correct state:

1. Go into each master of the affected keyspace shards and drop the _vt.local_metadata table.
2. Wait for the table deletion to propagate to the replicas.
3. Restart the tablets in each shard, taking care to reparent away from the current master before restarting it.
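Before and after carrying out the steps above, it may help to check whether a tablet's table is in the conflicted state. The following is a minimal sketch of such a check, with in-memory SQLite standing in for the tablet's MySQL instance; `find_conflicts` is a hypothetical helper and not part of Vitess.

```python
import sqlite3


def find_conflicts(conn):
    """Return names stored both with db_name = '' and with a non-empty db_name."""
    cur = conn.execute(
        "SELECT name FROM local_metadata "
        "GROUP BY name "
        "HAVING SUM(db_name = '') > 0 AND SUM(db_name <> '') > 0 "
        "ORDER BY name"
    )
    return [row[0] for row in cur]


# Demo data only: a table where two names are duplicated and one is not.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE local_metadata (
        name    TEXT NOT NULL,
        value   TEXT NOT NULL,
        db_name TEXT NOT NULL DEFAULT '',
        PRIMARY KEY (name, db_name)
    )
""")
conn.executemany(
    "INSERT INTO local_metadata (name, value, db_name) VALUES (?, ?, ?)",
    [
        ("Alias", "alias_1", "vt_keyspace"),
        ("Alias", "alias_1", ""),
        ("ClusterAlias", "keyspace.-", "vt_keyspace"),
        ("ClusterAlias", "keyspace.-", ""),
        ("DataCenter", "datacenter_1", "vt_keyspace"),
    ],
)

conflicts = find_conflicts(conn)
print(conflicts)  # prints: ['Alias', 'ClusterAlias']
```

An empty result means the unconditional UPDATE would not hit a duplicate key on that tablet.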

This seems like the safest way to get this table into a consistent state without manually modifying its data and potentially messing up the replication stream. Since we use Orchestrator, the only concern is that its view of the affected shards would be inconsistent while we do the rolling restart of the tablets. However, that inconsistency should be short-lived and should only affect the Orchestrator GUI, not availability.

vitessio / vitess

local_metadata table left in inconsistent state between commits #4888