The Problem
With the changes introduced to the _vt.local_metadata table between https://github.com/vitessio/vitess/pull/4727 and https://github.com/vitessio/vitess/pull/4830, we are unable to go back and forth between versions of vttablet without putting _vt.local_metadata in an inconsistent state.
We were trying to move from this release, which doesn't have any of the aforementioned changes, to one that does contain them. When we did that, the state of our tablets' _vt.local_metadata records was something like this:
mysql> select * from _vt.local_metadata;
+---------------+-----------------------+---------------+
| name          | value                 | db_name       |
+---------------+-----------------------+---------------+
| Alias         | alias_1               | vt_keyspace   |
| ClusterAlias  | keyspace.-            | vt_keyspace   |
| DataCenter    | datacenter_1          | vt_keyspace   |
| PromotionRule | neutral               | vt_keyspace   |
+---------------+-----------------------+---------------+
4 rows in set (0.00 sec)
with a schema like:
CREATE TABLE `local_metadata` (
  `name` varchar(255) NOT NULL,
  `value` varchar(255) NOT NULL,
  `db_name` varbinary(255) NOT NULL DEFAULT '',
  PRIMARY KEY (`name`,`db_name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
After we found a separate regression with tablet configuration, we had to roll this build back. Note that the old build then inserted entries into this table without db_name specified, since it doesn't know about that column, and those entries fell back onto the default value provided in the schema, leaving them alongside the rows that already had db_name set.
We patched up the tablet configuration regression on our end and tried to roll forward, but tablets would not start due to the following error:
F0523 15:16:27.442595 6541 vttablet.go:130] NewActionAgent() failed: Duplicate entry 'Alias-vt_keyspace' for key 'PRIMARY' (errno 1062) (sqlstate 23000) during query: UPDATE _vt.local_metadata SET db_name='vt_keyspace' WHERE db_name=''
failed to -init_populate_metadata
agent.InitTablet failed
It turns out that the unconditional update of the rows in this table causes a primary key violation that the tablet is unable to recover from.
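To make the failure mode concrete, here is a minimal sketch of the sequence against the schema above; the INSERT statements are our assumption of roughly what each build writes, while the final UPDATE and the resulting error are taken from the log:

-- Written while running the new build, which knows about db_name:
INSERT INTO _vt.local_metadata (name, value, db_name)
VALUES ('Alias', 'alias_1', 'vt_keyspace');

-- Written after rolling back to the old build, which doesn't know about
-- db_name, so the column picks up its schema default of '':
INSERT INTO _vt.local_metadata (name, value)
VALUES ('Alias', 'alias_1');

-- On the next upgrade the tablet unconditionally backfills db_name, which
-- collides with the row that already has it set under the (name, db_name)
-- primary key:
UPDATE _vt.local_metadata SET db_name='vt_keyspace' WHERE db_name='';
-- ERROR 1062 (23000): Duplicate entry 'Alias-vt_keyspace' for key 'PRIMARY'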
Proposed Workaround
Since this is preventing us from upgrading our tablets to the latest version, we intend to do the following to get this table back into the correct state:
going into each master of the affected keyspace shards and dropping the _vt.local_metadata table (see the sketch after this list)
waiting for the table deletion to propagate to replicas
restarting tablets in each shard, taking care to do a reparent of the current master
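For the first two steps, a minimal sketch of what we would run, assuming a direct mysql session on each host (hostnames and shard names omitted):

-- On the master of each affected shard; the DROP replicates to the replicas:
DROP TABLE IF EXISTS _vt.local_metadata;

-- On each replica, before restarting tablets, confirm the drop has propagated:
SHOW TABLES FROM _vt LIKE 'local_metadata';
-- An empty result means the DDL has been applied on that replica.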
This seems like the safest way to get this table into a consistent state without manually modifying its data and potentially messing up the replication stream. Since we use orchestrator, the only concern is that its view of the affected shards would be inconsistent while we do the rolling restart of the tablets. However, it seems like that inconsistency would be short-lived and would only affect the orchestrator GUI, not availability.