vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0

Bug Report: Unique Constraints issues when consolidating shards with "hot" rows #14296

Open lizztheblizz opened 11 months ago

lizztheblizz commented 11 months ago

Overview of the Issue

When consolidating shards, either in the _reverse workflow created by running SwitchTraffic after moving from 2 to 4 shards, or during the normal operation of a MoveTables command moving from a 4-shard to a 2-shard keyspace, vreplication can break on unique key constraints in a way that is nearly impossible to recover from.

Scenario:

Because the vstreams for -40 and 40-80 execute independently of each other, when these streams are consolidated back into -80 it is very easy for one of them to fall behind the other and then fail perpetually. The error we observe is a unique constraint violation on the MySQL side.

A workaround for the above scenario could be to run the vstream sessions with SET unique_checks=0;. That should be safe in this case, because the secondary lookup vindex table still enforces uniqueness across shards on the source side. I'm not sure whether it would be considered safe in all unique constraint scenarios.
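
For reference, this is the session-scoped MySQL setting the workaround refers to; applying it to vreplication's own applier connection would require a change on the Vitess side, so the snippet below is only a sketch of the MySQL-level behavior:

    -- Sketch only: the session toggle the proposed workaround relies on.
    -- Note that unique_checks=0 merely permits the storage engine to skip
    -- uniqueness checks on secondary indexes; it is not a guarantee that
    -- duplicate values will be accepted (see the discussion further below).
    SET SESSION unique_checks = 0;
    -- ... row events applied here ...
    SET SESSION unique_checks = 1;  -- restore the default afterwards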

Reproduction Steps

  1. Using the following vschema:
    {
      "sharded": true,
      "vindexes": {
        "hash": {
          "type": "hash"
        },
        "uid_vdx": {
          "type": "consistent_lookup_unique",
          "params": {
            "from": "uid",
            "table": "ks.uid_vdx",
            "to": "keyspace_id"
          },
          "owner": "table1"
        }
      },
      "tables": {
        "table1": {
          "column_vindexes": [
            {
              "column": "id",
              "name": "hash"
            },
            {
              "column": "uid",
              "name": "uid_vdx"
            }
          ]
        },
        "uid_vdx": {
          "column_vindexes": [
            {
              "column": "uid",
              "name": "hash"
            }
          ]
        }
      }
    }
  2. Using the following schema:
    CREATE TABLE table1 (
    id int not null,
    uid int not null,
    primary key (id),
    unique key uid(uid)
    )

    Assume a sequence table exists in an unsharded keyspace to help populate id.
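
For completeness, here is a rough sketch of the auxiliary objects the steps above assume; the table names, column types, and cache size are illustrative, not taken from the report:

    -- Backing table for the consistent_lookup_unique vindex in keyspace ks
    -- (columns follow the vschema's "from"/"to" params; the keyspace_id type
    -- is the usual varbinary convention).
    CREATE TABLE uid_vdx (
      uid int NOT NULL,
      keyspace_id varbinary(128),
      PRIMARY KEY (uid)
    );

    -- Sequence backing table in an unsharded keyspace (name is illustrative);
    -- it would be referenced from an auto_increment entry for table1 in the
    -- vschema, which is not shown above.
    CREATE TABLE table1_seq (
      id bigint,
      next_id bigint,
      cache bigint,
      PRIMARY KEY (id)
    ) COMMENT 'vitess_sequence';

    INSERT INTO table1_seq (id, next_id, cache) VALUES (0, 1, 1000);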

Binary Version

2023-09-26.0a5a570 (v16.0.3)

Operating System and Environment details

PlanetScaleDB k8s environment

Log Fragments

The following vreplication streams exist for workflow ks2.Reshard_reverse:

id=23 on -80/gcp_uscentral1c_1-3344908643: Status: Running. VStream Lag: -1s. Tx time: Fri Oct 13 21:59:30 2023.
id=24 on -80/gcp_uscentral1c_1-3344908643: Status: Error: Duplicate entry '25' for key 'table1.uid' (errno 1062) (sqlstate 23000) during query: insert into table1(id,uid) values (3,25).
id=23 on 80-/gcp_uscentral1c_1-0264185158: Status: Running. VStream Lag: -1s. Tx time: Fri Oct 13 21:59:31 2023.
id=24 on 80-/gcp_uscentral1c_1-0264185158: Status: Running. VStream Lag: -1s. Tx time: Fri Oct 13 21:59:31 2023.
brendar commented 10 months ago

We've run into this issue as well. It caused reverse VReplication to break when we ran SwitchTraffic during a reshard from 1 to 4 shards because the application has a workflow which re-associates a unique identifier (in our case an email address) from one record to another. Slack thread here: https://vitess.slack.com/archives/C0PQY0PTK/p1689620260468139

rohit-nayak-ps commented 2 weeks ago

unique_checks=0 does not mean we can have rows which break the uniqueness constraint: https://bugs.mysql.com/bug.php?id=44180

I'm not seeing any good option for fixing this within Vitess. An external workaround would be to drop the unique keys explicitly and add them back after SwitchTraffic.
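
Concretely, that external workaround would look roughly like the following on the consolidated target, at the cost of losing enforcement for the duration of the cutover (illustrative sketch only):

    -- Before starting the consolidating workflow: drop the unique key on the target.
    ALTER TABLE table1 DROP KEY uid;

    -- ... run the MoveTables/Reshard workflow and SwitchTraffic ...

    -- After traffic is switched (and ideally after a clean VDiff), restore it:
    ALTER TABLE table1 ADD UNIQUE KEY uid (uid);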

lizztheblizz commented 1 week ago

This is likely the beginning of a broader feature request, but I wanted to add a quick response here:

Is there a world in which a single failing vstream, knowing it is managed by a broader workflow, could choose to skip an event that generates a unique constraint error like this? It should prompt a warning in the logs, and ideally also generate an entry in _vt.vreplication_log (or perhaps a separate, dedicated table), and fail the stream only after N events have failed consecutively for the same reason.

Tools like VDiff can help us confirm after the fact what, if any, damage was actually done by this "optimistic merging". And in the meantime, our MoveTables workflows would perhaps become a bit less fragile.

mattlord commented 1 week ago

My fear is that this would be TOO optimistic: you could wait N days to run a VDiff only to discover that there is real drift.

We could, e.g., turn an INSERT into an UPDATE and vice versa so that we try to make the write idempotent. There's still the general issue of having no total order, and a good chance that at the end we aren't eventually consistent, but as noted, VDiff can help there so that you're at least aware of the differences. You can then try to fix things manually or retry the migration.
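
As a very rough sketch of that idea, using the table from this issue (this is not what the vplayer emits today):

    -- If this INSERT fails with ER_DUP_KEY because a row with id = 3 already
    -- exists (e.g. the event was applied once before), ...
    INSERT INTO table1 (id, uid) VALUES (3, 25);

    -- ... it could be retried as an UPDATE of that same row, making the write
    -- idempotent. This does not help when the duplicate is on the secondary
    -- unique key `uid` and belongs to a *different* row, which is the ordering
    -- problem discussed in this issue.
    UPDATE table1 SET uid = 25 WHERE id = 3;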

rohit-nayak-ps commented 2 days ago

Repeating much of what was said above, for illustrative purposes ...

The problem seen here is that we don’t have global event ordering.

For illustration, let us consider the scenario brendar described above: an admins table with a unique email column, where the application re-associates an email address from one team's row to another's, and the two rows live on different shards.

The app executes the following to achieve the switch:

update admins set email = null where team_id = 2 (on Shard 2)
update admins set email = 'b@example.com' where team_id = 1 (on Shard 1)
update admins set email = 'a@example.com' where team_id = 2 (on Shard 2)

On the source we start with:

(1, a@example.com)
(2, b@example.com)

Let's say Shard 2's vplayer is lagging behind Shard 1's, so

update admins set email = 'b@example.com' where team_id = 1

is the first of these three queries executed on the source shard. This breaks the unique key constraint, resulting in an ER_DUP_KEY error.
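
To make the ordering problem concrete, here is the same illustration as plain MySQL statements against a single consolidated target; the admins schema is assumed for the example, not taken from the report:

    -- Assumed schema for the illustration above.
    CREATE TABLE admins (
      team_id int NOT NULL,
      email varchar(255),
      PRIMARY KEY (team_id),
      UNIQUE KEY email (email)
    );

    -- Starting state on the consolidated target.
    INSERT INTO admins (team_id, email) VALUES
      (1, 'a@example.com'),
      (2, 'b@example.com');

    -- Shard 2's stream is lagging, so Shard 1's update arrives first:
    UPDATE admins SET email = 'b@example.com' WHERE team_id = 1;
    -- Fails with a duplicate-key error: 'b@example.com' still exists on
    -- (2, 'b@example.com'), because the "set email = null where team_id = 2"
    -- event has not been applied yet.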

One approach is to delay the failing stream optimistically, so that the lagging stream has time to catch up. On getting an ER_DUP_KEY for a merge workflow (defined by 2 or more _vt.vreplication rows with the same workflow_name), we would trigger an optimistic wait loop.

We could wait for the pos of the other stream(s) to show progress before retrying, but that won't scale if we have a lot of target shards. The easier approach might be a short exponential backoff, starting with a small wait (100 ms?) and going up to, say, 5 seconds (or longer, since the alternative is a catastrophic failure, or we could base it on the vreplication lag of the other streams ...).

Of course, there will be situations where this might not work: for example, if there are lots of reassignments for a single team in a short time, as in our example above. But hopefully it would catch a significant number of such issues, allowing more workflows to proceed than do today.