vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.23k stars 2.07k forks source link

Bug Report: vtcombo's DBDDL plugin has a problematic race condition when creating databases #13116

Closed jspawar closed 11 months ago

jspawar commented 1 year ago

Overview of the Issue

vtcombo implementation of the DBDDL plugin's logic used for CREATE DATABASE queries via the CreateDatabase method and how it's used in conjunction with vtcombo has a race condition that can repeatably cause failures to actually create a database and associated vitess entities (e.g. tablets).

Specifically, vtcombo's implementation of globalCreateDb here tries to read and increment the uid variable which needs to be a unique value (for a given cell) which is a race condition as parallel attempts to create a database end up reading the same value for the uid variable and then try to create a tablet with the same UID which is not allowed.

Some example output indicating as such:

22:54:07.227162       7 syslogger.go:129] foobar_test_efDbWJRM [keyspace] created value:
22:54:07.227203       7 locks.go:245] Locking keyspace foobar_test_efDbWJRM for action CreateShard
22:54:07.227250       7 syslogger.go:129] foobar_test_efDbWJRM/0 [shard] created value: is_primary_serving:true
22:54:07.227255       7 locks.go:284] Unlocking keyspace foobar_test_efDbWJRM for successful action CreateShard
22:54:07.234657       7 tablet_map.go:90] Creating PRIMARY tablet test-0000000028 for foobar_test_efDbWJRM/0
...
22:54:07.257710       7 syslogger.go:129] foobar_test_zCKdvTpX [keyspace] created value:
22:54:07.258008       7 locks.go:245] Locking keyspace foobar_test_zCKdvTpX for action CreateShard
22:54:07.258269       7 syslogger.go:129] foobar_test_zCKdvTpX/0 [shard] created value: is_primary_serving:true
22:54:07.258384       7 locks.go:284] Unlocking keyspace foobar_test_zCKdvTpX for successful action CreateShard
...
22:54:07.278880       7 tablet_map.go:90] Creating PRIMARY tablet test-0000000028 for foobar_test_zCKdvTpX/0
...
22:54:07.285483       7 vtgate.go:607] Execute: initTablet failed because existing tablet keyspace and shard foobar_test_efDbWJRM/0 differ from the provided ones foobar_test_zCKdvTpX/0, request: map[BindVariables:map[] Session:autocommit:true options:{included_fields:ALL workload:OLTP} row_count:-1 DDLStrategy:"direct" SessionUUID:"65c55f97-f373-11ed-97bd-0242ac160003" enable_system_settings:true Sql:CREATE DATABASE `foobar_test_zCKdvTpX`]

Can see that CREATE DATABASE foobar_test_zCKdvTpX failed above in the step in globalCreateDb which creates a tablet in the topology server because it re-used the UID of 28.

I've tested (and will subsequently propose via PR) a solution which simply modifies vtcombo's DBDDL plugin implementation to use a mutex which locks before executing globalCreateDb and releases the lock after that completes which effectively guarantees that access and incrementing of the uid variable is atomic in turn preventing the issue. This seems to have fixed the issue for me.

Reproduction Steps

  1. Start vtcombo up with --dbddl_plugin vttest (any other config should have no effect I believe)

    • I did so via the vitess/vttestserver Docker image and the following Docker Compose config:

      ```yaml version: '3' services: vtcombo: depends_on: mysql: condition: service_healthy image: vitess/vttestserver:mysql80 platform: linux/amd64 command: - sh - -c - | vtcombo \ --logtostderr=true \ --mysql_server_bind_address 0.0.0.0 \ --mysql_server_port 3306 \ --mysql_auth_server_impl none \ --db_host localhost \ --db_socket /var/run/mysqld/mysqld.sock \ --db_app_user root \ --db_allprivs_user root \ --db_appdebug_user root \ --db_dba_user root \ --db_repl_user root \ --tablet_refresh_interval 100ms \ --proto_topo 'keyspaces: > ' \ # vtcombo failed to start up with at least one keyspace, which is maybe a separate "bug"? --dbddl_plugin vttest ports: - "3306:3306" volumes: - varrunmysqld:/var/run/mysqld mysql: image: mysql:8.0 platform: linux/amd64 command: - --default-authentication-plugin=mysql_native_password - --max_connections=2000 # vitess makes a lot of connections during tests environment: - MYSQL_ALLOW_EMPTY_PASSWORD=yes healthcheck: test: "mysql -e 'select 1'" interval: 30s timeout: 1s start_period: 30s volumes: - varrunmysqld:/var/run/mysqld volumes: varrunmysqld: ```
  2. A bit inelegant but: attempt to execute in parallel multiple CREATE DATABASE <distinct_database_name_each_time> queries
    • One way I was able to reliably reproduce that was by running:
      for i in $(seq 1 10); do mysql -u root -h 127.0.0.1 -P 3306 -e "create database foobar_${i}" & done
  3. Observe that at least one or multiple of those queries fail, citing the error mentioned earlier where initTablet fails
    • Error from client's perspective is of the form:

      Error 1105: initTablet failed because existing tablet keyspace and shard foobar_1/0 differ from the provided ones foobar_2/0

Binary Version

$ docker run -it -u root vitess/vttestserver:mysql80 bash
root@f05f921b3f0a:/# vtcombo --version
ERROR: logging before flag.Parse: E0518 19:20:36.152999       7 syslogger.go:149] can't connect to syslog
Version: 17.0.0-SNAPSHOT (Git revision 5d991ac827a37715d0fa8944d0c9817d63e40efa branch 'main') built on Thu May 11 22:55:23 UTC 2023 by vitess@buildkitsandbox using go1.20.3 linux/amd64
root@f05f921b3f0a:/# exit
$ docker inspect vitess/vttestserver:mysql80 | jq .[].RepoDigests
[
  "vitess/vttestserver@sha256:edb0549aacd321a89083f2d57101c1f37765617b4ba6d8f2f53e7f3af07f9a56"
]

Operating System and Environment details

$ docker run -it -u root vitess/vttestserver:mysql80 bash
root@3e9bf17e9eb2:/# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 10 (buster)"
NAME="Debian GNU/Linux"
VERSION_ID="10"
VERSION="10 (buster)"
VERSION_CODENAME=buster
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@3e9bf17e9eb2:/# uname -sr
Linux 5.15.49-linuxkit
root@3e9bf17e9eb2:/# uname -m
x86_64

Log Fragments

22:54:07.227162       7 syslogger.go:129] foobar_test_efDbWJRM [keyspace] created value:
22:54:07.227203       7 locks.go:245] Locking keyspace foobar_test_efDbWJRM for action CreateShard
22:54:07.227250       7 syslogger.go:129] foobar_test_efDbWJRM/0 [shard] created value: is_primary_serving:true
22:54:07.227255       7 locks.go:284] Unlocking keyspace foobar_test_efDbWJRM for successful action CreateShard
22:54:07.234657       7 tablet_map.go:90] Creating PRIMARY tablet test-0000000028 for foobar_test_efDbWJRM/0
...
22:54:07.257710       7 syslogger.go:129] foobar_test_zCKdvTpX [keyspace] created value:
22:54:07.258008       7 locks.go:245] Locking keyspace foobar_test_zCKdvTpX for action CreateShard
22:54:07.258269       7 syslogger.go:129] foobar_test_zCKdvTpX/0 [shard] created value: is_primary_serving:true
22:54:07.258384       7 locks.go:284] Unlocking keyspace foobar_test_zCKdvTpX for successful action CreateShard
...
22:54:07.278880       7 tablet_map.go:90] Creating PRIMARY tablet test-0000000028 for foobar_test_zCKdvTpX/0
...
22:54:07.285483       7 vtgate.go:607] Execute: initTablet failed because existing tablet keyspace and shard foobar_test_efDbWJRM/0 differ from the provided ones foobar_test_zCKdvTpX/0, request: map[BindVariables:map[] Session:autocommit:true options:{included_fields:ALL workload:OLTP} row_count:-1 DDLStrategy:"direct" SessionUUID:"65c55f97-f373-11ed-97bd-0242ac160003" enable_system_settings:true Sql:CREATE DATABASE `foobar_test_zCKdvTpX`]
jspawar commented 1 year ago

[Draft] PR to address this: https://github.com/vitessio/vitess/pull/13117