vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.48k stars 2.09k forks source link

InitShardMaster commands fails but still elects a master tablet #6390

Closed ankitthakwani closed 1 year ago

ankitthakwani commented 4 years ago

Issue: I bootstrapped a shard by initiating a tablet to be master using the command "vtctlclient -server ipmsservicemysql-ipms-v1-2021:15999 InitShardMaster -force oms_switch_unsharded/0 ctrls-100" The command failed with error but still promoted the tablet as master.

[vitess@ipmsservicemysql-ipms-v1-2021 vt]$ vttablet -version Version: bfd2ac4 (Git branch 'master') built on Wed Jun 24 12:03:05 IST 2020 by vitess@ipmsservicemysql-upi-v3-73138 using go1.14.1 linux/amd64

Steps to Reproduce the issue

STEP 1: Induce an issue by which replication will not be setup successfully, here I am setting up the password with length >32 character(limit for MySQL replication) Set up a vitess keyspace with 3 replica tablets.

[vitess@ipmsservicemysql-ipms-v1-2021 vt]$ grep vt_repl /data/vitessBaseDir/vitessCustomConfig/init_db.sql
  CREATE USER 'vt_repl'@'%' IDENTIFIED BY 'Pass@12333333333333333333333333333333333';
  GRANT REPLICATION SLAVE ON *.* TO 'vt_repl'@'%';

  [vitess@ipmsservicemysql-ipms-v1-2021 vt]$ mysql -uvt_repl -p -S $VTDATAROOT/vt_0000000102/mysql.sock -e "SHOW GRANTS"
  Enter password: 
  +-------------------------------------------------+
  | Grants for vt_repl@%                            |
  +-------------------------------------------------+
  | GRANT REPLICATION SLAVE ON *.* TO `vt_repl`@`%` |
  +-------------------------------------------------+

STEP 2: Elect a tablet as master from one of the three replica tablets

 [vitess@ipmsservicemysql-ipms-v1-2021 vt]$ vtctlclient -server ipmsservicemysql-ipms-v1-2021:15999 ListAllTablets ctrls
  ctrls-0000000100 oms_switch_unsharded 0 replica ipmsservicemysql-ipms-v1-2021:15100 ipmsservicemysql-ipms-v1-2021:17100 [] <null>
  ctrls-0000000101 oms_switch_unsharded 0 replica ipmsservicemysql-ipms-v1-2021:15101 ipmsservicemysql-ipms-v1-2021:17101 [] <null>
  ctrls-0000000102 oms_switch_unsharded 0 replica ipmsservicemysql-ipms-v1-2021:15102 ipmsservicemysql-ipms-v1-2021:17102 [] <null>

  [vitess@ipmsservicemysql-ipms-v1-2021 vt]$ date
  Sat Jun 27 02:11:13 IST 2020

  [vitess@ipmsservicemysql-ipms-v1-2021 vt]$ vtctlclient -server ipmsservicemysql-ipms-v1-2021:15999 InitShardMaster -force oms_switch_unsharded/0 ctrls-100 
  W0627 02:11:14.970841    6782 main.go:64] W0626 20:41:14.970231 reparent.go:185] master-elect tablet ctrls-0000000100 is not the shard master, proceeding anyway as -force was used
  W0627 02:11:14.971461    6782 main.go:64] W0626 20:41:14.970553 reparent.go:191] master-elect tablet ctrls-0000000100 is not a master in the shard, proceeding anyway as -force was used
  W0627 02:11:45.006198    6782 main.go:64] W0626 20:41:45.005962 reparent.go:295] master failed to PopulateReparentJournal, canceling slaves
  E0627 02:11:45.008878    6782 main.go:67] remote error: rpc error: code = Unknown desc = failed to PopulateReparentJournal on master: rpc error: code = DeadlineExceeded desc = context deadline exceeded

STEP 3: Observe that even though the InitShardMaster failed, the tablet is marked as master

[vitess@ipmsservicemysql-ipms-v1-2021 tmp]$ vtctlclient -server ipmsservicemysql-ipms-v1-2021:15999 ListAllTablets ctrls
  ctrls-0000000100 oms_switch_unsharded 0 master ipmsservicemysql-ipms-v1-2021:15100 ipmsservicemysql-ipms-v1-2021:17100 [] 2020-06-26T20:41:15Z
  ctrls-0000000101 oms_switch_unsharded 0 replica ipmsservicemysql-ipms-v1-2021:15101 ipmsservicemysql-ipms-v1-2021:17101 [] <null>
  ctrls-0000000102 oms_switch_unsharded 0 replica ipmsservicemysql-ipms-v1-2021:15102 ipmsservicemysql-ipms-v1-2021:17102 [] <null>

  [vitess@ipmsservicemysql-ipms-v1-2021 tmp]$ vtctlclient -server ipmsservicemysql-ipms-v1-2021:15999 ShardReplicationPositions oms_switch_unsharded/0
  ctrls-0000000102 oms_switch_unsharded 0 replica ipmsservicemysql-ipms-v1-2021:15102 ipmsservicemysql-ipms-v1-2021:17102 [] <null> <err> <err> <err>
  ctrls-0000000101 oms_switch_unsharded 0 replica ipmsservicemysql-ipms-v1-2021:15101 ipmsservicemysql-ipms-v1-2021:17101 [] <null> <err> <err> <err>
  ctrls-0000000100 oms_switch_unsharded 0 master ipmsservicemysql-ipms-v1-2021:15100 ipmsservicemysql-ipms-v1-2021:17100 [] 2020-06-26T20:41:15Z MySQL56/31fec863-b7ed-11ea-873b-f898ef71c489:1-2 0

  [vitess@ipmsservicemysql-ipms-v1-2021 tmp]$ vtctlclient -server ipmsservicemysql-ipms-v1-2021:15999 GetShard oms_switch_unsharded/0
  {
    "master_alias": {
      "cell": "ctrls",
      "uid": 100
    },
    "master_term_start_time": {
      "seconds": "1593204075",
      "nanoseconds": 432330
    },
    "key_range": null,
    "served_types": [
    ],
    "source_shards": [
    ],
    "tablet_controls": [
    ],
    "is_master_serving": true
  }

VTCTLD LOGS

I0627 02:11:07.180269    3414 zk_conn.go:336] zk conn: session for addr ipmsservicemysql-ipms-v1-2021:21811,ipmsservicemysql-ipms-v1-2021:21812,ipmsservicemysql-ipms-v1-2021:21813 event: {EventSession StateHasSession  <nil> 10.100.18.21:21813}
  I0627 02:11:14.965497    3414 locks.go:354] Locking shard oms_switch_unsharded/0 for action InitShardMaster(ctrls-0000000100)
  W0627 02:11:14.970270    3414 reparent.go:185] master-elect tablet ctrls-0000000100 is not the shard master, proceeding anyway as -force was used
  W0627 02:11:14.970570    3414 reparent.go:191] master-elect tablet ctrls-0000000100 is not a master in the shard, proceeding anyway as -force was used
  I0627 02:11:14.970688    3414 reparent.go:222] resetting replication on tablet ctrls-0000000101
  I0627 02:11:14.970808    3414 reparent.go:222] resetting replication on tablet ctrls-0000000102
  I0627 02:11:14.970905    3414 reparent.go:222] resetting replication on tablet ctrls-0000000100
  I0627 02:11:14.988760    3414 reparent.go:241] initializing master on ctrls-0000000100
  I0627 02:11:15.005938    3414 reparent.go:281] initializing slave ctrls-0000000101
  I0627 02:11:15.006043    3414 reparent.go:281] initializing slave ctrls-0000000102
  I0627 02:11:15.006622    3414 reparent.go:274] populating reparent journal on new master ctrls-0000000100
  W0627 02:11:45.005995    3414 reparent.go:295] master failed to PopulateReparentJournal, canceling slaves
  I0627 02:11:45.006111    3414 locks.go:391] Unlocking shard oms_switch_unsharded/0 for action InitShardMaster(ctrls-0000000100) with error failed to PopulateReparentJournal on master: rpc error: code = DeadlineExceeded desc = context deadline exceeded

VTTABLET Replica LOGS

I0627 02:10:56.653956    6425 replication_reporter.go:73] Failed to reconnect to master: no master tablet for shard oms_switch_unsharded/0
  I0627 02:11:01.655604    6425 replication_reporter.go:70] Slave is stopped. Trying to reconnect to master...
  I0627 02:11:01.655896    6425 replication_reporter.go:73] Failed to reconnect to master: no master tablet for shard oms_switch_unsharded/0
  I0627 02:11:06.656786    6425 replication_reporter.go:70] Slave is stopped. Trying to reconnect to master...
  I0627 02:11:06.657318    6425 replication_reporter.go:73] Failed to reconnect to master: no master tablet for shard oms_switch_unsharded/0
  I0627 02:11:11.657937    6425 replication_reporter.go:70] Slave is stopped. Trying to reconnect to master...
  I0627 02:11:11.658266    6425 replication_reporter.go:73] Failed to reconnect to master: no master tablet for shard oms_switch_unsharded/0
  I0627 02:11:14.976881    6425 query.go:73] exec STOP SLAVE
  I0627 02:11:14.977032    6425 query.go:73] exec RESET SLAVE ALL
  I0627 02:11:14.979485    6425 query.go:73] exec RESET MASTER
  I0627 02:11:14.987971    6425 query.go:73] exec SET GLOBAL rpl_semi_sync_master_enabled = false, GLOBAL rpl_semi_sync_slave_enabled = false
  I0627 02:11:14.988220    6425 rpc_server.go:95] TabletManager.ResetReplication()(on ctrls-0000000102 from ): (*tabletmanagerdata.ResetReplicationResponse)(nil)
  I0627 02:11:15.007818    6425 replication.go:419] Setting semi-sync mode: master=false, slave=true
  I0627 02:11:15.007953    6425 query.go:73] exec SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 1
  I0627 02:11:15.008181    6425 replication.go:272] Executing commands to set slave position: [RESET MASTER SET GLOBAL gtid_purged = '31fec863-b7ed-11ea-873b-f898ef71c489:1-2']
  I0627 02:11:15.008190    6425 query.go:73] exec RESET MASTER
  I0627 02:11:15.015615    6425 query.go:73] exec SET GLOBAL gtid_purged = '31fec863-b7ed-11ea-873b-f898ef71c489:1-2'
  I0627 02:11:15.015996    6425 query.go:73] exec CHANGE MASTER TO
    MASTER_HOST = 'ipmsservicemysql-ipms-v1-2021',
    MASTER_PORT = 17100,
    MASTER_USER = 'vt_repl',
    MASTER_PASSWORD = '****************************************',
    MASTER_CONNECT_RETRY = 1,
    MASTER_AUTO_POSITION = 1
  W0627 02:11:15.016128    6425 rpc_server.go:91] TabletManager.InitSlave(parent:<cell:"ctrls" uid:100 > replication_position:"MySQL56/31fec863-b7ed-11ea-873b-f898ef71c489:1-2" time_created_ns:1593204075005842919 )(on ctrls-0000000102 from ) error: ExecuteFetch(CHANGE MASTER TO
    MASTER_HOST = 'ipmsservicemysql-ipms-v1-2021',
    MASTER_PORT = 17100,
    MASTER_USER = 'vt_repl',
    MASTER_PASSWORD = '****************************************',
    MASTER_CONNECT_RETRY = 1,
    MASTER_AUTO_POSITION = 1) failed: The password provided for the replication user exceeds the maximum length of 32 characters (errno 3056) (sqlstate HY000) during query: CHANGE MASTER TO
    MASTER_HOST = 'ipmsservicemysql-ipms-v1-2021',
    MASTER_PORT = 17100,
    MASTER_USER = 'vt_repl',
    MASTER_PASSWORD = '****************************************',
    MASTER_CONNECT_RETRY = 1,
    MASTER_AUTO_POSITION = 1
  I0627 02:11:16.658917    6425 replication_reporter.go:70] Slave is stopped. Trying to reconnect to master...
  I0627 02:11:16.660926    6425 replication_reporter.go:73] Failed to reconnect to master: error calling Orchestrator API: Get "http://orchestrator-service:4040/api/audit-recovery/alias/oms_switch_unsharded.0": dial tcp: lookup orchestrator-service on 10.100.102.21:53: no such host

VTTABLET MASTER LOGS

I0627 02:11:06.464327    4414 replication_reporter.go:70] Slave is stopped. Trying to reconnect to master...
 I0627 02:11:06.464869    4414 replication_reporter.go:73] Failed to reconnect to master: no master tablet for shard oms_switch_unsharded/0
 I0627 02:11:11.465512    4414 replication_reporter.go:70] Slave is stopped. Trying to reconnect to master...
 I0627 02:11:11.465795    4414 replication_reporter.go:73] Failed to reconnect to master: no master tablet for shard oms_switch_unsharded/0
 I0627 02:11:14.977026    4414 query.go:73] exec STOP SLAVE
 I0627 02:11:14.977179    4414 query.go:73] exec RESET SLAVE ALL
 I0627 02:11:14.978443    4414 query.go:73] exec RESET MASTER
 I0627 02:11:14.987971    4414 query.go:73] exec SET GLOBAL rpl_semi_sync_master_enabled = false, GLOBAL rpl_semi_sync_slave_enabled = false
 I0627 02:11:14.988137    4414 rpc_server.go:95] TabletManager.ResetReplication()(on ctrls-0000000100 from ): (*tabletmanagerdata.ResetReplicationResponse)(nil)
 I0627 02:11:14.989856    4414 query.go:73] exec CREATE DATABASE IF NOT EXISTS _vt
 I0627 02:11:14.990112    4414 query.go:73] exec CREATE TABLE IF NOT EXISTS _vt.reparent_journal (
   time_created_ns BIGINT UNSIGNED NOT NULL,
   action_name VARBINARY(250) NOT NULL,
   master_alias VARBINARY(32) NOT NULL,
   replication_position VARBINARY(64000) DEFAULT NULL,
   PRIMARY KEY (time_created_ns))
 ENGINE=InnoDB
 I0627 02:11:14.999536    4414 replication.go:419] Setting semi-sync mode: master=true, slave=true
 I0627 02:11:14.999627    4414 query.go:73] exec SET GLOBAL rpl_semi_sync_master_enabled = 1, GLOBAL rpl_semi_sync_slave_enabled = 1
 I0627 02:11:15.000324    4414 query.go:73] exec SET GLOBAL read_only = OFF
 I0627 02:11:15.001970    4414 state_change.go:174] Running tablet callback because: ChangeType
 I0627 02:11:15.002018    4414 shard_sync.go:76] Change to tablet state
 I0627 02:11:15.002382    4414 tabletserver.go:351] TabletServer state: NOT_SERVING (Not Connected) -> NOT_SERVING (Transitioning)
 E0627 02:11:15.003481    4414 tabletserver.go:538] error creating db app connection: Unknown database 'vt_oms_switch_unsharded' (errno 1049) (sqlstate 42000)
 I0627 02:11:15.003550    4414 tx_engine.go:389] No grace period specified: performing normal wait.
 I0627 02:11:15.003564    4414 tabletserver.go:351] TabletServer state: NOT_SERVING (Transitioning) -> NOT_SERVING (Not Connected)
 E0627 02:11:15.003572    4414 state_change.go:310] Can't start query service for MASTER+REPLICA mode: Unknown database 'vt_oms_switch_unsharded' (errno 1049) (sqlstate 42000)
 I0627 02:11:15.003587    4414 tabletserver.go:351] TabletServer state: NOT_SERVING (Not Connected) -> NOT_SERVING (Transitioning)
 E0627 02:11:15.003932    4414 tabletserver.go:538] error creating db app connection: Unknown database 'vt_oms_switch_unsharded' (errno 1049) (sqlstate 42000)
 I0627 02:11:15.003988    4414 tx_engine.go:389] No grace period specified: performing normal wait.
 I0627 02:11:15.004000    4414 tabletserver.go:351] TabletServer state: NOT_SERVING (Transitioning) -> NOT_SERVING (Not Connected)
 E0627 02:11:15.004005    4414 state_change.go:324] Cannot start query service: Unknown database 'vt_oms_switch_unsharded' (errno 1049) (sqlstate 42000)
 I0627 02:11:15.004027    4414 engine.go:147] Starting VReplication engine
 E0627 02:11:15.004409    4414 state_change.go:367] Could not start VReplication engine: error in connecting to mysql db with connection <nil>, err Unknown database 'vt_oms_switch_unsharded' (errno 1049) (sqlstate 42000). Will keep retrying at health check intervals.
 I0627 02:11:15.004444    4414 state_change.go:384] Publishing state: alias:<cell:"ctrls" uid:100 > hostname:"ipmsservicemysql-ipms-v1-2021" port_map:<key:"grpc" value:16100 > port_map:<key:"mysql" value:17100 > port_map:<key:"vt" value:15100 > keyspace:"oms_switch_unsharded" shard:"0" type:MASTER mysql_hostname:"ipmsservicemysql-ipms-v1-2021" mysql_port:17100 master_term_start_time:<seconds:1593204075 nanoseconds:432330 > 
 I0627 02:11:15.004489    4414 engine.go:249] Shutting down VReplication engine
 I0627 02:11:15.004520    4414 shard_sync.go:76] Change to tablet state
 I0627 02:11:15.005519    4414 rpc_server.go:95] TabletManager.InitMaster()(on ctrls-0000000100 from ): (*tabletmanagerdata.InitMasterResponse)(nil)
 I0627 02:11:15.005748    4414 shard_sync.go:172] Updating shard record: master_alias=ctrls-0000000100, master_term_start_time=2020-06-27 02:11:15.00043233 +0530 IST m=+73.728748282
 I0627 02:11:15.006588    4414 shard_watcher.go:39] Starting shard watch of oms_switch_unsharded/0
 I0627 02:11:15.007953    4414 query.go:73] exec CREATE DATABASE IF NOT EXISTS _vt
 I0627 02:11:16.466180    4414 tabletserver.go:351] TabletServer state: NOT_SERVING (Not Connected) -> NOT_SERVING (Transitioning)
 E0627 02:11:16.466803    4414 tabletserver.go:538] error creating db app connection: Unknown database 'vt_oms_switch_unsharded' (errno 1049) (sqlstate 42000)
 I0627 02:11:16.466914    4414 tx_engine.go:389] No grace period specified: performing normal wait.
 I0627 02:11:16.466929    4414 tabletserver.go:351] TabletServer state: NOT_SERVING (Transitioning) -> NOT_SERVING (Not Connected)
 I0627 02:11:16.466937    4414 engine.go:147] Starting VReplication engine
 I0627 02:11:16.467327    4414 engine.go:249] Shutting down VReplication engine
 I0627 02:11:21.467451    4414 tabletserver.go:351] TabletServer state: NOT_SERVING (Not Connected) -> NOT_SERVING (Transitioning)
 E0627 02:11:21.468108    4414 tabletserver.go:538] error creating db app connection: Unknown database 'vt_oms_switch_unsharded' (errno 1049) (sqlstate 42000)
 I0627 02:11:21.468190    4414 tx_engine.go:389] No grace period specified: performing normal wait.
 I0627 02:11:21.468205    4414 tabletserver.go:351] TabletServer state: NOT_SERVING (Transitioning) -> NOT_SERVING (Not Connected). 
deepthi commented 4 years ago

From slack thread: rerunning ISM after fixing the underlying mysql error produces a different error:

E0627 02:51:46.989567   20095 main.go:67] remote error: rpc error: code = Unknown desc = tablet ctrls-0000000100 ResetReplication failed (either fix it, or Scrap it): rpc error: code = DeadlineExceeded desc = context deadline exceeded
mattlord commented 1 year ago

@GuptaManan100 is this still relevant or should we close it?

GuptaManan100 commented 1 year ago

Yes, we can close it 👍