vitessio / vitess

Vitess is a database clustering system for horizontal scaling of MySQL.
http://vitess.io
Apache License 2.0
18.19k stars 2.06k forks source link

vtorc/vttablet: can't downgrade from v20 to v19 #16300

Open derekperkins opened 1 week ago

derekperkins commented 1 week ago

Overview of the Issue

We're seeing vttablet OOM incredibly fast on v20.0.0 for some reason, after running fine for a couple weeks. We attempted to downgrade to v19.0.4 to see if that changed anything, but the primary was unable to start. vtorc attempted to recover UndoDemotePrimary and couldn't ever succeed.

SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0) failed: Unknown system variable 'rpl_semi_sync_master_enabled'

When I reverted that change back to v20.0.0, vtorc was able to successfully run UndoDemotePrimary

Related issues:

Reproduction Steps

This was tested on a single node keyspace with only a single tablet.

Binary Version

v20.0.0 for most components
downgrading vttablet to v19.0.4

Operating System and Environment details

GKE v1.29

Log Fragments

I0628 23:33:10.017377       1 vtorc.go:205] Force discovered - &{Hostname:uscentral1-keywordsu-0-replica-c-0.vttablet Port:3306 InstanceAlias:uscentral1-0288422300 ServerID:1511431603 ServerUUID:e99beb6f-5ea1-11ee-9732-76800cdb5542 Version:8.0.36 VersionComment:Percona Server (GPL), Release '28', Revision '47601f19'$ FlavorName: ReadOnly:true BinlogFormat:ROW BinlogRowImage:FULL LogBinEnabled:true LogReplicationUpdatesEnabled:true SelfBinlogCoordinates:vt-0288422300-bin.033393:317 SourceHost: SourcePort:0 SourceUUID: AncestryUUID:e99beb6f-5ea1-11ee-9732-76800cdb5542 ReplicaNetTimeout:0 HeartbeatInterval:0 ReplicationSQLThreadRuning:false ReplicationIOThreadRuning:false ReplicationSQLThreadState:-1 ReplicationIOThreadState:-1 HasReplicationFilters:false GTIDMode:ON SupportsOracleGTID:true UsingOracleGTID:false UsingMariaDBGTID:false UsingPseudoGTID:false ReadBinlogCoordinates::0 ExecBinlogCoordinates::0 IsDetached:false RelaylogCoordinates::0 LastSQLError: LastIOError: SecondsBehindPrimary:{Int64:0 Valid:false} SQLDelay:0 ExecutedGtidSet:0c4b8ef3-253a-11ec-98d9-3235e408d1e5:1-609569,14e91046-1cdc-11ed-a65f-d60bf275c515:1-1589772,28d7debe-6d30-11ec-9696-826bd9c444b9:1-32830237,4dda5b4e-0ea6-11ed-a83b-c2d1e4afc67b:1-111205389,684e23fe-0eb4-11ed-a604-b6763a422c36:1-3484661,a351f01d-fb7d-11ed-a660-bee2b573f3bf:1-39363505,a5ef9c7b-0eac-11ed-8bbe-46781259a843:1-75497684,a8de8790-6c23-11ed-af0a-62dad3b25097:1-71206353,bd907242-5ea4-11ee-b501-aa7ed7190bc9:1-2490747,cd9e16c9-2539-11ec-9499-2a7ac907fc94:1-282508005,e99beb6f-5ea1-11ee-9732-76800cdb5542:1-1350063514 GtidPurged:0c4b8ef3-253a-11ec-98d9-3235e408d1e5:1-609569,14e91046-1cdc-11ed-a65f-d60bf275c515:1-1589772,28d7debe-6d30-11ec-9696-826bd9c444b9:1-32830237,4dda5b4e-0ea6-11ed-a83b-c2d1e4afc67b:1-111205389,684e23fe-0eb4-11ed-a604-b6763a422c36:1-3484661,a351f01d-fb7d-11ed-a660-bee2b573f3bf:1-39363505,a5ef9c7b-0eac-11ed-8bbe-46781259a843:1-75497684,a8de8790-6c23-11ed-af0a-62dad3b25097:1-71206353,bd907242-5ea4-11ee-b501-aa7ed7190bc9:1-2490747,cd9e16c9-2539-11ec-9499-2a7ac907fc94:1-282508005,e99beb6f-5ea1-11ee-9732-76800cdb5542:1-1345434303 GtidErrant: primaryExecutedGtidSet: ReplicationLagSeconds:{Int64:0 Valid:false} DataCenter:uscentral1 Region: PhysicalEnvironment: ReplicationDepth:0 IsCoPrimary:false HasReplicationCredentials:false SemiSyncEnforced:false SemiSyncPrimaryEnabled:false SemiSyncReplicaEnabled:false SemiSyncPrimaryTimeout:0 SemiSyncPrimaryWaitForReplicaCount:0 SemiSyncPrimaryStatus:false SemiSyncPrimaryClients:0 SemiSyncReplicaStatus:false LastSeenTimestamp: IsLastCheckValid:true IsUpToDate:true IsRecentlyChecked:true SecondsSinceLastSeen:{Int64:0 Valid:false} AllowTLS:false Problems:[] LastDiscoveryLatency:10.705519ms}, err - <nil>
I0628 23:33:10.017476       1 locks.go:455] Unlocking shard keywordsu/0 for action VTOrc Recovery for PrimaryIsReadOnly on uscentral1-0288422300 with error rpc error: code = Unknown desc = TabletManager.UndoDemotePrimary on uscentral1-0288422300: can't set semi-sync mode: ExecuteFetch(SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0) failed: Unknown system variable 'rpl_semi_sync_master_enabled' (errno 1193) (sqlstate HY000) during query: SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0; make sure plugins are loaded in my.cnf
E0628 23:33:10.022063       1 topology_recovery.go:654] rpc error: code = Unknown desc = TabletManager.UndoDemotePrimary on uscentral1-0288422300: can't set semi-sync mode: ExecuteFetch(SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0) failed: Unknown system variable 'rpl_semi_sync_master_enabled' (errno 1193) (sqlstate HY000) during query: SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0; make sure plugins are loaded in my.cnf
deepthi commented 5 days ago

@derekperkins what is your MySQL version? That is important in this case. The PR where we adopted the new commands is https://github.com/vitessio/vitess/pull/15907 and the way it's written, we'd expect this to work. We probably also need to see your cnf file because code thinks the semi-sync plugin has not been loaded.

derekperkins commented 5 days ago

We aren't using semi-sync, with durability-policy=none

Percona Server v8.0.36

# ENGINE SETTINGS #
default_storage_engine          = InnoDB
default-tmp-storage-engine      = InnoDB

# OTHER CONFIG #
default_authentication_plugin   = mysql_native_password
secure_file_priv                = NULL
explicit_defaults_for_timestamp = 1
group_concat_max_len            = 4M
event_scheduler                 = 0
symbolic-links                  = 0
character-set-server            = utf8mb4
collation-server                = utf8mb4_unicode_ci
binlog_format                   = ROW
binlog_row_image                = full
binlog_expire_logs_seconds      = 259200 # 3 days
sync-binlog                     = 0
binlog-transaction-compression  = ON
log-error-suppression-list      = MY-013360
slow_query_log                  = OFF

# PERCONA
binlog_space_limit              = 10G
userstat                        = OFF
long_query_time                 = 0
log_slow_rate_type              = query
log_slow_verbosity              = full
log_slow_rate_limit             = 100
max_slowlog_size                = 1G
slow_query_log_always_write_time  = 2
slow_query_log_use_global_control = all

# CACHES AND LIMITS #
tmp-table-size                  = 32M
max-heap-table-size             = 32M
max-connections                 = 2500
thread-cache-size               = 50
open-files-limit                = 65535
table-definition-cache          = 4096
table-open-cache                = 4096

# INNODB #
innodb-flush-method             = O_DIRECT
innodb-log-files-in-group       = 2
innodb-log-file-size            = 4G
innodb-flush-log-at-trx-commit  = 2
innodb_buffer_pool_instances    = 10
innodb_buffer_pool_chunk_size   = 1G
innodb-buffer-pool-size         = 10G
innodb_lock_wait_timeout        = 300
innodb_io_capacity              = 2000
innodb_io_capacity_max          = 4000
innodb_lru_scan_depth           = 2000
innodb_flush_neighbors          = 0
innodb_read_io_threads          = 16
innodb_write_io_threads         = 16
innodb_purge_threads            = 4
deepthi commented 5 days ago

OK. attempting to set semi-sync properties with a durability policy of "none" is clearly a bug. We'll need to fix this, and backport to release-20.0

derekperkins commented 5 days ago

v19 ran fine with these settings for months, and v20 runs fine now, I just wasn't able to downgrade to v19

derekperkins commented 5 days ago

And FWIW, I don't need to downgrade anymore, so this isn't urgent from my perspective.

GuptaManan100 commented 5 days ago

Hello @derekperkins! This problem seems to be occurring because of https://github.com/vitessio/vitess/pull/15791. I don't see anything else that changed between v20 and v19.

In order to debug this properly could you tell me the outputs of running the following 2 queries in your MySQL -

SHOW VARIABLES LIKE 'rpl_semi_sync_%_enabled'
SELECT COUNT(*) > 0 AS plugin_loaded FROM information_schema.plugins WHERE plugin_name LIKE 'rpl_semi_sync%'
deepthi commented 4 days ago

this isn't urgent from my perspective.

We are treating this as semi-urgent because it breaks upgrade/downgrade for the "none" DurabilityPolicy

derekperkins commented 4 days ago
+-------------------------------+-------+
| Variable_name                 | Value |
+-------------------------------+-------+
| rpl_semi_sync_replica_enabled | OFF   |
| rpl_semi_sync_source_enabled  | OFF   |
+-------------------------------+-------+

+---------------+
| plugin_loaded |
+---------------+
| 1             |
+---------------+
GuptaManan100 commented 3 days ago

Okay, this is very interesting. From the outputs that you show, it looks like the plugin for semi-sync is loaded. So, the code in v19, checks that the plugin is loaded, and if it is, then it sets the semi-sync settings with SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0 In v20, we added the code to change the semi-sync command based on which plugin is loaded but that doesn't exist in v19. Did you by any chance upgrade your MySQL version too, because according to the docs -

From MySQL 8.0.26, new versions of the source and replica plugins are supplied, which replace the terms “master” and “slave” with “source” and “replica” in system variables and status variables. 
If you install the new rpl_semi_sync_source and rpl_semi_sync_replica plugins, the new system variables and status variables are available but the old ones are not. 
If you install the old rpl_semi_sync_master and rpl_semi_sync_slave plugins, the old system variables and status variables are available but the new ones are not. 
You cannot have both the new and the old version of the relevant plugin installed on an instance.

After upgrading to v20, if you upgrade your MySQL version to 8.0.26 or above, then I guess it is not possible to downgrade back to v19.

derekperkins commented 3 days ago

We didn't change MySQL versions at any point during this upgrade. We upgraded from v8.0.34 to v8.0.36 while on v19. We've been >= 8.0.26 since Nov 2021 and Vitess v11

GuptaManan100 commented 3 days ago

Did you by any chance change the plugin that was being loaded?

plugin-load = rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so

VS

plugin-load = rpl_semi_sync_source=semisync_source.so;rpl_semi_sync_replica=semisync_replica.so
GuptaManan100 commented 3 days ago

Oh, maybe you didn't do anything explicitly and it just happened implicitly because mysqlctl after the changes in https://github.com/vitessio/vitess/pull/15791 would start using the new mycnf file for versions over 8.0.26 that loads the new plugin. Let me look into when we use a my.cnf file. Maybe there is a flow that causes Vitess to start using the new my.cnf file which loads the new plugin, but when you downgrade, then the old my.cnf file doesn't take effect 🤷‍♂️. I'll look into when we re-initalize my.cnf files and maybe that will tell us more.

For an immediate workaround though, for anyone facing this issue. To downgrade, just load the old plugins instead of the new ones in my.cnf and restart mysql so that they take effect.

GuptaManan100 commented 3 days ago

I looked into this further today, and here is what I found. We reinitialize the my.cnf file in the following cases -

  1. init, init-config, reinit_config etc commands of mysqlctl.
  2. Whenever mysqlctld is started, it either creates the new my.cnf file or it updates the old one incase the generated one is different
  3. Whenever a restore happens, and we remove all the old files and reinitialize mysql that causes a new my.cnf to be generated.

So, what I believe happened is as follows -

  1. You were running all the vitess components at v19 and the my.cnf was loading the old plugin values and everything was working as intended.
  2. Then you upgraded everything to v20, (including mysqlctl I'm assuming) and that would have caused a change to the my.cnf file to start using the new plugin values. Since v20 vttablet supports both the plugin variants, everything was still working fine.
  3. Some issue caused you to downgrade vttablet to v19 (but not mysqlctl, I'm assuming), and when this happened, your my.cnf was still using the new plugin, and v19 of vttablet doesn't support it!

@derekperkins Could you let me know if ☝️ if this is the correct sequence of operations?

If it is, then the fix is to also downgrade mysqlctl which would change the my.cnf to use old plugins again. We'll have to backport some part of https://github.com/vitessio/vitess/pull/15791 to v19 and do a patch or we'll have to do some changes in v20 so that it doesn't use the new plugins immediately and reintroduce the change in v21.