Open derekperkins opened 1 week ago
@derekperkins what is your MySQL version? That is important in this case. The PR where we adopted the new commands is https://github.com/vitessio/vitess/pull/15907 and, the way it's written, we'd expect this to work. We probably also need to see your cnf file, because the code thinks the semi-sync plugin has not been loaded.
We aren't using semi-sync, with durability-policy=none
Percona Server v8.0.36
# ENGINE SETTINGS #
default_storage_engine = InnoDB
default-tmp-storage-engine = InnoDB
# OTHER CONFIG #
default_authentication_plugin = mysql_native_password
secure_file_priv = NULL
explicit_defaults_for_timestamp = 1
group_concat_max_len = 4M
event_scheduler = 0
symbolic-links = 0
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
binlog_format = ROW
binlog_row_image = full
binlog_expire_logs_seconds = 259200 # 3 days
sync-binlog = 0
binlog-transaction-compression = ON
log-error-suppression-list = MY-013360
slow_query_log = OFF
# PERCONA
binlog_space_limit = 10G
userstat = OFF
long_query_time = 0
log_slow_rate_type = query
log_slow_verbosity = full
log_slow_rate_limit = 100
max_slowlog_size = 1G
slow_query_log_always_write_time = 2
slow_query_log_use_global_control = all
# CACHES AND LIMITS #
tmp-table-size = 32M
max-heap-table-size = 32M
max-connections = 2500
thread-cache-size = 50
open-files-limit = 65535
table-definition-cache = 4096
table-open-cache = 4096
# INNODB #
innodb-flush-method = O_DIRECT
innodb-log-files-in-group = 2
innodb-log-file-size = 4G
innodb-flush-log-at-trx-commit = 2
innodb_buffer_pool_instances = 10
innodb_buffer_pool_chunk_size = 1G
innodb-buffer-pool-size = 10G
innodb_lock_wait_timeout = 300
innodb_io_capacity = 2000
innodb_io_capacity_max = 4000
innodb_lru_scan_depth = 2000
innodb_flush_neighbors = 0
innodb_read_io_threads = 16
innodb_write_io_threads = 16
innodb_purge_threads = 4
OK. Attempting to set semi-sync properties with a durability policy of "none" is clearly a bug. We'll need to fix this and backport the fix to release-20.0.
v19 ran fine with these settings for months, and v20 runs fine now. I just wasn't able to downgrade to v19.
And FWIW, I don't need to downgrade anymore, so this isn't urgent from my perspective.
Hello @derekperkins! This problem seems to be occurring because of https://github.com/vitessio/vitess/pull/15791. I don't see anything else that changed between v20 and v19.
To debug this properly, could you share the output of the following two queries on your MySQL instance?
SHOW VARIABLES LIKE 'rpl_semi_sync_%_enabled'
SELECT COUNT(*) > 0 AS plugin_loaded FROM information_schema.plugins WHERE plugin_name LIKE 'rpl_semi_sync%'
"this isn't urgent from my perspective."
We are treating this as semi-urgent because it breaks upgrade/downgrade for the "none" DurabilityPolicy.
+-------------------------------+-------+
| Variable_name | Value |
+-------------------------------+-------+
| rpl_semi_sync_replica_enabled | OFF |
| rpl_semi_sync_source_enabled | OFF |
+-------------------------------+-------+
+---------------+
| plugin_loaded |
+---------------+
| 1 |
+---------------+
Okay, this is very interesting. From the output you posted, it looks like the semi-sync plugin is loaded.
The code in v19 checks whether the plugin is loaded and, if it is, sets the semi-sync settings with SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0
In v20, we added code to choose the semi-sync command based on which plugin is loaded, but that logic doesn't exist in v19.
Did you by any chance upgrade your MySQL version too? According to the docs:
From MySQL 8.0.26, new versions of the source and replica plugins are supplied, which replace the terms “master” and “slave” with “source” and “replica” in system variables and status variables.
If you install the new rpl_semi_sync_source and rpl_semi_sync_replica plugins, the new system variables and status variables are available but the old ones are not.
If you install the old rpl_semi_sync_master and rpl_semi_sync_slave plugins, the old system variables and status variables are available but the new ones are not.
You cannot have both the new and the old version of the relevant plugin installed on an instance.
After upgrading to v20, if you upgrade your MySQL version to 8.0.26 or above, then I guess it is not possible to downgrade back to v19.
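To make the old-versus-new naming concrete, here is a minimal sketch of choosing the semi-sync variable names from what the server actually exposes instead of hard-coding the old ones. It is illustrative Python, not Vitess code: `disable_semi_sync_statement` and the two constants are hypothetical names, and the input is assumed to be the list of variable names returned by the SHOW VARIABLES query earlier in this thread.

```python
# Hypothetical helper for illustration only; this is not how Vitess is implemented.
OLD_VARS = ("rpl_semi_sync_master_enabled", "rpl_semi_sync_slave_enabled")
NEW_VARS = ("rpl_semi_sync_source_enabled", "rpl_semi_sync_replica_enabled")


def disable_semi_sync_statement(variable_names):
    """Build a SET GLOBAL statement that turns semi-sync off, or return None.

    variable_names: names returned by
    SHOW VARIABLES LIKE 'rpl_semi_sync_%_enabled'.
    """
    present = set(variable_names)
    # Only one plugin family can be installed, so at most one tuple matches.
    for candidates in (NEW_VARS, OLD_VARS):
        if all(name in present for name in candidates):
            assignments = ", ".join(f"GLOBAL {name} = 0" for name in candidates)
            return "SET " + assignments
    # No semi-sync variables exposed: the plugin is not loaded, nothing to set.
    return None
```

With the variable names from the output posted above (`rpl_semi_sync_replica_enabled`, `rpl_semi_sync_source_enabled`), this sketch produces a statement using the source/replica names. The statement from the v19 error message only comes out when the master/slave variables are the ones present.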
We didn't change MySQL versions at any point during this upgrade. We upgraded from v8.0.34 to v8.0.36 while on v19. We've been >= 8.0.26 since Nov 2021 and Vitess v11
Did you by any chance change the plugin that was being loaded?
plugin-load = rpl_semi_sync_master=semisync_master.so;rpl_semi_sync_slave=semisync_slave.so
VS
plugin-load = rpl_semi_sync_source=semisync_source.so;rpl_semi_sync_replica=semisync_replica.so
Oh, maybe you didn't do anything explicitly and it just happened implicitly, because mysqlctl, after the changes in https://github.com/vitessio/vitess/pull/15791, would start using the new mycnf file for versions over 8.0.26, and that file loads the new plugin. Let me look into when we use a my.cnf file. Maybe there is a flow that causes Vitess to start using the new my.cnf file, which loads the new plugin, but when you downgrade, the old my.cnf file doesn't take effect 🤷. I'll look into when we reinitialize my.cnf files and maybe that will tell us more.
As an immediate workaround for anyone facing this issue: to downgrade, load the old plugins instead of the new ones in my.cnf and restart MySQL so that they take effect.
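The workaround comes down to which plugin-load line ends up in my.cnf. The following is a rough Python sketch of that choice, under stated assumptions: `parse_version`, `plugin_load_line`, and the `force_old` switch are hypothetical names, and the 8.0.26 cutoff is taken from the docs quoted above rather than from mysqlctl's actual code.

```python
# Illustrative only; the two plugin-load lines are the ones quoted in this thread.
OLD_PLUGIN_LOAD = (
    "plugin-load = rpl_semi_sync_master=semisync_master.so;"
    "rpl_semi_sync_slave=semisync_slave.so"
)
NEW_PLUGIN_LOAD = (
    "plugin-load = rpl_semi_sync_source=semisync_source.so;"
    "rpl_semi_sync_replica=semisync_replica.so"
)


def parse_version(text):
    """Turn 'v8.0.36' or '8.0.36-28' into a comparable tuple such as (8, 0, 36)."""
    core = text.lstrip("v").split("-", 1)[0]
    return tuple(int(part) for part in core.split(".")[:3])


def plugin_load_line(mysql_version, force_old=False):
    """Pick the plugin-load line for my.cnf.

    force_old models the downgrade workaround: keep the old plugins even on a
    server that supports the new ones. Assumption: the new plugins are chosen
    from 8.0.26 onward.
    """
    if force_old or parse_version(mysql_version) < (8, 0, 26):
        return OLD_PLUGIN_LOAD
    return NEW_PLUGIN_LOAD
```

For example, `plugin_load_line("v8.0.36", force_old=True)` returns the old master/slave line, which is the configuration a v19 tablet expects.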
I looked into this further today, and here is what I found. We reinitialize the my.cnf file in the following cases: when the init, init-config, reinit_config, etc. commands of mysqlctl are run, and when mysqlctld is started, at which point it either creates a new my.cnf file or updates the old one in case the generated one is different.
So, what I believe happened is as follows -
@derekperkins Could you let me know if ☝️ is the correct sequence of operations?
If it is, then the fix is to also downgrade mysqlctl, which would change the my.cnf to use the old plugins again. We'll either have to backport part of https://github.com/vitessio/vitess/pull/15791 to v19 and release a patch, or change v20 so that it doesn't use the new plugins immediately and reintroduce the change in v21.
Overview of the Issue
We're seeing vttablet OOM incredibly fast on v20.0.0 for some reason, after running fine for a couple weeks. We attempted to downgrade to v19.0.4 to see if that changed anything, but the primary was unable to start. vtorc attempted to recover UndoDemotePrimary and couldn't ever succeed.
SET GLOBAL rpl_semi_sync_master_enabled = 0, GLOBAL rpl_semi_sync_slave_enabled = 0) failed: Unknown system variable 'rpl_semi_sync_master_enabled'
When I reverted that change back to v20.0.0, vtorc was able to successfully run UndoDemotePrimary.
Related issues:
Reproduction Steps
This was tested on a single node keyspace with only a single tablet.
Binary Version
Operating System and Environment details
Log Fragments