signal18 / replication-manager

Signal 18 repman - Replication Manager for MySQL / MariaDB / Percona Server
https://signal18.io/products/srm
GNU General Public License v3.0

Rep-Man CREATE OR REPLACE TABLE SQL Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER #264

sklasing closed this issue 5 years ago

sklasing commented 5 years ago

Any time I start up MRM on CentOS 6.10 with MariaDB 10.3.10, using either the latest release 2.0.1 or now the 2.1 dev version, I get upwards of 18 CREATE OR REPLACE Spider table statements issued by replication-manager. Every one of the statements fails with:

Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER.

The VM-based system and Spider proxy DB were recycled, including the VM itself, prior to the most recent test. I initially got the errors with a 7.5GB RAM configuration and have since bumped the RAM up to 52GB with a 40GB buffer pool, to approximate 80% of RAM. The errors continue.

Note there is ZERO activity on the system other than the replication-manager SQL sent to the proxy node.

Example replication-manager.log immediately after bringing down the DB, recycling the VM, and bringing it all back up. Zero business activity:

2018/12/13 18:58:21 [cluster_mdbshardproxy_shard2] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s11054090268322714256"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/13 18:58:21 [cluster_mdbshardproxy_shard3] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s15809700175834069483"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER

I then disabled spider_bgs_mode = 2 so that the proxy node would no longer run queries in parallel across the backend shard nodes, since that is known to be memory intensive, redid the entire test described above, and replication-manager continues to receive out-of-memory errors.
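For reference, the change amounts to the following in the proxy node's server configuration (a sketch; Spider's spider_bgs_mode default of 0 disables background search, and 2 was the previously enabled mode):

[mysqld]
# 0 disables Spider background (parallel) search across backend nodes
spider_bgs_mode = 0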

In summary, the identical CREATE OR REPLACE statements GENERATED by replication-manager fail with Error 1030 Out of memory, yet when executed directly on the proxy node they succeed.

The replication-manager config.toml is below. Let me know if other information is desired.

[Cluster_Mdbshardproxy_Shard1]
title = "Shard1"
db-servers-hosts = " 99.0.2.1:3306, 99.0.3.1:3306"
db-servers-prefered-master = " 99.0.2.1:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "repman:99999999"

[Cluster_Mdbshardproxy_Shard2]
title = "Shard2"
db-servers-hosts = " 99.0.2.2:3306, 99.0.3.2:3306"
db-servers-prefered-master = " 99.0.2.2:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "repman:99999999"

[Cluster_Mdbshardproxy_Shard3]
title = "Shard3"
db-servers-hosts = " 99.0.2.3:3306, 99.0.3.3:3306"
db-servers-prefered-master = " 99.0.2.3:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "repman:99999999"

[Default]
title = "Spider Proxy"
shardproxy = true
shardproxy-servers = " 99.0.1.1:3306, 99.0.1.2:3306, 99.0.1.3:3306"
shardproxy-user = "spiderman:99999999"

mdbshardproxy = true
mdbshardproxy-hosts = " 99.0.1.1:3306, 99.0.1.2:3306, 99.0.1.3:3306"
mdbshardproxy-user = "spiderman:99999999"

pre-failover-script = ""
post-failover-script = ""
autorejoin-script = ""

monitoring-datadir = "/var/lib/replication-manager"
monitoring-sharedir = "/usr/share/replication-manager"

# Timeout in seconds between consecutive monitoring
monitoring-ticker = 2

working-directory = "/var/lib/replication-manager"
share-directory = "/usr/share/replication-manager"
http-root = "/usr/share/replication-manager/dashboard"

logfile = "/var/log/replication-manager.log"

#########
# LOG
#########
log-file = "/var/log/replication-manager.log"
log-syslog = true
verbose = true

Key note: Immediately before starting replication-manager I can execute the exact same statements directly on the Spider proxy node with zero errors. Also, immediately after rep-man gets the errors, I can execute the same statements successfully directly on the proxy node.

Example:

MariaDB [backend]> CREATE OR REPLACE TABLE sbtest (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  k int(10) unsigned NOT NULL DEFAULT 0,
  c char(120) NOT NULL DEFAULT '',
  pad char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (id),
  KEY k (k)
) ENGINE=SPIDER DEFAULT CHARSET=latin1 COMMENT='wrapper "mysql", table "sbtest"'
PARTITION BY HASH (id) (
  PARTITION pt1 COMMENT = 'srv "s15809700175834069483", tbl "sbtest", database "backend"' ENGINE = SPIDER,
  PARTITION pt2 COMMENT = 'srv "s3450708238616861104", tbl "sbtest", database "backend"' ENGINE = SPIDER,
  PARTITION pt3 COMMENT = 'srv "s7514566272099706500", tbl "sbtest", database "backend"' ENGINE = SPIDER
);
Query OK, 0 rows affected (0.024 sec)

svaroqui commented 5 years ago

Hi,

Replication-manager loads the Spider system tables on startup. I have introduced a way to stop this, assuming you then handle it on your own: shardproxy-load-system = false
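For example, the flag sits with the other shard proxy settings in config.toml (placement as shown later in this thread):

[Default]
shardproxy = true
# skip loading the Spider system tables at replication-manager startup
shardproxy-load-system = false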

In the file https://github.com/signal18/replication-manager/blob/2.1/cluster/prx_shardproxy.go, the function ShardProxyBootstrap does the same job as the Spider install script, but it is not versioned. Could you try to find out whether there is a difference in the system tables or procedures here?
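One way to compare what is actually installed on the proxy is to list the Spider system objects (a hedged sketch; Spider keeps its system tables in the mysql schema under a spider_ prefix and registers its UDFs in mysql.func):

-- Spider system tables present on the proxy
SELECT table_name FROM information_schema.tables
WHERE table_schema = 'mysql' AND table_name LIKE 'spider_%';

-- Spider UDFs registered on the proxy
SELECT name FROM mysql.func WHERE name LIKE 'spider%';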

If not, it looks like I need a way to stop doing this every time replication-manager restarts; maybe it is not intended to be executed multiple times.

Thanks for all !


sklasing commented 5 years ago

Will implement and test shardproxy-load-system = false. I was looking for such a parameter.

My personal, initial reaction is no, this is not something I want replication-manager doing for me, EXCEPT that when I fail over a node, I like the idea of having rep-man coordinate rebuilding the Spider definitions to reflect the writes to a new shard master.

That said, I am trying to manage the master nodes via server definitions and not via table definitions, meaning all tables share the same sharded environments, so I can avoid having to administer every table when a failover occurs. Table administration simply must scale as a PROCESS, so the tweak at the point of fail-over should be inexpensive, both for the administrator and for the DB.

That said, if the above is truly necessary, the reason should be articulated, and it should be done as a script the DBA controls.

I believe there is another issue, though, with rep-man rebuilding these definitions, and possibly a config issue. To be clear, the exact same CREATE OR REPLACE statements do not error when I hand-execute them on the proxy; they only error out of memory when executed by rep-man, so it appears to be perhaps a configuration issue on the rep-man node itself, which is separate from the Spider proxy node. Rep-man nodes are currently 7.5GB RAM. Proxy nodes are 52GB RAM.

sklasing commented 5 years ago

I redownloaded the 2.1 developer version of rep-man and reinstalled. Rep-Man continues trying to create or replace the spider tables.

I set shardproxy-load-system = false in the config.toml

Any and all help is appreciated; in particular, I am looking for confirmation of the correct download. Download details follow the error messages.

cat /var/log/replication-manager.log | grep -ai "create or replace" | wc -l
18
cat /var/log/replication-manager.log | grep -ai "create or replace"
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard3] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s3046201595497614847"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s16051530979220499876"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard2] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s11054090268322714256"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard3] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s3046201595497614847"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s16051530979220499876"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard2] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s11054090268322714256"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard3] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s3046201595497614847"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s16051530979220499876"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard2] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s11054090268322714256"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard3] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s15809700175834069483"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s3450708238616861104"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard2] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s7514566272099706500"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard3] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s15809700175834069483"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s3450708238616861104"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard2] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s7514566272099706500"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard3] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s15809700175834069483"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard2] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s7514566272099706500"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 18:35:28 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s3450708238616861104"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
[xyzuser]#

I downloaded: http://ci.signal18.io/mrm/builds/2.1/replication-manager-osc-2.1.0_dev_598_g8acfd-1.x86_64.rpm

yum remove replication-manager-osc
rpm -i /home/sklasing/replication-manager-osc-2.1.0_dev_598_g8acfd-1.x86_64.rpm

My config.toml is below:

[Cluster_Mdbshardproxy_Shard1]
title = "Shard1"
db-servers-hosts = "99.0.2.1:3306,99.0.3.1:3306"
db-servers-prefered-master = "99.0.2.1:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "repman:99999999"

[Cluster_Mdbshardproxy_Shard2]
title = "Shard2"
db-servers-hosts = "99.0.2.2:3306,99.0.3.2:3306"
db-servers-prefered-master = "99.0.2.2:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "repman:99999999"

[Cluster_Mdbshardproxy_Shard3]
title = "Shard3"
db-servers-hosts = "99.0.2.3:3306,99.0.3.3:3306"
db-servers-prefered-master = "99.0.2.3:3306"
db-servers-credential = "spiderman:99999999"
db-servers-connect-timeout = 1
replication-credential = "repman:99999999"

[Default]
title = "Spider Proxy"
shardproxy = true
shardproxy-servers = "99.0.1.1:3306,99.0.1.2:3306,99.0.1.3:3306"
shardproxy-user = "spiderman:99999999"
shardproxy-load-system = false

mdbshardproxy = true
mdbshardproxy-hosts = "99.0.1.1:3306,99.0.1.2:3306,99.0.1.3:3306"
mdbshardproxy-user = "spiderman:99999999"

pre-failover-script = ""
post-failover-script = ""
autorejoin-script = ""

working-directory = "/var/lib/replication-manager"
share-directory = "/usr/share/replication-manager"

Any and all help is appreciated, in particular looking for confirmation of the correct download.

tanji commented 5 years ago

Hi @sklasing, our build system is broken; let me fix the issue, then I'll get back to you with the latest build.

tanji commented 5 years ago

Please download: http://ci.signal18.io/mrm/builds/2.1/replication-manager-osc-2.1.0_dev_603_g3d8d-1.x86_64.rpm

sklasing commented 5 years ago

Will do, and thanks; I should have test results today.

sklasing commented 5 years ago

Rep-man continues trying to rebuild the Spider schemas. Of particular note, I have usually seen it attempt to rebuild the two tables (replication_manager_schema.jobs and backend.sbtest), which are the only known Spider tables in the system at this point.

Prior to the switch that turns the load off, it would attempt each table 9 times, for a total of 18 out-of-memory failures. It appears to be attempting 3 times each across the 3 shards. What does not make sense is that the messages are identified by the individual shards, when the CREATE OR REPLACE statements are Spider proxy node DDL, NOT backend DDL.

Since setting shardproxy-load-system = false I have seen it do this only 6 times instead of 18, and recently 15 times, as noted by the wc -l on the rep-man log below. I have been clearing the log with every restart to make viewing/debugging easier. As stated, the counts are per restart and they seem to vary.
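To break those counts down per shard tag, a one-liner such as this works against the log format shown next (a convenience command, not from the original report; the cluster tag is the third whitespace-separated field):

grep -ai "create or replace" /var/log/replication-manager.log | awk '{print $3}' | sort | uniq -c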

The log:

2018/12/19 19:56:32 [cluster_mdbshardproxy_shard3] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s3046201595497614847"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 19:56:32 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE replication_manager_schema.jobs ENGINE=spider comment='wrapper "mysql", table "jobs", srv "s16051530979220499876"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 19:56:32 [cluster_mdbshardproxy_shard2] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s7514566272099706500"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 19:56:32 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s3450708238616861104"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 19:56:32 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s3450708238616861104"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER
2018/12/19 19:56:32 [cluster_mdbshardproxy_shard1] ERROR - Failed query CREATE OR REPLACE TABLE backend.sbtest ENGINE=spider comment='wrapper "mysql", table "sbtest", srv "s3450708238616861104"' Error 1030: Got error 128 "Out of memory in engine" from storage engine SPIDER

[root@vpn-replication-manager-01 log]# vim /etc/replication-manager/config.toml
[root@vpn-replication-manager-01 log]# cat replication-manager.log | grep -ai "create or replace" | wc -l
15
[root@vpn-replication-manager-01 log]# replication-manager-cli version
Replication Manager 2.1.0-dev for MariaDB 10.x and MySQL 5.7 Series
Full Version: 2.1.0-dev-603-g3d8da6b8
Build Time: 2018-12-19T19:04:21+0000
[root@vpn-replication-manager-01 log]# replication-manager-osc version
Replication Manager 2.1.0-dev for MariaDB 10.x and MySQL 5.7 Series
Full Version: 2.1.0-dev-603-g3d8da6b8
Build Time: 2018-12-19T19:02:46+0000
[root@vpn-replication-manager-01 log]#

svaroqui commented 5 years ago

Hi,

https://jira.mariadb.org/browse/MDEV-17224

According to this bug, it is possible that you already have the tables sbtest and jobs on the Spider nodes; in that case, please remove them before starting replication-manager.
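A minimal sketch of that cleanup (assuming the two tables named in this thread are the only ones involved):

-- run on each Spider proxy node before starting replication-manager
DROP TABLE IF EXISTS replication_manager_schema.jobs;
DROP TABLE IF EXISTS backend.sbtest;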

Tx


sklasing commented 5 years ago

I will do so while admitting I am confused.

It is quite possible I am missing something about why rep-man does this, but the frequency concerns me, since it is changing DDL when it does not appear to need to.

Also note that in the two hours rep-man has been running, it has rebuilt the CREATE OR REPLACE SERVER statements 400-plus times. None of the CREATE OR REPLACE SERVER statements show in the rep-man log (presumably because they succeed), but they do show in the Spider proxy node audit logs, with the IP of the rep-man node as the requester.

Ideally, during a failover only the one specific server definition is modified; that change also replicates to the proxy slaves. One would think the table definitions should not have to be modified. I am still testing this without rep-man running, that is, changing only the server definition of the failing shard backend cluster. I hope it works; if it does not, I am about to learn more about Spider internals.
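For illustration, repointing a single server definition might look like the following (a sketch only; the server name, host, and credentials are illustrative values taken from the config earlier in this thread):

-- on the Spider proxy: repoint one shard's server definition to the
-- newly promoted backend master; table definitions that reference the
-- server by name need no change
CREATE OR REPLACE SERVER s3450708238616861104
FOREIGN DATA WRAPPER mysql
OPTIONS (HOST '99.0.3.1', DATABASE 'backend', USER 'spiderman', PASSWORD '99999999', PORT 3306);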

sklasing commented 5 years ago

The shardproxy-load-system = false setting does not work; rep-man continues to modify CREATE SERVER and CREATE Spider table DDL at startup. Either that, or I misunderstood its purpose.

I proved on Friday of last week that one can modify only the single failed-over server definition, pointing it at the newly chosen shard-specific master, and traffic continues. There is absolutely no need for the massive data definition language changes being performed by rep-man, where rep-man continuously modifies the CREATE SERVER statements unnecessarily and, at startup, attempts to modify the individual Spider table definitions. As stated before, I witnessed 400 CREATE OR REPLACE SERVER statements in 2 hours with only 3 shards.

This needs to be placed under the DBA's control via a fail-over script, so the shop can choose how Spider definitions will be maintained. 99% of the DBAs I know would choose the minimalist strategy: if it is not broken, don't modify it. They would also choose to maintain shard master status via one server definition, not via every table definition. Even administrative processes must scale.

This design fulcrum alone will make or break our shop's decision to use rep-man for fail-over purposes.

At the least, the reason the continuous DDL modifications are deemed necessary should be clearly explained. I suspect the massive DDL modification has to do with covering all bases, in particular Spider's internal means of monitoring individual tables and replicating them to other shards as an internal, per-table fail-over backup of the data: if access to an individual table fails, fail the master over to the slave.

In this shop we will not be implementing the above "internal spider monitoring" because it will not scale.

The main point: there is zero need to maintain master shard node fail-over status per table definition when only the server definition requires maintenance to achieve the fail-over, and the same server definitions should not be repeatedly maintained when the overall shard architecture state has not changed.

Rep-man is modifying the table definitions to carry the IP of the new master shard node, when that is better maintained in the server definition; the table definition should reference only the name of the backend server. The only time the table should require modification is if a partition is added or removed, or if there is an actual business need to modify the table's attributes.

svaroqui commented 5 years ago

Please test the latest 2.1 release and reopen if this still happens.