Server has gone away error when set multiple mysql servers and max_replication_lag

sysown / proxysql

High-performance MySQL proxy with a GPL license.

http://www.proxysql.com

GNU General Public License v3.0

6.05k stars 983 forks source link

Server has gone away error when set multiple mysql servers and max_replication_lag #4513

Open TokuiNico opened 7 months ago

TokuiNico commented 7 months ago

Docker Image

ProxySQL version 2.6.0
OS version Debian GNU/Linux 11 (bullseye)

I've set up a hostgroup with two servers, db_primary and db_replica, to manage read-only connections. The db_replica is configured with a max_replication_lag of 1800 seconds. Ideally, most read-only traffic should route to db_replica unless its replication lag exceeds 1800 seconds, at which point connections should switch to db_primary.

hostgroup_id	hostname	port	gtid_port	status	weight	compression	max_connections	max_replication_lag	use_ssl	max_latency_ms	comment
1	db_primary	3306	0	ONLINE	1	0	5000	0	0	0	primady db for readonly user
1	db_replica	3306	0	ONLINE	100000	0	5000	1800	0	0	replica db for readonly user

However, while the expectation is that new read-only queries will redirect to db_primary without impacting active connections when db_replica's lag surpasses 1800 seconds, the actual switch disrupts existing queries, resulting in a "Server has gone away" error and preventing a smooth transition to db_primary.

JavierJF commented 7 months ago

Hi @TokuiNico,

could you please share the full error log? As the issue template mentions, it's a requisite for potential bug reports. Since the error is "Server has gone away", that might be an indication that MySQL itself terminated the connections. If you could share also MySQL error log during the time of this errors it will also help to troubleshoot the issue.

Thank you, regards. Javier.

TokuiNico commented 7 months ago

Hi, @JavierJF

I've checked the MySQL error log and didn't find anything relevant. Could this issue be related to using Peewee's connection pool and opening transactions during queries?

Also, another question: with ProxySQL in use, is there still a need for a client-side connection pool?

JavierJF commented 7 months ago

Hi @TokuiNico,

I've checked the MySQL error log and didn't find anything relevant. Could this issue be related to using Peewee's connection pool and opening transactions during queries?

Without the requested information, for now, the full ProxySQL error log, and at least the sections of the MySQL error log at the time of the incidents (would be convenient to have increased MySQL verbosity first, via SET GLOBAL log_error_verbosity=3;), it's early for making hypothesis.

Also, another question: with ProxySQL in use, is there still a need for a client-side connection pool?

That depends on your use case, ProxySQL helps with fast connections creation in the frontend, since those connections are much more lightweight that MySQL ones. Maybe this combined with an already warmed up connection pool is sufficient for your use case, determining if it's is something that requires measurements.

Thanks, Javier.

TokuiNico commented 7 months ago

I'll set log_error_verbosity and collect error log if similar issue happened.

Thanks.

kashak88 commented 6 months ago

I can confirm that this happens for us too. proxysql 2.5.5.

mysql> show mysql servers;
+--------------+------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| hostgroup_id | hostname   | port | gtid_port | status | weight | compression | max_connections | max_replication_lag | use_ssl | max_latency_ms | comment |
+--------------+------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
| 0            | 10.37.1.40 | 3306 | 0         | ONLINE | 1000   | 0           | 300             | 0                   | 0       | 0              |         |
| 1            | 10.37.1.40 | 3306 | 0         | ONLINE | 1      | 0           | 300             | 0                   | 0       | 0              |         |
| 1            | 10.37.1.41 | 3306 | 0         | ONLINE | 1000   | 0           | 300             | 1                   | 0       | 0              |         |
| 2            | 10.37.1.41 | 3306 | 0         | ONLINE | 1000   | 0           | 300             | 0                   | 0       | 0              |         |
+--------------+------------+------+-----------+--------+--------+-------------+-----------------+---------------------+---------+----------------+---------+
4 rows in set (0.01 sec)

We already have verbosity set to 3 on percona server and this happened two days in a row recently:

2024-05-12T05:03:24.618050Z 31972 [Note] Aborted connection 31972 to db: 'p2m_lcl' user: 'web_P2M' host: '10.37.1.50' (Got an error writing communication packets)
2024-05-13T05:04:25.611231Z 62680 [Note] Aborted connection 62680 to db: 'p2m_lcl' user: 'web_P2M' host: '10.37.1.50' (Got an error writing communication packets)

Proxysql log entries:

2024-05-12 01:03:24 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 3 second, count number: '1'
2024-05-12 01:03:24 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306
.....
2024-05-13 01:04:25 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 3 second, count number: '1'
2024-05-13 01:04:25 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306

app log entries:

[2024-05-12 01:03:24] [SUBPROGRAM-4] [symfony] [crit] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
[2024-05-12 01:03:24] [SUBPROGRAM-4] [symfony] [err] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
[2024-05-12 01:03:24] [SUBPROGRAM-4] [symfony] [crit] #0 /var/www/imsapps/p2m/lib/vendor/symfony/lib/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Connection/Statement.php(253): Doctrine_Connection->rethrowException(Object(PDOException), Object(Doctrine_Connection_Statement))
#1 /var/www/imsapps/p2m/lib/vendor/symfony/lib/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Connection.php(1036): Doctrine_Connection_Statement->execute(Array)
#2 /var/www/imsapps/p2m/lib/vendor/symfony/lib/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Query/Abstract.php(992): Doctrine_Connection->execute('SELECT p.id AS ...', Array)
#3 /var/www/imsapps/p2m/lib/vendor/symfony/lib/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Query/Abstract.php(1036): Doctrine_Query_Abstract->_execute(Array)
#4 /var/www/imsapps/p2m/lib/model/doctrine/OrgnodeTable.class.php(940): Doctrine_Query_Abstract->execute(Array, 3)
#5 /var/www/imsapps/p2m/lib/model/doctrine/TosTable.class.php(44): OrgnodeTable->getAncestorIds(NULL)
#6 /var/www/imsapps/p2m/lib/model/doctrine/SystemMail.class.php(454): TosTable->findByOrgnodeOrDefault(Object(Member))
#7 /var/www/imsapps/p2m/lib/model/doctrine/SystemMail.class.php(495): SystemMail::disableCommunicationByTos(Array)
#8 /var/www/imsapps/p2m/plugins/cgMailerPlugin/lib/cgMailer.class.php(65): SystemMail::disableCommunication(Array, 'processingError')
#9 /var/www/imsapps/p2m/plugins/cgImporterPlugin/lib/cgImporter.class.php(488): cgMailer::addToQueueByApp('admin', 'ai_email', 'processingError', Array)
#10 /var/www/imsapps/p2m/lib/model/doctrine/HierarchyImportQueue.class.php(109): cgImporter->sendErrorEmail('/var/ims/data/c...', 'Importer Crashe...', '/var/ims/data/c...')
#11 /var/www/imsapps/p2m/lib/model/doctrine/HierarchyImportQueue.class.php(35): HierarchyImportQueue->import(Object(SubProgram))
#12 /var/www/imsapps/p2m/batch/rabbit/file/processFileProcessingQueue.php(89): HierarchyImportQueue->execute()
#13 /var/www/imsapps/p2m/batch/rabbit/file/processFileProcessingQueue.php(128): executeFileProcessingQueue(Array)
#14 {main}
[2024-05-12 01:03:24] [SUBPROGRAM-4] [symfony] [crit] [HIERARCHY IMPORT ID:18756 4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
.....
[2024-05-13 01:04:30] [SUBPROGRAM-4] [symfony] [crit] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
[2024-05-13 01:04:30] [SUBPROGRAM-4] [symfony] [err] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
[2024-05-13 01:04:30] [SUBPROGRAM-4] [symfony] [crit] #0 /var/www/imsapps/p2m/lib/vendor/symfony/lib/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Connection/Statement.php(253): Doctrine_Connection->rethrowException(Object(PDOException), Object(Doctrine_Connection_Statement))
#1 /var/www/imsapps/p2m/lib/vendor/symfony/lib/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Connection.php(1036): Doctrine_Connection_Statement->execute(Array)
#2 /var/www/imsapps/p2m/lib/vendor/symfony/lib/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Query/Abstract.php(992): Doctrine_Connection->execute('SELECT p.id AS ...', Array)
#3 /var/www/imsapps/p2m/lib/vendor/symfony/lib/plugins/sfDoctrinePlugin/lib/vendor/doctrine/Doctrine/Query/Abstract.php(1036): Doctrine_Query_Abstract->_execute(Array)
#4 /var/www/imsapps/p2m/lib/model/doctrine/OrgnodeTable.class.php(940): Doctrine_Query_Abstract->execute(Array, 3)
#5 /var/www/imsapps/p2m/lib/model/doctrine/TosTable.class.php(44): OrgnodeTable->getAncestorIds(NULL)
#6 /var/www/imsapps/p2m/lib/model/doctrine/SystemMail.class.php(454): TosTable->findByOrgnodeOrDefault(Object(Member))
#7 /var/www/imsapps/p2m/lib/model/doctrine/SystemMail.class.php(495): SystemMail::disableCommunicationByTos(Array)
#8 /var/www/imsapps/p2m/plugins/cgMailerPlugin/lib/cgMailer.class.php(65): SystemMail::disableCommunication(Array, 'processingError')
#9 /var/www/imsapps/p2m/plugins/cgImporterPlugin/lib/cgImporter.class.php(488): cgMailer::addToQueueByApp('admin', 'ai_email', 'processingError', Array)
#10 /var/www/imsapps/p2m/lib/model/doctrine/HierarchyImportQueue.class.php(109): cgImporter->sendErrorEmail('/var/ims/data/c...', 'Importer Crashe...', '/var/ims/data/c...')
#11 /var/www/imsapps/p2m/lib/model/doctrine/HierarchyImportQueue.class.php(35): HierarchyImportQueue->import(Object(SubProgram))
#12 /var/www/imsapps/p2m/batch/rabbit/file/processFileProcessingQueue.php(89): HierarchyImportQueue->execute()
#13 /var/www/imsapps/p2m/batch/rabbit/file/processFileProcessingQueue.php(128): executeFileProcessingQueue(Array)
#14 {main}
[2024-05-13 01:04:30] [SUBPROGRAM-4] [symfony] [crit] [HIERARCHY IMPORT ID:18764 4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

JavierJF commented 6 months ago

Hi @kashak88,

Could you please provide the full ProxySQL error log during the incident, and also the relevant section of the MySQL error log? Fragments are okay, but for identifying some issues, many times is much better 1 hour of log than 1 min. Context is sometimes very important.

Thank you, Javier.

kashak88 commented 6 months ago

full mysql-error.log:

zcat mysql-error.log.10.gz
2024-05-14T04:39:11.759062Z 89255 [Note] Aborted connection 89255 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T04:41:51.807245Z 89301 [Note] Aborted connection 89301 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T05:05:56.434469Z 93320 [Note] Aborted connection 93320 to db: 'p2m_lcl' user: 'web_P2M' host: '10.37.1.50' (Got an error writing communication packets)
2024-05-14T06:56:56.674663Z 91900 [Note] Aborted connection 91900 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T07:02:17.223307Z 92121 [Note] Aborted connection 92121 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T09:51:01.885482Z 96258 [Note] Aborted connection 96258 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T11:31:00.180894Z 97372 [Note] Aborted connection 97372 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T11:37:17.932856Z 97493 [Note] Aborted connection 97493 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T12:04:13.432730Z 98463 [Note] Aborted connection 98463 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T12:36:27.587536Z 100076 [Note] Aborted connection 100076 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T15:40:25.245046Z 103995 [Note] Aborted connection 103995 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout reading communication packets)
2024-05-14T16:52:26.770681Z 105126 [Note] Aborted connection 105126 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T16:56:13.627711Z 105184 [Note] Aborted connection 105184 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T18:55:54.141051Z 107897 [Note] Aborted connection 107897 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout reading communication packets)
2024-05-14T18:57:42.685055Z 107960 [Note] Aborted connection 107960 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout reading communication packets)
2024-05-14T21:45:11.183963Z 111646 [Note] Aborted connection 111646 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T23:35:56.892822Z 112611 [Note] Aborted connection 112611 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-14T23:44:48.026567Z 112704 [Note] Aborted connection 112704 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-15T00:14:26.705450Z 117860 [Note] Aborted connection 117860 to db: 'unconnected' user: 'root' host: 'localhost' (Got an error reading communication packets)
2024-05-15T00:37:52.279054Z 113733 [Note] Aborted connection 113733 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-15T01:18:46.699365Z 115199 [Note] Aborted connection 115199 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-15T01:21:52.014640Z 115370 [Note] Aborted connection 115370 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout writing communication packets)
2024-05-15T03:26:48.605043Z 119882 [Note] Aborted connection 119882 to db: 'p2m_lcl' user: 'databox_dps' host: '10.19.1.40' (Got timeout reading communication packets)

Also attaching proxysql.log, the relevant entry:

2024-05-14 01:05:56 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306

proxysql.log-20240518.gz

JavierJF commented 6 months ago

Hi @kashak88 and @TokuiNico,

these logs show what I was expecting, and even If I would like to have a bigger section of MySQL error log, let's assume that the ProxySQL log line you supplied:

2024-05-14 01:05:56 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306

Is related with this MySQL log entry, since it's the only one which appears to be timing related:

2024-05-14T05:05:56.434469Z 93320 [Note] Aborted connection 93320 to db: 'p2m_lcl' user: 'web_P2M' host: '10.37.1.50' (Got an error writing communication packets)

This is why more logs are important, in your ProxySQL log there are multiple SHUNNING events against backend servers:

gunzip -kc proxysql.log-20240518.gz | grep "Shunning server" | wc -l
15

Yet, only a fraction of those events are related with ProxySQL error messages of lagging server during query:

gunzip -kc proxysql.log-20240518.gz | grep "Detected a lagging server during"
2024-05-12 01:03:24 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306
2024-05-13 01:04:25 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306
2024-05-14 01:05:56 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306

So, only by this data, it's much more likely that ProxySQL isn't the cause of this backend connections errors, but probably some real issues either in the backend server or in the network. The message that ProxySQL is issuing, isn't that it's either killing the backend connection, or performing an action on it due to the SHUNNING for the server, it's announcing that an error was found on a backend connection in which the server was detected to be LAGGING.

In short, these logs show no correlation between the events, SHUNNING is performed multiple times, yet only during a brief interval the traffic was impacted, and that itself points to potential network issues that could even be responsible for that lagging ProxySQL detected. I would suggest to check your monitoring during that time, that could show more information about the cause of those errors.

@TokuiNico so far there is no evidence of the behavior you were suggesting when the issue was opened, and the logs that are required by the issue template haven't being provided. Unless the logs are provided, I will be closing the issue since it's already misleading other issues into assuming there is an issue.

Thanks, regards. Javier.

kashak88 commented 6 months ago

Hi @JavierJF , I find it hard to believe that there is no correlation when the same issue happened 3 times in 3 days at the same time as SHUNNING event happened:

admin_prod.log.13.gz:[2024-05-14 01:06:01] [SUBPROGRAM-4] [symfony] [crit] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
admin_prod.log.13.gz:[2024-05-14 01:06:01] [SUBPROGRAM-4] [symfony] [err] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
admin_prod.log.13.gz:[2024-05-14 01:06:01] [SUBPROGRAM-4] [symfony] [crit] [HIERARCHY IMPORT ID:18772 4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
admin_prod.log.14.gz:[2024-05-13 01:04:30] [SUBPROGRAM-4] [symfony] [crit] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
admin_prod.log.14.gz:[2024-05-13 01:04:30] [SUBPROGRAM-4] [symfony] [err] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
admin_prod.log.14.gz:[2024-05-13 01:04:30] [SUBPROGRAM-4] [symfony] [crit] [HIERARCHY IMPORT ID:18764 4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
admin_prod.log.15.gz:[2024-05-12 01:03:24] [SUBPROGRAM-4] [symfony] [crit] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
admin_prod.log.15.gz:[2024-05-12 01:03:24] [SUBPROGRAM-4] [symfony] [err] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away
admin_prod.log.15.gz:[2024-05-12 01:03:24] [SUBPROGRAM-4] [symfony] [crit] [HIERARCHY IMPORT ID:18756 4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

zgrep "Shunning server" proxysql.log-20240518.gz 
2024-05-12 01:03:24 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 3 second, count number: '1'
2024-05-12 05:23:34 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 9 second, count number: '1'
2024-05-12 05:24:14 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 7 second, count number: '1'
2024-05-13 01:04:25 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 3 second, count number: '1'
2024-05-13 05:30:35 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 6 second, count number: '1'
2024-05-13 07:49:55 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 9 second, count number: '1'
2024-05-14 01:05:56 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 4 second, count number: '1'
2024-05-14 05:36:26 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 7 second, count number: '1'
2024-05-15 05:29:07 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 6 second, count number: '1'
2024-05-15 05:29:37 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 7 second, count number: '1'
2024-05-16 05:28:19 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 6 second, count number: '1'
2024-05-16 05:28:49 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 8 second, count number: '1'
2024-05-17 05:11:20 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 7 second, count number: '1'
2024-05-17 05:11:50 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 8 second, count number: '1'
2024-05-18 05:15:30 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 6 second, count number: '1'

[root@p2m-db-lcl-replica mysql]# zgrep 'Got an error writing communication' mysql-error.log*
mysql-error.log.13.gz:2024-05-14T05:05:56.434469Z 93320 [Note] Aborted connection 93320 to db: 'p2m_lcl' user: 'web_P2M' host: '10.37.1.50' (Got an error writing communication packets)
mysql-error.log.14.gz:2024-05-13T05:04:25.611231Z 62680 [Note] Aborted connection 62680 to db: 'p2m_lcl' user: 'web_P2M' host: '10.37.1.50' (Got an error writing communication packets)
mysql-error.log.15.gz:2024-05-12T05:03:24.618050Z 31972 [Note] Aborted connection 31972 to db: 'p2m_lcl' user: 'web_P2M' host: '10.37.1.50' (Got an error writing communication packets)

Yes, there are way more SHUNNING events than errors for the application/mysql-error log entries but that's because we have a very small lag tolerance for read/write split hostgroup. This also seems to only happen during specific operation on our application (HIERARCHY IMPORT), so it could be that there is some very specific turn of events that need to happen for the issue to occur. Furthermore, if you look closely, those 15 events you mentioned are very spread out. We are on 10gbit network and all the involved VMs are inside the same local ESXI cluster. Plenty of RAM/CPU, systems are patched/rebooted monthly. That doesn't completely rule out a failure that could be unrelated to proxysql but I think it's wrong to dismiss the timing or mysql-error log entries + proxysql SHUNNING log entries, especially considering that I'm not the person that opened this ticket and others are experiencing similar issues. Hopefully @TokuiNico can chime in with logs of their own

JavierJF commented 6 months ago

Hi @kashak88,

let's go by parts:

I find it hard to believe that there is no correlation

Issues are handled based on the evidence received, not on believes. So, if there is no proof of correlation, and so far we are yet to see evidence of that, there is absolutely no point in assuming correlation, as this could even mislead the investigation process.

when the same issue happened 3 times in 3 days at the same time as SHUNNING event happened

Is there more context that you can provide? Or you will be feeding into the issue, one part at a time, context pieces that you think makes enough context?

Yes, there are way more SHUNNING events than errors for the application/mysql-error log entries but that's because we have a very small lag tolerance for read/write split hostgroup

This explains just the conditions for the SHUNNING operation, but offers no information at all about the correlation with the errors.

This also seems to only happen during specific operation on our application (HIERARCHY IMPORT), so it could be that there is some very specific turn of events that need to happen for the issue to occur.

Again new context that wasn't provided before. If you consider there are very specific turn of events that need to happen for the issue to occur don't you think that providing all the context you can, without bias, would be the best approach to help us find the issue?

Furthermore, if you look closely, those 15 events you mentioned are very spread out.

I don't need to look closely something that is obvious. And that fact doesn't add any extra information that is relevant for the troubleshooting.

That doesn't completely rule out a failure that could be unrelated to proxysql but I think it's wrong to dismiss the timing or mysql-error log entries + proxysql SHUNNING log entries,

Nobody has dismissed the timing or mysql-error log entries + proxysql SHUNNING log entries. I said, that there is so far, no evidence that the SHUNNING is involved in these particular errors. Quoting myself:

So, only by this data, it's much more likely that ProxySQL isn't the cause of this backend connections errors, but probably some real issues either in the backend server or in the network. The message that ProxySQL is issuing, isn't that it's either killing the backend connection, or performing an action on it due to the SHUNNING for the server, it's announcing that an error was found on a backend connection in which the server was detected to be LAGGING.

So to reiterate, the correlation of the errors and the SHUNNING isn't clear at this point. And the data and details that you have provided so far, doesn't allow to perform that connection. I have tried to explain what can be deduced from that data, with the very little context that has been provided in this issue, and what this particular error message [ERROR] Detected a lagging server during query really means. Let me continue elaborating in the next answer.

especially considering that I'm not the person that opened this ticket and others are experiencing similar

Again, too many assumptions, the two issues could be related, or could not be the same issue at all. We don't have data yet, just some generic error messages, and conditions that also usually indicate network issues, or at least, and increase in network latency, combined with particular conditions that your application is creating. Of course, as you might guess, I'm also lacking the context for that.

If you want you get to the bottom of the particular issue you are experiencing, stop self diagnosing the issue, share all the context that you can, and then, I will try to assist, asking for more information that we can use to diagnose it.

Thanks, Javier.

kashak88 commented 5 months ago

Hi Javier, I am able to replicate this issue using below script:

[root@p2m-processor-lcl ~]# cat test.php 
<?php
$servername = "127.0.0.1";
$username = "someuser";
$password = "somepass";
$dbname = "p2m_lcl";
$port = 6033; // Default ProxySQL port

// Create connection
$conn = new mysqli($servername, $username, $password, $dbname, $port);

// Check connection
if ($conn->connect_error) {
    die("Connection failed: " . $conn->connect_error);
}

while (true) {
    $sql = "SELECT * FROM members_details";
    if ($result = $conn->query($sql)) {
        echo "Query successful\n";
        $result->close();
    } else {
        echo "Error: " . $conn->error . "\n";
    }

    // Sleep for a short period before next query
    usleep(500000); // 0.5 second
}

$conn->close();
?>

After I execute above, it starts looping, all queries succeed. As soon as I introduce lag by running optimize table on a 140~ish MB table:

[root@p2m-db-lcl-master p2m_lcl]# mysql -e "optimize table p2m_lcl.p2m_award_transaction"
+-------------------------------+----------+----------+-------------------------------------------------------------------+
| Table                         | Op       | Msg_type | Msg_text                                                          |
+-------------------------------+----------+----------+-------------------------------------------------------------------+
| p2m_lcl.p2m_award_transaction | optimize | note     | Table does not support optimize, doing recreate + analyze instead |
| p2m_lcl.p2m_award_transaction | optimize | status   | OK                                                                |
+-------------------------------+----------+----------+-------------------------------------------------------------------+

Query successful
Query successful
Query successful
Query successful
Query successful
Query successful
Query successful
Query successful
Query successful
Query successful
Query successful
Error: 
Error: MySQL server has gone away
Error: MySQL server has gone away
Error: MySQL server has gone away
Error: MySQL server has gone away
Error: MySQL server has gone away
Error: MySQL server has gone away
Error: MySQL server has gone away
Error: MySQL server has gone away
Error: MySQL server has gone away

2024-06-06 12:43:09 MySQL_HostGroups_Manager.cpp:3490:replication_lag_action_inner(): [WARNING] Shunning server 10.37.1.41:3306 from HG 1 with replication lag of 3 second, count number: '1'
2024-06-06 12:43:09 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306
2024-06-06 12:43:19 MySQL_HostGroups_Manager.cpp:3514:replication_lag_action_inner(): [WARNING] Re-enabling server 10.37.1.41:3306 from HG 1 with replication lag of 0 second

[06-Jun-2024 12:43:10 America/Toronto] PHP Warning:  mysqli::query(): (00000/0):  in /root/test.php on line 18
[06-Jun-2024 12:43:10 America/Toronto] PHP Warning:  mysqli::query(): MySQL server has gone away in /root/test.php on line 18
[06-Jun-2024 12:43:10 America/Toronto] PHP Warning:  mysqli::query(): Error reading result set's header in /root/test.php on line 18

When I point the script directly at replica, I cannot replicate the same behaviour at all, but it's very easy to do when proxysql is involved. Some details about tables involved:

mysql> select count(*) from p2m_lcl.p2m_award_transaction;
+----------+
| count(*) |
+----------+
|   272755 |
+----------+
1 row in set (0.04 sec)

mysql> select count(*) from p2m_lcl.members_details;
+----------+
| count(*) |
+----------+
|   648596 |
+----------+
1 row in set (0.12 sec)

Ubuntu 20.04 php7.1 7.1.33-58+ubuntu20.04.1+deb.sury.org+1 mysql> select @@version; +---------------+ | @@version | +---------------+ | 5.7.44-48-log | +---------------+ 1 row in set (0.00 sec)

proxysql 2.5.5

JavierJF commented 5 months ago

Hi @kashak88 and @TokuiNico,

thanks @kashak88 for bringing more details about your reproduction case. It has helped me to realize that maybe there is a misconception about how SHUNNED BY REPLICATION LAG works, and that probably all there is to this issue is that should be better clarified and stated in the documentation. I have already checked and it's not clear in any of the relevant places it should be.

What helped me to realize this is that you were running queries that (I assume) don't immediately return. And that are already running queries the ones that are impacted by this SHUNNED state. This is something that maybe looks obvious to you, but I was fixated in thinking that ProxySQL was failing on queries after SHUNNED state took place, thus, this backend connections that were SHUNNED were impacting further traffic (this would be a real misbehavior for SHUNNED BY REPLICATION LAG, thus my bias into this fixed thinking).

So, to address what I think is the main point of the issue, I will quote @TokuiNico because it's the issue reporter and its original question got lost during the later exchanges, probably because of the waiting of logs for confirmation, and the other questions that were raised:

However, while the expectation is that new read-only queries will redirect to db_primary without impacting active connections when db_replica's lag surpasses 1800 seconds, the actual switch disrupts existing queries, resulting in a "Server has gone away" error and preventing a smooth transition to db_primary.

There are several conditions on which the switch to SHUNNED BY REPLICATION LAG will disrupt queries:

Multiplexing is disabled: Either globally or in the connection, this means that the backend connection is tied to the frontend connection, and if the backend connection is no longer a viable one, like in the case of SHUNNED BY REPLICATION LAG.
Transactions: Transactions disable multiplexing, so, connections with transactions will be impacted in the same way than previously described.
Currently running queries: Queries that don't immediately return, either due to processing time or because they are transferring data will also be impacted after the status switch.

In the previous cases, when ProxySQL finds that the server status has switch to SHUNNED by replication lag, the client connection is disconnected, as if the server was found OFFLINE. So, this transition is not guarantee to be smooth. This is intentional, and specific for this shunning state, regular SHUNNED state, doesn't exhibit these consequences. This should be better documented and I'm going to take care of it.

For completing the response for you @kashak88, since you provided logs, and I was fixed in the wrong scenario. Let me correct myself. These kind of client errors that you were seeing, with error 2006 being these ones:

[symfony] [crit] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away admin_prod.log.13.gz:[2024-05-14 01:06:01] [SUBPROGRAM-4] [symfony] [err] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away admin_prod.log.13.gz:[2024-05-14 01:06:01] [SUBPROGRAM-4] [symfony] [crit] [HIERARCHY IMPORT ID:18772 4] SQLSTATE[HY000]: General error: 2006 MySQL server has gone away

are probably directly related to these kind of messages that you were seeing in ProxySQL error log:

gunzip -kc proxysql.log-20240518.gz | grep "Detected a lagging server during" 2024-05-13 01:04:25 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306 2024-05-14 01:05:56 MySQL_Session.cpp:4225:handler_ProcessingQueryError_CheckBackendConnectionStatus(): [ERROR] Detected a lagging server during query: 10.37.1.41, 3306

(Note that I only say probably because 2006 is a very generic error, and log entries doesn't allow you to do the direct association, but in the absence of other errors in the time interval, they are related).

As this is the expected ProxySQL -> client interaction on the previously described scenarios. These errors were few, because as explained, for your case were likely just the times that ProxySQL found a already running query in the server that was flagged as SHUNNED. Technically, this doesn't exclude that there was an error already in the connection, and potential errors in MySQL side, but they are ignored by ProxySQL at that point.

If this behavior isn't what your were expecting (I assume not), I invite you to create a FR and make your case for smooth transitions when servers gets SHUNNED by replication lag. But, if this explanation satisfies your, please close the issue, as this is considered expected behavior.

Regards, Javier.

kashak88 commented 5 months ago

Hi @JavierJF , Thanks. I don't mind making a FR, just kind of confused why the behaviour isn't smooth to begin with? Genuinely curious. Are there technical drawbacks in letting existing connections with running queries just finish before fully shunning the server. As in block new connections but don't kill existing non-idle ones. There could be "heavy" reports(long queries) running when the lag occurs. FWIW we use a separate HG(unique port+query rule) that doesn't care about lag at all for these sorts of scenarios, but this issue is occurring in cases where, as you said, the result set is not returned immediately but queries aren't THAT slow either.

Thanks again

JavierJF commented 5 months ago

Hi @kashak88,

yw! I will try to clarify all your questions, and update the doc with these answers as I consider then important info.

just kind of confused why the behavior isn't smooth to begin with? Genuinely curious. Are there technical drawbacks in letting existing connections with running queries just finish before fully shunning the server. As in block new connections but don't kill existing non-idle ones. There could be "heavy" reports(long queries) running when the lag occurs.

In short, yes there are technical drawback in letting existing connections with running queries just finish. But mostly, because it conflict with the original motivation for SHUNNED BY REPLICATION LAG. This state is supposed to be a hard limit, meaning, that reads which are performed on a server above that replication lag, are not acceptable. With this kind of hard-limits, we try to take the safest approach we can. Now with this mindset, we can revisit the previous scenarios that are impacted:

Multiplexing is disabled: Since multiplexing is disabled, the client will keep doing queries to an already non-acceptable server. This connection can have a very long life-spam, doing unwanted queries, the safest thing is terminating.
Transactions: Same motivation as before.
Currently running queries: For this one, if the query return immediately there is nothing to do, the operation is "atomic" regarding to the status server change. Hence, previous operations succeed, further operations are denied. Now, if the query doesn't return immediately, it's not simple to decide whether you can or not safely let this query run without being affected by the replication_lag. If we are thinking of single table, simple queries, the answer is more trivial, but... what about store procedures? And why should we assume that the isolation level the connection or server is using allows for this long running query to not perform reads that wont comply with the hard-limit we have imposed?

So, in short, it isn't trivial to decide whether the query is safe to execute in the connection, and the safest and simplest option is not allowing this queries to run. As you can see, this is a matter of definitions, and use cases. Currently max_replication_lag is defined as hard-limit, and not read breaking it should be allowed. This contrast a more soft-limit kind of definition that is interesting if you know your workload, and you know the conditions of those "long-running" queries and so, you can determine it's safe to let those queries run.

this issue is occurring in cases where, as you said, the result set is not returned immediately but queries aren't THAT slow either.

Yes, because of the previously described, not returning immediately is enough for opening a time-window that will make it collide with the server status change.

FWIW we use a separate HG(unique port+query rule) that doesn't care about lag at all for these sorts of scenarios

This is the kind of solution we propose for these scenarios. If you don't care about lag for certain queries, you can just place your server in another hostgroup (without the max_replication_lag) and redirect those queries to that hostgroup via query rules. This way you ensure:

No queries for what you care about ever exceed the hard-limit of max_replication_lag.
Queries in which you don't care about replication lag, will run without restrictions.

Hope this info helps to clarify your doubts. Let me know otherwise.

Regards, Javier.