sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

Queries are still routed to a SHUNNED backend mysql server #4267

Open jermlinden opened 1 year ago

jermlinden commented 1 year ago

Queries are still occasionally routed to a backend MySQL server after it dies a rough & tumble death and ProxySQL marks it as SHUNNED.

ProxySQL version: 2.4.8-2
OS version: Debian 10.13

How I reproduced:

  1. Add a mysql backend server to the proxysql config
  2. Check to see that queries are being routed to it e.g. mysql -h proxysql-hostname -D database_name -e "SELECT DATABASE(), @@HOSTNAME"
  3. Shut the server down in a mean, malicious way (e.g. aws ec2 stop-instances)
  4. Repeat: connect to proxysql with a query that would have been routed to the downed server
MySQL [(none)]> select * from stats_mysql_connection_pool WHERE hostgroup=2\G
*************************** 1. row ***************************
        hostgroup: 2
         srv_host: int.mysql57-indra-mitra-1.dev.secondlife.io
         srv_port: 3306
           status: ONLINE
         ConnUsed: 0
         ConnFree: 1
           ConnOK: 1
          ConnERR: 0
      MaxConnUsed: 1
          Queries: 24626
Queries_GTID_sync: 0
  Bytes_data_sent: 714154
  Bytes_data_recv: 1231300
       Latency_us: 316
*************************** 2. row ***************************
        hostgroup: 2
         srv_host: int.mysql57-indra-mitra-666.dev.secondlife.io
         srv_port: 3306
           status: SHUNNED
         ConnUsed: 0
         ConnFree: 0
           ConnOK: 1
          ConnERR: 6
      MaxConnUsed: 1
          Queries: 1341
Queries_GTID_sync: 0
  Bytes_data_sent: 38889
  Bytes_data_recv: 69680
       Latency_us: 732
for i in {0..999} ; do mysql -h proxysql-hostname -D database_name -e "SELECT DATABASE(), @@HOSTNAME" ; done

Almost all of the queries go to the host that is still up, but occasionally a query slips through to the shunned host and then fails:

2023-06-20 23:19:25 MySQL_Monitor.cpp:2890:monitor_ping(): [ERROR] Server int.mysql57-indra-mitra-666.dev.secondlife.io:3306 missed 3 heartbeats, shunning it and killing all the connections. Disabling other checks until the node comes back online.
2023-06-20 23:19:34 MySQL_Session.cpp:2819:handler_again___status_CONNECTING_SERVER(): [ERROR] Max connect timeout reached while reaching hostgroup 2 after 10000ms . HG status: [{"Bytes_recv":"1169800","Bytes_sent":"678484","ConnERR":"0","ConnFree":"1","ConnOK":"1","ConnUsed":"0","Latency_us":"345","MaxConnUsed":"1","Queries":"23396","Queries_GTID_sync":"0","hostgroup":"2","srv_host":"int.mysql57-indra-mitra-1.dev.secondlife.io","srv_port":"3306","status":"ONLINE"},{"Bytes_recv":"69680","Bytes_sent":"38889","ConnERR":"6","ConnFree":"0","ConnOK":"1","ConnUsed":"1","Latency_us":"732","MaxConnUsed":"1","Queries":"1341","Queries_GTID_sync":"0","hostgroup":"2","srv_host":"int.mysql57-indra-mitra-666.dev.secondlife.io","srv_port":"3306","status":"SHUNNED"}]
2023-06-20 23:20:25 MySQL_Monitor.cpp:2890:monitor_ping(): [ERROR] Server int.mysql57-indra-mitra-666.dev.secondlife.io:3306 missed 3 heartbeats, shunning it and killing all the connections. Disabling other checks until the node comes back online.
2023-06-20 23:20:34 MySQL_Session.cpp:2819:handler_again___status_CONNECTING_SERVER(): [ERROR] Max connect timeout reached while reaching hostgroup 2 after 10000ms . HG status: [{"Bytes_recv":"1190000","Bytes_sent":"690200","ConnERR":"0","ConnFree":"1","ConnOK":"1","ConnUsed":"0","Latency_us":"312","MaxConnUsed":"1","Queries":"23800","Queries_GTID_sync":"0","hostgroup":"2","srv_host":"int.mysql57-indra-mitra-1.dev.secondlife.io","srv_port":"3306","status":"ONLINE"},{"Bytes_recv":"69680","Bytes_sent":"38889","ConnERR":"6","ConnFree":"0","ConnOK":"1","ConnUsed":"1","Latency_us":"732","MaxConnUsed":"1","Queries":"1341","Queries_GTID_sync":"0","hostgroup":"2","srv_host":"int.mysql57-indra-mitra-666.dev.secondlife.io","srv_port":"3306","status":"SHUNNED"}]
2023-06-20 23:21:25 MySQL_Monitor.cpp:2890:monitor_ping(): [ERROR] Server int.mysql57-indra-mitra-666.dev.secondlife.io:3306 missed 3 heartbeats, shunning it and killing all the connections. Disabling other checks until the node comes back online.
2023-06-20 23:21:34 MySQL_Session.cpp:2819:handler_again___status_CONNECTING_SERVER(): [ERROR] Max connect timeout reached while reaching hostgroup 2 after 10000ms . HG status: [{"Bytes_recv":"1210200","Bytes_sent":"701916","ConnERR":"0","ConnFree":"0","ConnOK":"1","ConnUsed":"1","Latency_us":"363","MaxConnUsed":"1","Queries":"24204","Queries_GTID_sync":"0","hostgroup":"2","srv_host":"int.mysql57-indra-mitra-1.dev.secondlife.io","srv_port":"3306","status":"ONLINE"},{"Bytes_recv":"69680","Bytes_sent":"38889","ConnERR":"6","ConnFree":"0","ConnOK":"1","ConnUsed":"1","Latency_us":"732","MaxConnUsed":"1","Queries":"1341","Queries_GTID_sync":"0","hostgroup":"2","srv_host":"int.mysql57-indra-mitra-666.dev.secondlife.io","srv_port":"3306","status":"SHUNNED"}]
2023-06-20 23:21:55 MySQL_Monitor.cpp:2890:monitor_ping(): [ERROR] Server int.mysql57-indra-mitra-666.dev.secondlife.io:3306 missed 3 heartbeats, shunning it and killing all the connections. Disabling other checks until the node comes back online.
2023-06-20 23:22:05 MySQL_Session.cpp:2819:handler_again___status_CONNECTING_SERVER(): [ERROR] Max connect timeout reached while reaching hostgroup 2 after 10000ms . HG status: [{"Bytes_recv":"1218700","Bytes_sent":"706846","ConnERR":"0","ConnFree":"1","ConnOK":"1","ConnUsed":"0","Latency_us":"388","MaxConnUsed":"1","Queries":"24374","Queries_GTID_sync":"0","hostgroup":"2","srv_host":"int.mysql57-indra-mitra-1.dev.secondlife.io","srv_port":"3306","status":"ONLINE"},{"Bytes_recv":"69680","Bytes_sent":"38889","ConnERR":"6","ConnFree":"0","ConnOK":"1","ConnUsed":"1","Latency_us":"732","MaxConnUsed":"1","Queries":"1341","Queries_GTID_sync":"0","hostgroup":"2","srv_host":"int.mysql57-indra-mitra-666.dev.secondlife.io","srv_port":"3306","status":"SHUNNED"}]
MySQL [(none)]> select * from stats_mysql_connection_pool WHERE hostgroup=2\G
*************************** 1. row ***************************
        hostgroup: 2
         srv_host: int.mysql57-indra-mitra-1.dev.secondlife.io
         srv_port: 3306
           status: ONLINE
         ConnUsed: 0
         ConnFree: 1
           ConnOK: 1
          ConnERR: 0
      MaxConnUsed: 1
          Queries: 34589
Queries_GTID_sync: 0
  Bytes_data_sent: 1003081
  Bytes_data_recv: 1729450
       Latency_us: 343
*************************** 2. row ***************************
        hostgroup: 2
         srv_host: int.mysql57-indra-mitra-666.dev.secondlife.io
         srv_port: 3306
           status: SHUNNED
         ConnUsed: 0
         ConnFree: 0
           ConnOK: 1
          ConnERR: 6
      MaxConnUsed: 1
          Queries: 1341
Queries_GTID_sync: 0
  Bytes_data_sent: 38889
  Bytes_data_recv: 69680
       Latency_us: 732
2 rows in set (0.004 sec)
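The slip-through pattern above can be sketched with a toy model. This is NOT ProxySQL's actual code; it only illustrates the hypothesis that a SHUNNED server is skipped until a recovery window elapses and is then probed again, so roughly one query per window hits the dead host. The 10-second window is an assumption loosely based on ProxySQL's mysql-shun_recovery_time_sec default.

```python
# Toy model (NOT ProxySQL's internals): a SHUNNED server is skipped,
# but after a recovery window it is retried; if the retry fails, the
# query errors out and the server is shunned again. This reproduces
# the "occasional query slips through to the dead host" pattern.
SHUN_RECOVERY_TIME = 10  # seconds; assumption, see lead-in

def route(now, shunned_since):
    """Return (host that serves a query at time `now`, new shun timestamp)."""
    if now - shunned_since >= SHUN_RECOVERY_TIME:
        # Recovery window elapsed: probe the shunned (dead) host again.
        return "mitra-666 (fails)", now  # slips through, re-shunned
    return "mitra-1", shunned_since      # healthy host serves the query

hits = {"mitra-1": 0, "mitra-666 (fails)": 0}
shunned_since = 0                        # dead host shunned at t=0
for t in range(1, 100):                  # one query per second
    host, shunned_since = route(t, shunned_since)
    hits[host] += 1

print(hits)  # most queries hit mitra-1; ~1 per window slips through
```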
renecannao commented 1 year ago

Please provide:

Note: ProxySQL is designed to handle thousands (even a million) of client connections, not a single client connection. If you use a single client connection you may experience behavior that is not typical of production environments.

jermlinden commented 1 year ago

docker.log proxysql.mitra.cnf.txt

Sorry for the delay. GitHub didn't like me uploading the config file as .cnf, so I renamed it to .txt.

Also note that I was making a new connection to proxysql each time, not reusing connections to it.

renecannao commented 1 year ago

I see at least 2 serious misconfigurations, but not only.

You configured hostgroup 2 with 2 servers and different weights:

  int.mysql57-indra-mitra-666.dev.secondlife.io : weight 1000
  int.mysql57-indra-mitra-1.dev.secondlife.io : weight 1

I think this was important information that wasn't mentioned. And it doesn't even match the config file...

So ProxySQL is configured:

Because proxysql is configured to desperately try to connect to int.mysql57-indra-mitra-666.dev.secondlife.io, after 5 errors it shuns the server and stops trying.
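The effect of that weight imbalance can be sketched with a quick simulation (this is not ProxySQL's internal code, just a weighted random pick; the 5-error threshold mentioned above corresponds to ProxySQL's mysql-shun_on_failures default):

```python
import random

random.seed(42)  # deterministic for illustration

# Toy weighted pick (NOT ProxySQL's internals): with weights 1000 vs 1,
# virtually every new connection targets the -666 host, so once that
# host dies its error counter hits the shun threshold almost instantly.
servers = {"int.mysql57-indra-mitra-666.dev.secondlife.io": 1000,
           "int.mysql57-indra-mitra-1.dev.secondlife.io": 1}

hosts = list(servers)
weights = list(servers.values())
picks = random.choices(hosts, weights=weights, k=10_000)
share_666 = picks.count(hosts[0]) / len(picks)
print(f"{share_666:.1%} of picks target the weight-1000 host")
```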

Now, the other very important misconfiguration: you configured mysql-connect_timeout_server to 10000 milliseconds, that is 10 seconds. I have a question here: what is your reason for configuring mysql-connect_timeout_server to such a high value? Note that the default for mysql-connect_timeout_server is 1000: if proxysql isn't able to connect to the backend in a reasonable amount of time, something is wrong. 10 seconds is not a reasonable amount of time; it is a lot.
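As back-of-the-envelope arithmetic (assumptions, not measured values): the "after 10000ms" in the error log above is exactly this budget being exhausted, so a single unlucky client query can hang for the full 10 seconds before erroring, versus roughly 1 second with the cited default.

```python
# Worst-case client stall from the connect timeout setting.
connect_timeout_server_ms = 10_000   # reporter's configured value
default_ms = 1_000                   # default cited above

worst_case_s = connect_timeout_server_ms / 1000
default_case_s = default_ms / 1000
print(worst_case_s, default_case_s)  # 10.0 1.0
```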

Now, the misconfiguration alone is not enough. You also have a network issue that allows the timeout to be reached. In fact, when you shut down the instance, in the ProxySQL error log you will see a series of:

Failed to mysql_real_connect() on 2:int.mysql57-indra-mitra-666.dev.secondlife.io:3306 ... 2013: Lost connection to MySQL server at 'handshake: reading initial communication packet', system error: 111.

But after that there are no more failed mysql_real_connect() calls: the attempt to create a connection hangs for a long time, probably because AWS is blocking traffic to a non-existent instance.

jermlinden commented 1 year ago

Thanks for taking a look René.

mysql-connect_timeout_server was copied from the old example config from when we first started using proxysql and hasn't been updated since. I'll look into updating it.
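For reference, a sketch of lowering the variable from the ProxySQL admin interface (standard admin workflow; verify the variable name and value against your version's documentation):

```sql
-- On the ProxySQL admin interface (typically port 6032)
UPDATE global_variables SET variable_value='1000'
 WHERE variable_name='mysql-connect_timeout_server';
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```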

https://github.com/sysown/proxysql/commit/a8a7677c3d0cb78bda90af248ca48976664d73c6

The config file starts with 3 servers 0, 1, and 2.
On this instance of proxysql, which is not routing any live traffic, I deleted the -2 host and replaced it with a host I could kill without disrupting live traffic, like:

UPDATE mysql_servers
   SET hostname='int.mysql57-indra-mitra-666.dev.secondlife.io'
 WHERE hostname='int.mysql57-indra-mitra-2.dev.secondlife.io' AND hostgroup_id=2;
LOAD MYSQL SERVERS TO RUNTIME;