sysown / proxysql

High-performance MySQL proxy with a GPL license.
http://www.proxysql.com
GNU General Public License v3.0

Too many CLOSE_WAIT connections to backend mysql servers #2908

Open sunit-saavn opened 4 years ago

sunit-saavn commented 4 years ago

I am seeing that there are a lot of backend connections to mysql servers lying in CLOSE_WAIT by proxysql.

> netstat -plant | grep -i "close_wait" | grep proxysql | wc -l
3284

Example of the connection from lsof: proxysql 1451 proxysql 8165u IPv4 11226130 0t0 TCP <VM>:37463-><server_ip>:mysql (CLOSE_WAIT)

The above number is at a time when we had already stopped sending any queries/connections to proxysql. This is the connection pool status on proxysql at that moment :

MySQL [(none)]> select * from stats_mysql_connection_pool \G
*************************** 1. row ***************************
        hostgroup: 1
         srv_host: <write_host>
         srv_port: 3306
           status: ONLINE
         ConnUsed: 0
         ConnFree: 10
           ConnOK: 186
          ConnERR: 0
      MaxConnUsed: 11
          Queries: 624633
Queries_GTID_sync: 0
  Bytes_data_sent: 36748519
  Bytes_data_recv: 77792067
       Latency_us: 1636
*************************** 2. row ***************************
        hostgroup: 2
         srv_host: <read_host>
         srv_port: 3306
           status: SHUNNED
         ConnUsed: 0
         ConnFree: 0
           ConnOK: 642
          ConnERR: 7440
      MaxConnUsed: 444
          Queries: 10724647
Queries_GTID_sync: 0
  Bytes_data_sent: 921711388
  Bytes_data_recv: 18133451013
       Latency_us: 2660

The free connections percentage is 1, the server's max_connections is 1000 (I know it is high), the backend ping interval is 1 second, and wait_timeout is 5 seconds on both the mysql server (required for legacy support) and proxysql. As you can see, there are no open client connections to proxysql and no connections in use, yet there are still a lot of CLOSE_WAIT connections.
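For reference, these settings map roughly to the following ProxySQL admin-interface statements. This is only a sketch using the values above; the variable names are my assumption about which knobs are being described, and ProxySQL times are in milliseconds:

```sql
-- Assumed mapping of the settings described above to ProxySQL 2.0 variables;
-- adjust names and values if your setup differs.
SET mysql-free_connections_pct = 1;
SET mysql-ping_interval_server_msec = 1000;   -- 1 s backend ping interval
SET mysql-wait_timeout = 5000;                -- 5 s, in milliseconds
UPDATE mysql_servers SET max_connections = 1000;  -- add a WHERE clause per server
LOAD MYSQL VARIABLES TO RUNTIME;
LOAD MYSQL SERVERS TO RUNTIME;
```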

ProxySQL version: 2.0.12-38-g58a909a, codename Truls. OS version: CentOS Linux release 7.5.1804 (Core).

The steps to reproduce the issue: none that I know of; this is the first time I have encountered this situation, and the setup is similar to one we use elsewhere. The only issue I see in the logs is a lot of connection/ping timeouts, but I don't expect that to be the cause, since I assume proxysql should handle those gracefully.

The full ProxySQL error log (default location: /var/lib/proxysql/proxysql.log): https://drive.google.com/file/d/12oqpd0C2KiVrpS1QfiuaiFScvM10CsfH/view?usp=sharing

bdelvecchio commented 4 years ago

We have also been seeing this (on 2.0.10-27-g5b319972) in a large production system. Some unknown condition leads to proxySQL connection failures, rendering the database unreachable. This has happened a few times a month since March or so, but 4 times in the last 2 days.

From @sunit-saavn's logs:

2020-06-24 04:31:48 mysql_connection.cpp:947:handler(): [ERROR] Connect timeout on :3306 : exceeded by 7847us

We also see these messages, along with timeouts on existing connections and connection errors between other containers running on the same nodes as ProxySQL (Orchestrator Raft, Prometheus scraper). These connection timeouts (and conns in CLOSE_WAIT) could be a cascading condition as ProxySQL tries repeatedly to connect to mysql until all the node's fds are exhausted.

Our interim solution has been to monitor the CLOSE_WAIT connection count on the nodes running ProxySQL and restart ProxySQL when that count starts climbing. This count will be at or close to zero for days or weeks, then spike to several thousand within a few minutes.
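That monitor is essentially a CLOSE_WAIT counter plus a restart hook. A minimal sketch (the threshold, service name, and exact output format are assumptions, not our exact tooling):

```shell
#!/bin/sh
# Watchdog sketch: count proxysql sockets in CLOSE_WAIT and restart the
# service past a threshold. close_wait_count reads `netstat -plant`-style
# output on stdin so it can be tested without live sockets.
close_wait_count() {
  grep 'proxysql' | grep -c 'CLOSE_WAIT'
}

# Cron driver (hypothetical threshold; wire the restart up to your init system):
# n=$(netstat -plant 2>/dev/null | close_wait_count)
# if [ "$n" -ge 5000 ]; then
#   echo "CLOSE_WAIT sockets = $n ... Restarting proxy"
#   systemctl restart proxysql
# fi
```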

Happy to provide additional information

sunit-saavn commented 4 years ago

For some additional context: proxysql was connecting to the backend mysql servers through a VPN tunnel, and this situation roughly coincided with some issues on the VPN (bandwidth exhausted), during which we suspect the heartbeat pings from proxysql to the backend failed. But is it still expected for connections to lie around in CLOSE_WAIT? They won't be reclaimed unless proxysql closes them. @bdelvecchio Did you also see errors related to too many connections in the proxysql logs along with the timeout issues? These errors have to originate from proxysql, because server-side metrics show connections never came close to max connections. I am just trying to figure out whether this is expected or some kind of leak in proxysql.

bdelvecchio commented 4 years ago

@bdelvecchio Did you also see errors related to too many connections in the proxysql logs along with the timeout issues? These errors have to be from proxysql because on the server side metrics connections never even reached close to the max connections.

Yes, we have seen this as well.

Our cluster is Docker Swarm in EC2 with a Network Load Balancer in front of 2 ProxySQL instances.

bdelvecchio commented 4 years ago

@sunit-saavn do you have experience with this on previous versions? Was there a version before 2.0.12 that you feel confident did not have this problem? Or have you had only one instance of this, seemingly triggered by a VPN outage?

sunit-saavn commented 4 years ago

@bdelvecchio This is the first version of proxysql that we are using in production. So sadly can't comment about previous versions.

renecannao commented 4 years ago

@sunit-saavn : what status do you expect from a connection that wasn't even completed? According to error log, the connections were not established.

sunit-saavn commented 4 years ago

@renecannao If the connections are not completed, shouldn't proxysql close the underlying open sockets rather than leaving them in CLOSE_WAIT state? Or is it the application's responsibility to terminate these connections (I don't know if that's possible)?

renecannao commented 4 years ago

I don't follow your question. If a connection is not established, and proxysql did execute close(), what state do you expect the connection to be in?

There is no issue

sunit-saavn commented 4 years ago

How long is a connection supposed to stay in CLOSE_WAIT? I have seen connections remain in CLOSE_WAIT for hours, and only a restart of proxysql cleans them up. My understanding may be limited here, but per this article CLOSE_WAIT essentially means: "This endpoint has received a close request from the remote endpoint and this TCP is now waiting for a connection termination request from the local application." I assume the local application here is proxysql; if proxysql did execute close() after these failures, I don't understand why the connections would sit in CLOSE_WAIT for hours. If this is expected behaviour, please point me to some documentation around it, and feel free to close the issue.

bskllzh commented 4 years ago

@sunit-saavn, you can try setting connection_max_age_ms=180000 in your configuration file, or change this variable at runtime through the proxysql admin interface.
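Applied at runtime, that would look something like the sketch below. Note (my understanding, worth verifying): `mysql-connection_max_age_ms` closes idle pooled backend connections older than the given age, so it may not reclaim sockets that never completed the handshake.

```sql
-- Sketch: apply the suggested value at runtime via the ProxySQL admin port.
SET mysql-connection_max_age_ms = 180000;
LOAD MYSQL VARIABLES TO RUNTIME;
SAVE MYSQL VARIABLES TO DISK;
```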

bdelvecchio commented 4 years ago

@renecannao wrote:

If a connection is not established, and proxysql did execute close() , what status do you expect the connection to be?

This is not an accurate description of what's happening. CLOSE_WAIT indicates that proxysql received a FIN and sent an ACK, but the connection is held open indefinitely because proxysql has not issued a close(). With proxysql creating these broken connections at high volume (approximately one thousand per minute), we exhaust the available file descriptors. This is why our only recovery path at this time is to restart proxysql.

The problem is not necessarily that existing connections get torn down; rather, proxysql tries to create connections to the mysql backend while in this state, and these connections end up in CLOSE_WAIT. I'm sorry that I can't describe the exact state we're in at this point, but we've been trying to recreate it for months.

bdelvecchio commented 4 years ago

We had another incident Monday, and I'm trying to collect more useful information.

One thing to note: all 5179 of the connections in CLOSE_WAIT state have 86 bytes in the Recv-Q:

Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp       86      0 ip-172-17-0-6.ec2:38878 ip-10-247-1-11.ec:mysql CLOSE_WAIT
tcp       86      0 ip-172-17-0-6.ec2:44620 ip-10-247-1-11.ec:mysql CLOSE_WAIT
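The same sockets can also be listed with ss, which shows the Recv-Q column as well (a sketch; `-t` restricts to TCP, `-n` keeps addresses numeric):

```shell
# List numeric TCP sockets currently in CLOSE_WAIT, including Recv-Q/Send-Q;
# the kernel applies the state filter, so no grep is needed.
ss -tn state close-wait
```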

The count of CLOSE_WAIT connections in the proxysql container climbed in steps over the last 5 days, then spiked (+3205 in 2 minutes):

2020-07-02 00:00:01. Close Wait Socks = 7
2020-07-02 04:02:01. Close Wait Socks = 33
2020-07-02 04:04:01. Close Wait Socks = 108
2020-07-02 05:44:01. Close Wait Socks = 108
2020-07-02 05:46:01. Close Wait Socks = 248
2020-07-03 00:00:01. Close Wait Socks = 276
2020-07-04 00:00:01. Close Wait Socks = 588
2020-07-05 00:00:01. Close Wait Socks = 696
2020-07-06 00:00:02. Close Wait Socks = 1031
2020-07-07 00:00:02. Close Wait Socks = 1907
2020-07-07 01:20:01. Close Wait Socks = 1973
ERROR :  Close Wait Socks = 5178 ... Restarting proxy
ERROR: RESTARTED 740137d84123 at 2020-07-07 01:22:01

In the interval between these last 2 samples, we saw connection timeouts between orchestrator and mysql as well:

[mysql] 2020/07/07 01:20:34 packets.go:36: read tcp 172.17.0.3:38106->10.247.1.11:3306: i/o timeout
[mysql] 2020/07/07 01:20:34 packets.go:36: read tcp 172.17.0.3:38124->10.247.1.11:3306: i/o timeout

We started seeing aborted connections on mysql shortly before this: (screenshot: "Screen Shot 2020-07-07 at 12 39 19 PM")

Here's the view from proxysql: (screenshot: "Screen Shot 2020-07-07 at 1 55 40 PM")

402309718 commented 3 years ago

Hi, how can this be solved? I am also getting this error.

haihq88 commented 3 years ago

Hi, I hit the same problem in my setup. I have proxysql-01 on the app server, with centralized proxysql-10 and proxysql-11 as backend servers of proxysql-01. On the app server (i.e. proxysql-01) there are more than 47820 CLOSE_WAIT connections; when it reaches this value, the app server cannot open any more connections to the centralized proxysql-10 and proxysql-11. This causes errors from the Laravel app: mysql_connection.cpp:969:handler(): [ERROR] Failed to mysql_real_connect() on proxysql-10:6033 , FD (Conn:0 , MyDS:0) , 2002: Can't connect to MySQL server on 'proxysql-10' (99).

Our workaround for now is to restart proxysql.

alex-hohol commented 3 years ago

Hey! We're having the same problem with ProxySQL version 2.0.15: over 7-10 days we accumulate approximately 25k CLOSE_WAIT connections. Same workaround as @haihq88: restarting proxysql once a week.

haihq88 commented 2 years ago

@alex-hohol Restarting proxysql will create downtime for your service. I have a perl script to close all CLOSE_WAIT connections; it works on Linux (CentOS 7 in my current environment). You can run it from cron (daily, for example).


#!/usr/bin/perl
# Originally created by mirage 2014
# For every local connection stuck in CLOSE_WAIT, forges a TCP packet that
# appears to come from the remote peer; the out-of-window sequence number
# makes the kernel reset the connection. Requires root (raw sockets) and
# the Net::RawIP module. Note: this resets EVERY CLOSE_WAIT connection on
# the host, not just proxysql's.
use strict;
use warnings;
use Net::RawIP;

# Emit CLOSE_WAIT connections as "local_ip local_port remote_ip remote_port"
open(my $CONNECTIONS_WAIT, "netstat -tnap | grep CLOSE_WAIT | sed -e 's/::ffff://g' | awk '{print \$4,\$5}' | sed 's/:/ /g' |") || die "Failed: $!\n";

while ( my $conn = <$CONNECTIONS_WAIT> )
{
    chomp $conn;
    my ($src_ip, $src_port, $dst_ip, $dst_port) = split(' ', $conn);

    my $packet = Net::RawIP->new({
        ip  => { frag_off => 0, tos => 0,
                 saddr => $dst_ip, daddr => $src_ip },
        tcp => { dest => $src_port, source => $dst_port,
                 seq => 10, ack => 1 },
    });
    $packet->send;
}
close($CONNECTIONS_WAIT);

rineez-tenzymes commented 2 years ago

@haihq88 I think this perl script is a good idea, but looks like code formatting was not applied correctly on it. Could you please put the whole code inside code block?

haihq88 commented 2 years ago

@haihq88 I think this perl script is a good idea, but looks like code formatting was not applied correctly on it. Could you please put the whole code inside code block?

I updated the comment. The script could be improved by switching from netstat to ss. I hope we won't need the script at all and that proxysql can fix the issue.

teyeheimans commented 11 months ago

We have the same issue. More than 50k connections waiting to be closed.

We tried the perl script, but it did not work. We ended up using this one-liner:

ss --tcp state CLOSE-WAIT '( dport = 3306 )' --kill
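For anyone automating this, a cron-safe wrapper might look like the sketch below. The threshold is an assumption, and `ss --kill` needs root plus a kernel built with `CONFIG_INET_DIAG_DESTROY`, so the destructive step is left commented out:

```shell
#!/bin/sh
# Wrapper sketch around the one-liner above. build_kill_cmd only builds the
# command string, keeping the destructive step explicit and testable.
PORT=${PORT:-3306}

build_kill_cmd() {
  printf "ss --tcp state CLOSE-WAIT '( dport = %s )' --kill" "$1"
}

# n=$(ss --tcp --no-header state CLOSE-WAIT "( dport = $PORT )" | wc -l)
# [ "$n" -ge 5000 ] && eval "$(build_kill_cmd "$PORT")"
```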