sunit-saavn opened this issue 4 years ago
We have also been seeing this (on 2.0.10-27-g5b319972) in a large production system. Some unknown condition leads to ProxySQL connection failures, rendering the database unreachable. This has happened a few times a month since March or so, but 4 times in the last 2 days.
From @sunit-saavn's logs:
2020-06-24 04:31:48 mysql_connection.cpp:947:handler(): [ERROR] Connect timeout on :3306 : exceeded by 7847us
We also see these messages, along with timeouts on existing connections and connection errors between other containers running on the same nodes as ProxySQL (Orchestrator Raft, a Prometheus scraper). These connection timeouts (and connections stuck in CLOSE_WAIT) could be a cascading condition: ProxySQL repeatedly tries to connect to MySQL until all of the node's file descriptors are exhausted.
Our interim solution has been to monitor the CLOSE_WAIT connection count on the nodes running ProxySQL and restart ProxySQL when that count starts climbing. The count will sit at or near zero for days or weeks, then spike to several thousand within a few minutes.
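A minimal sketch of that kind of monitor (the threshold and the restart command are illustrative placeholders, not our production values); it assumes a Linux host with iproute2's `ss`:

```shell
#!/bin/sh
# Count sockets currently stuck in CLOSE_WAIT on this host.
# THRESHOLD is an illustrative value, not the one used in production.
THRESHOLD=1000
count=$(ss -Htan state close-wait | wc -l)
echo "CLOSE_WAIT sockets: ${count}"
if [ "${count}" -gt "${THRESHOLD}" ]; then
    echo "CLOSE_WAIT count above ${THRESHOLD}; restarting proxysql"
    # e.g. systemctl restart proxysql  (or a docker restart, depending on setup)
fi
```

Run from cron every minute or two; the spike from near zero to thousands happens fast enough that a longer interval can miss the window before file descriptors run out.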
Happy to provide additional information.
For some additional context: ProxySQL was connecting to the backend MySQL servers through a VPN tunnel, and this situation roughly coincided with issues on the VPN (bandwidth exhaustion), where we suspect the heartbeat pings from ProxySQL to the backend failed. But is it still expected to have connections lying around in CLOSE_WAIT, given that they won't be reclaimed unless ProxySQL closes them? @bdelvecchio Did you also see "too many connections" errors in the ProxySQL logs along with the timeout issues? These errors have to come from ProxySQL, because on the server side the metrics show connections never came close to max connections. I am just trying to figure out whether this is expected or a leak of some kind in ProxySQL.
@bdelvecchio Did you also see errors related to too many connections in the proxysql logs along with the timeout issues? These errors have to be from proxysql because on the server side metrics connections never even reached close to the max connections.
Yes, we have seen this as well.
Our cluster is Docker Swarm in EC2 with a Network Load Balancer in front of 2 ProxySQL instances.
@sunit-saavn do you have experience with this on previous versions? Was there a version before 2.0.12 that you feel confident did not have this problem? Or have you had only one instance of this, seemingly triggered by a VPN outage?
@bdelvecchio This is the first version of proxysql that we are using in production, so sadly I can't comment on previous versions.
@sunit-saavn: what status do you expect from a connection that wasn't even completed? According to the error log, the connections were never established.
@renecannao If the connections are not completed, shouldn't proxysql close the underlying opened sockets rather than leaving them in CLOSE_WAIT, or is it the application's responsibility to terminate these connections (I don't know if that's even possible)?
I don't follow your question. If a connection is not established, and proxysql did execute close(), what status do you expect the connection to be? There is no issue.
How long is a connection supposed to stay in CLOSE_WAIT? I have seen connections remain in CLOSE_WAIT for hours, and only after a restart of proxysql do they get cleaned up. My understanding may be limited here, but from this article CLOSE_WAIT essentially means: "This endpoint has received a close request from the remote endpoint and this TCP is now waiting for a connection termination request from the local application." I assume the local application here is proxysql. If proxysql did execute close() on any failure, I don't understand why the sockets would sit in CLOSE_WAIT for hours. If this is expected behaviour, then please point me to some documentation describing it, and feel free to close the issue.
@sunit-saavn, you can try setting connection_max_age_ms=180000 in your configuration file, or change the value at runtime through proxysql.
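For reference, a runtime change like the one suggested above would look roughly like this through ProxySQL's admin interface. This is a sketch: the admin port and credentials shown are the defaults (6032, admin/admin) and may differ in your setup, and in the admin interface the variable carries the `mysql-` prefix:

```shell
# Apply connection_max_age_ms at runtime via the ProxySQL admin interface.
# Default admin endpoint/credentials assumed; adjust for your deployment.
mysql -h127.0.0.1 -P6032 -uadmin -padmin -e "
  SET mysql-connection_max_age_ms = 180000;
  LOAD MYSQL VARIABLES TO RUNTIME;  -- apply immediately
  SAVE MYSQL VARIABLES TO DISK;     -- persist across restarts
"
```

Note this only caps the age of connections in ProxySQL's pool; whether it helps with sockets already stuck in CLOSE_WAIT is exactly what this thread is debating.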
@renecannao wrote:
"If a connection is not established, and proxysql did execute close(), what status do you expect the connection to be?"
This is not an accurate description of what's happening. The CLOSE_WAIT state indicates that proxysql received a FIN and sent an ACK, but the connection is held open indefinitely because proxysql has not issued a close(). With proxysql creating these broken connections at high volume (approximately one thousand per minute), we will exhaust the available file descriptors. This is why our only recovery path at this time is to restart proxysql.
The problem is not necessarily that existing connections get torn down; rather, proxysql tries to create connections to the mysql backend while in this state, and these connections end up in CLOSE_WAIT. I'm sorry that I can't describe the exact state that we're in at this point, but we've been trying to recreate it for months.
We had another incident Monday, and I'm trying to collect more useful information.
One thing to note: all 5179 of the connections in CLOSE_WAIT state have 86 bytes in the Recv-Q:
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 86 0 ip-172-17-0-6.ec2:38878 ip-10-247-1-11.ec:mysql CLOSE_WAIT
tcp 86 0 ip-172-17-0-6.ec2:44620 ip-10-247-1-11.ec:mysql CLOSE_WAIT
The count of CLOSE_WAIT connections in the proxysql container climbed in steps over the last 5 days, then spiked (+3205 in 2 minutes):
2020-07-02 00:00:01. Close Wait Socks = 7
2020-07-02 04:02:01. Close Wait Socks = 33
2020-07-02 04:04:01. Close Wait Socks = 108
2020-07-02 05:44:01. Close Wait Socks = 108
2020-07-02 05:46:01. Close Wait Socks = 248
2020-07-03 00:00:01. Close Wait Socks = 276
2020-07-04 00:00:01. Close Wait Socks = 588
2020-07-05 00:00:01. Close Wait Socks = 696
2020-07-06 00:00:02. Close Wait Socks = 1031
2020-07-07 00:00:02. Close Wait Socks = 1907
2020-07-07 01:20:01. Close Wait Socks = 1973
ERROR : Close Wait Socks = 5178 ... Restarting proxy
ERROR: RESTARTED 740137d84123 at 2020-07-07 01:22:01
In the interval between these last 2 samples, we saw connection timeouts between orchestrator and mysql as well:
[mysql] 2020/07/07 01:20:34 packets.go:36: read tcp 172.17.0.3:38106->10.247.1.11:3306: i/o timeout
[mysql] 2020/07/07 01:20:34 packets.go:36: read tcp 172.17.0.3:38124->10.247.1.11:3306: i/o timeout
We started seeing aborted connections on mysql shortly before this: [graph omitted]
Here's the view from proxysql: [screenshot omitted]
Hi, how do I solve this? I am also getting the error.
Hi, I'm hitting the same problem.
In my setup, I have proxysql-01 on the app server, with centralized proxysql-10 and proxysql-11 configured as backend servers of proxysql-01.
On the app server (i.e. proxysql-01) there are more than 47820 CLOSE_WAIT connections. Once it reaches that level, the app server cannot open any more connections to the centralized proxysql-10 and proxysql-11, which causes errors from the Laravel app:
mysql_connection.cpp:969:handler(): [ERROR] Failed to mysql_real_connect() on proxysql-10:6033 , FD (Conn:0 , MyDS:0) , 2002: Can't connect to MySQL server on 'proxysql-10' (99).
The workaround for now is to restart proxysql.
Hey! Having the same problem with our ProxySQL (version 2.0.15): over 7-10 days we accumulate approximately 25k CLOSE_WAIT connections. Same workaround as @haihq88, restarting proxysql once a week.
@alex-hohol restarting proxysql will create downtime for your service. I have a perl script to close all CLOSE_WAIT connections; it works on Linux (CentOS 7 in my current environment). You can run it from cron (daily, for example).
#!/usr/bin/perl
# Originally created by mirage 2014
# For each local socket stuck in CLOSE_WAIT, sends a forged TCP packet that
# appears to come from the remote peer, prompting the local stack to tear the
# connection down. Requires root and the Net::RawIP / Net::Pcap / NetPacket
# CPAN modules.
use strict;
use warnings;
use Socket;
use Net::RawIP;
use Net::Pcap;
use NetPacket::Ethernet qw(:strip);
use NetPacket::IP qw(:strip);
use NetPacket::TCP;
use POSIX qw(setsid);

# List CLOSE_WAIT sockets as "local_ip local_port remote_ip remote_port".
open(my $CONNECTIONS_WAIT,
     "netstat -tulnap | grep CLOSE_WAIT | sed -e 's/::ffff://g' | awk '{print \$4,\$5}' | sed 's/:/ /g' |")
    || die "Failed: $!\n";

while ( my $conn = <$CONNECTIONS_WAIT> ) {
    chomp $conn;
    my ($src_ip, $src_port, $dst_ip, $dst_port) = split(' ', $conn);

    # Forge a packet from the remote endpoint back to the stuck local socket.
    my $packet = Net::RawIP->new({
        ip  => { frag_off => 0, tos => 0,
                 saddr => $dst_ip, daddr => $src_ip },
        tcp => { dest => $src_port, source => $dst_port,
                 seq => 10, ack => 1 },
    });
    $packet->send;
}
close($CONNECTIONS_WAIT);
@haihq88 I think this perl script is a good idea, but looks like code formatting was not applied correctly on it. Could you please put the whole code inside code block?
I updated the comment. The script could be improved by switching from netstat to ss, and I hope we won't need the script at all once proxysql fixes the issue.
We have the same issue: more than 50k connections waiting to be closed.
We tried the perl script, but it did not work. We ended up using this one-liner (note: ss --kill requires root and a kernel built with CONFIG_INET_DIAG_DESTROY):
ss --tcp state CLOSE-WAIT '( dport = 3306 )' --kill
I am seeing a lot of backend connections to the mysql servers left in CLOSE_WAIT by proxysql.
Example connection from lsof:
proxysql 1451 proxysql 8165u IPv4 11226130 0t0 TCP <VM>:37463-><server_ip>:mysql (CLOSE_WAIT)
The above was captured at a time when we had already stopped sending any queries/connections to proxysql. This was the connection pool status on proxysql at that moment: [screenshot omitted]
The free connection percentage is 1, the max connections for the server is 1000 (I know that is high), the ping interval for backend connections is 1 second, and the wait timeout at both the mysql server (required for legacy support) and proxysql is 5 seconds. As you can see, there are no open client connections to proxysql and no connections in use, but there are still a lot of CLOSE_WAIT connections.
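To see whether the stuck sockets cluster on a particular backend, a per-remote breakdown can help when correlating with the connection pool stats. A sketch assuming iproute2's `ss` and backends listening on the standard MySQL port 3306 (adjust the port for your setup):

```shell
# Group CLOSE_WAIT sockets by remote (backend) address.
# With a state filter, ss omits the State column, so the peer
# address is the last field of each row.
ss -Htan state close-wait '( dport = :3306 )' \
    | awk '{print $NF}' | sort | uniq -c | sort -rn
```

If one backend dominates the output, that points at a network path (e.g. the VPN tunnel mentioned earlier in the thread) rather than a uniform proxysql-side leak.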
ProxySQL version: 2.0.12-38-g58a909a, codename Truls
OS version: CentOS Linux release 7.5.1804 (Core)
Steps to reproduce the issue: no exact steps that I know of. I have encountered this situation for the first time, on a setup similar to one we use elsewhere. The only issue I see in the logs is a lot of connection/ping timeouts, but I would not expect that to be the cause, since I assume proxysql handles those gracefully.
The full ProxySQL error log (default location: /var/lib/proxysql/proxysql.log): https://drive.google.com/file/d/12oqpd0C2KiVrpS1QfiuaiFScvM10CsfH/view?usp=sharing