rfminelli / lusca-cache

Automatically exported from code.google.com/p/lusca-cache
GNU General Public License v2.0

FreeBSD - seeing ECONNABORTED in cache.log #37


GoogleCodeExporter commented 8 years ago

2009/07/05 15:01:30| httpAccept: FD 58: accept failure: (53) Software caused connection abort

Original issue reported on code.google.com by adrian.c...@gmail.com on 5 Jul 2009 at 2:32

GoogleCodeExporter commented 8 years ago
A bit of snooping on the FreeBSD developer channel suggested that the TCP timewait zone is being overflowed.

Indeed:

squid-1#  vmstat -z | head -1 ; vmstat -z | grep -i tcptw
ITEM                     SIZE     LIMIT      USED      FREE  REQUESTS  FAILURES
tcptw:                     88,     8232,      873,     7359, 118015778,     5479

The suggestion is to bump "net.inet.tcp.maxtcptw" from its current value (8191) to something higher and see if the issue goes away.
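For reference, a sketch of that tuning step; the new limit below is illustrative, not a recommendation, and should be sized against the FAILURES column from vmstat:

```shell
# Check the current limit and the timewait zone failure counter first
sysctl net.inet.tcp.maxtcptw
vmstat -z | grep -i tcptw

# Roughly double the timewait zone limit (illustrative value)
sysctl net.inet.tcp.maxtcptw=16384

# Persist the setting across reboots
echo 'net.inet.tcp.maxtcptw=16384' >> /etc/sysctl.conf
```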

Original comment by adrian.c...@gmail.com on 5 Jul 2009 at 2:34

GoogleCodeExporter commented 8 years ago
It's still occurring.

It's not listen queue overflows:

squid-1# netstat -sp tcp | grep -i listen
        0 listen queue overflows

Original comment by adrian.c...@gmail.com on 5 Jul 2009 at 2:37

GoogleCodeExporter commented 8 years ago
From a diff of netstat -sp tcp after 10 seconds; this is under "packets received":

-               26350808 discarded due to memory problems
+               26351094 discarded due to memory problems

Let's track down that particular counter in the TCP/IP statistics code and see exactly what is responsible for incrementing it.
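The counter diff above can be reproduced with something like the following (same FreeBSD netstat invocation as used in this thread):

```shell
# Snapshot TCP stats, wait, snapshot again, and diff the counters
netstat -sp tcp > /tmp/tcpstats.before
sleep 10
netstat -sp tcp > /tmp/tcpstats.after
diff -u /tmp/tcpstats.before /tmp/tcpstats.after | grep 'memory problems'
```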

Original comment by adrian.c...@gmail.com on 5 Jul 2009 at 2:46

GoogleCodeExporter commented 8 years ago
That counter is part of the TCP reassembly code.

Check:
squid-1# sysctl net.inet.tcp.reass
net.inet.tcp.reass.overflows: 26398434
net.inet.tcp.reass.maxqlen: 48
net.inet.tcp.reass.cursegments: 267
net.inet.tcp.reass.maxsegments: 16384

net.inet.tcp.reass.overflows has been steadily rising. maxqlen has been bumped to 256 with no (current) adverse effect, but I wonder what else needs to be bumped. What about nmbclusters?
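A sketch of the reassembly tuning plus a quick loop to watch whether the overflow counter keeps climbing (256 is the value tried above):

```shell
# Raise the per-connection reassembly queue length (value from above)
sysctl net.inet.tcp.reass.maxqlen=256

# Sample the overflow counter every 10 seconds (Ctrl-C to stop)
while :; do
    sysctl -n net.inet.tcp.reass.overflows
    sleep 10
done
```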

In any case, this still hasn't helped with ECONNABORTED.

Original comment by adrian.c...@gmail.com on 5 Jul 2009 at 3:08

GoogleCodeExporter commented 8 years ago
Something to look at tomorrow morning.

comm_call_handlers() will call the read handler if read_event is 1 (under the right circumstances), but what about if it's -1?

do_check_incoming() is invoked in various places; it calls do_call_incoming(), which calls comm_call_handlers(fd, -1, -1). This means that accept() is going to be attempted a -whole lot- of times even if the FD isn't currently flagged to be checked. How valid is this, exactly? In theory, accept() should just return a shiny non-fatal error if no FDs are ready, but is this -truly- going to be the case here with Squid?

Original comment by adrian.c...@gmail.com on 5 Jul 2009 at 7:55

GoogleCodeExporter commented 8 years ago
Also, grovelling around the kernel code has provided some potential gems.

There are only a few places where ECONNABORTED is returned:

http://fxr.watson.org/fxr/ident?v=FREEBSD7;im=excerpts;i=ECONNABORTED

It may be worthwhile just hacking the kernel up to add printf()s in the places where this value is set, then running the proxy in production for a few minutes to see which code paths lead to connections being aborted like this.
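A possibly less invasive alternative to kernel printf()s, on kernels with DTrace support, is the fbt provider. The function names below are assumptions about the running kernel and need verifying (e.g. with `dtrace -l` or against the fxr source), so treat this as an untested sketch:

```shell
# Print a kernel stack whenever soabort() runs, to see which
# code path is aborting pending connections (soabort is assumed
# to exist on this kernel; verify with: dtrace -l | grep abort)
dtrace -n 'fbt::soabort:entry { stack(); }'

# Catch accept paths returning ECONNABORTED (errno 53 on FreeBSD);
# for fbt return probes, arg1 is the function's return value
dtrace -n 'fbt::kern_accept:return / arg1 == 53 / { stack(); }'
```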

The bursty way it occurs makes me wonder what the root cause of the issue is most likely to be. It could be a local resource starvation issue on the box, or it could be something upstream (e.g. a NAT gateway) getting wholly upset with the session counts.

Another thing I've been pondering: given the server is spoofing client IPs as well as server-side IPs, are there any PCB hash collisions? That certainly needs to be investigated.

Original comment by adrian.c...@gmail.com on 5 Jul 2009 at 8:31