MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout

GoogleCodeExporter commented 9 years ago

We just upgraded to memcached 1.4.18 using PHP 5.4.27 and the memcache 
extension 3.0.8. After the upgrade getting intermittent:

MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network 
timeout

Using the script suggested in the wiki: 
http://consoleninja.net/code/memcached/mc_conn_tester.pl immediately see 
timeouts:

➜  ~  ./mc_conn_tester.pl       
Fail: (timeout: 1) (elapsed: 1.00027108) (conn: 0.00198221) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00025678) (conn: 0.00038099) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00024295) (conn: 0.00049210) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00023890) (conn: 0.00036407) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00027013) (conn: 0.00049710) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00025988) (conn: 0.00034285) (set: 0.00000000) 
(get: 0.00000000)
^CAverages: (conn: 0.00070960) (set: 0.00047751) (get: 0.00041979)

Is this a bug in the new version of memcached?

Original issue reported on code.google.com by i...@nodesocket.com on 22 Apr 2014 at 10:33

GoogleCodeExporter commented 9 years ago

I know we talked it over in IRC a bit, but one more question:

does mc_conn_tester.pl always fail, or is it intermittent?

Original comment by dorma...@rydia.net on 23 Apr 2014 at 12:40

GoogleCodeExporter commented 9 years ago

It always failed, but after restarting memcached so far no problems. Will 
advise if anything comes up again.

Original comment by notifica...@commando.io on 23 Apr 2014 at 1:46

GoogleCodeExporter commented 9 years ago

I'm deeply suspicious if mc_conn_tester.pl is failing 100% but you can 
successfully telnet to it :/

Original comment by dorma...@rydia.net on 23 Apr 2014 at 1:48

GoogleCodeExporter commented 9 years ago

The memcached timeout issues are back on the same host, here is the output of 
mc_conn_tester.pl.

➜  ~  ./mc_conn_tester.pl 
Fail: (timeout: 1) (elapsed: 1.00027609) (conn: 0.00065994) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00024891) (conn: 0.00054693) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00022912) (conn: 0.00031996) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00024796) (conn: 0.00060511) (set: 0.00000000) 
(get: 0.00000000)
^CAverages: (conn: 0.00058240) (set: 0.00062472) (get: 0.00041881)

Also, seeing warnings thrown from PHP:

E_NOTICE: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: 
Network timeout (0)

What else can I get you to help assist? Thanks.

Original comment by i...@nodesocket.com on 23 Apr 2014 at 5:17

GoogleCodeExporter commented 9 years ago

I had to restart memcached, but of the errors (this is a production host), 
however I ran:

stats and stats cons

Before restarting: https://gist.github.com/nodesocket/fe822ef51eb0abfb8f9b

Hopefully that helps debug this.

Original comment by i...@nodesocket.com on 23 Apr 2014 at 5:28

GoogleCodeExporter commented 9 years ago

I'm really concerned that you're getting mc_conn_tester to fail each time but 
can still connect via telnet. That should be impossible.

I see there's a timing value for conn: but not set:, which means it's timing 
out on the set command.

If it happens again, can you telnet and try some set or get commands?

set foo 0 0 2
hi

(hit enter twice after typing hi). Or catch me in IRC.

If it does happen again, can you attach a gdb session to it and get a stack 
trace?

gdb -p $(pidof memcached)
thread apply all bt

.. that'll show if something is holding onto a lock, hopefully.

Original comment by dorma...@rydia.net on 24 Apr 2014 at 2:19

GoogleCodeExporter commented 9 years ago

Will do, I expect this issue to popup sometime tomorrow. I have gdb ready when 
it does.

Original comment by jus...@commando.io on 24 Apr 2014 at 6:15

GoogleCodeExporter commented 9 years ago

no luck?

Original comment by dorma...@rydia.net on 26 Apr 2014 at 6:28

GoogleCodeExporter commented 9 years ago

So far no errors, very strange. Will keep you posted, thanks for the help, 
really appreciated.

Original comment by i...@nodesocket.com on 26 Apr 2014 at 8:12

GoogleCodeExporter commented 9 years ago

I'm planning on releasing 1.4.19 sometime tomorrow. So far I haven't had any 
other reports of a similar issue (~5,000 downloads of .18). If you get some gdb 
info before then I'll try to sneak in a fix.

Still pretty suspicious of it not being our issue. The connection code changed 
a bit but you wouldn't be able to telnet in if that broke... having the sets 
hang is pretty bizarre.

but, I won't rule it out until we see more of what's going on with you.

Original comment by dorma...@rydia.net on 28 Apr 2014 at 7:26

GoogleCodeExporter commented 9 years ago

Speak of the devil. Just got an alert:

E_NOTICE: MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: 
Network timeout (0)

Also running mc_conn_tester.pl on the host in question is failing again:

➜  ~  ./mc_conn_tester.pl 
Fail: (timeout: 1) (elapsed: 1.00025582) (conn: 0.00079894) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00021482) (conn: 0.00055695) (set: 0.00000000) 
(get: 0.00000000)
Fail: (timeout: 1) (elapsed: 1.00023007) (conn: 0.00053406) (set: 0.00000000) 
(get: 0.00000000)
^CAverages: (conn: 0.00052620) (set: 0.00053876) (get: 0.00034102)

What would you like me to do?

Original comment by notifica...@commando.io on 28 Apr 2014 at 9:40

GoogleCodeExporter commented 9 years ago

Just more notes: we worked this out a bit in IRC.

He's not running from a source build... which I think should have (some?) line 
info.

If it doesn't, he'll have to start it again with the memcached-debug binary.

Before you do that though, try to catch me in IRC. We can do 
instruction-stepping if the line numbering is still missing... it'll just be a 
lot more painful to look at.

I've also e-mailed steven in case he has any ideas, but I haven't had time to 
stare more at the code today (and might not, busy this week).

Original comment by dorma...@rydia.net on 28 Apr 2014 at 10:54

GoogleCodeExporter commented 9 years ago

Thanks for the update. I am now running from source from the home directory. 
When it happens again, I'll jump into IRC and we try instruction-stepping.

Thanks.

Original comment by notifica...@commando.io on 28 Apr 2014 at 11:51

GoogleCodeExporter commented 9 years ago

heh.. if I'm not in IRC you might want to update here :P

I made a few more mistakes... I should've asked you to save a core, and also 
run gdb a few times.

However from what I see in backlog is useful: there's a line number!

That line is:
        switch(c->state) {

... which might seem useless, and would be a lot better with some stepping or 
printing of variables (really wishing I knew what c->state is..). GDB is a 
skill you should pick up.

Anyway, between 1.4.17 and 1.4.18 one new c->state was added to the routine, 
conn_closed. Under the debug binary that assert's. Under the non-debug binary 
however it'll endlessly loop. So pretty high odds a connection is ending up in 
conn_closed state within there.

Find me in IRC and we'll walk through a quick patch to confirm this. (or if it 
happens again before you see this, try to find me).

thanks!

Original comment by dorma...@rydia.net on 29 Apr 2014 at 4:51

GoogleCodeExporter commented 9 years ago

So it turned out that c->state was set to "conn_closed".

Looking at this a bit more closely:

conn_closed is only ever set from conn_close().

conn_close() is only called from two spots: once, before drive_machine() is 
entered (and early returned from), and once from within drive_machine(), 
directly above the conn_closed case and with a stop/break before it:

        case conn_closing:
            if (IS_UDP(c->transport))
                conn_cleanup(c);
            else
                conn_close(c);
            stop = true;
            break;

        case conn_closed:

... conn_close() always deletes the event from the stack, closes the 
filehandle, etc.

So, I don't see how this could happen... yet...

None of the code changed between 1.4.17 or 1.4.18 seems to add new paths which 
could cause a connection to re-fire.

If conn_close() is called there's no way for that to loop again (stop = true).

What would have to happen is the event_handler firing again, on the closed 
connection, which the fd for is closed, the event is deleted, but the memory 
not reused just yet... It would then enter drive_machine with the state already 
set to conn_closing, and never trip a stop = true, and not assert since it's 
not a debug binary.

Which is fucking terrifying. If this happened in the old code it'd just keep 
running into conn_closing and re-closing itself. though I was pretty sure that 
calling event_del() twice causes a crash.

The other possibility is that a UDP socket is getting closed... except that 
also deletes the event, and closes the socket, so no new events should happen.

I did a quick test and added a second conn_close() call and.... it doesn't 
cause a crash. it causes the curr_connections counter to slowly drift. That is 
wild.

I just pushed: 
https://github.com/memcached/memcached/commit/ee961e456457728ba78057961eca357eda
ea1ec1

...up to master.

I'm still a bit suspicious that I'm missing something important here... so I'm 
not doing a release tonight, but I might do one early tomorrow anyway.

Reporter is currently running a version of this patch in production; if his 
thing hangs up again, or doesn't self-recover once hitting the condition, we'll 
have a better idea I guess.

Original comment by dorma...@rydia.net on 30 Apr 2014 at 7:44

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

Er clarifying some typos:

"It would then enter drive_machine() with the state already in conn_closed"

-> in the old code, conn_closing is the final state (not edited again during a 
close() call), so if it refired it'd immediately drop back through the closing 
bits and re-close itself.

we're probably just not using libevent right. I bet fixing this bug makes 
maxconns_fast a bit more reliable... I've never seen someone with a negative 
curr_connections counter though. So I'm still suspicious.

Original comment by dorma...@rydia.net on 30 Apr 2014 at 7:54

GoogleCodeExporter commented 9 years ago

Been a bit under a day. Any output in your screen session yet?

Original comment by dorma...@rydia.net on 30 Apr 2014 at 9:06

GoogleCodeExporter commented 9 years ago

Nothing yet, no output in the screen process.

Original comment by i...@nodesocket.com on 1 May 2014 at 12:08

GoogleCodeExporter commented 9 years ago

Turned out the upstream packager somehow built .18 against libevent-1.4.13 
despite eariler versions being against 2.x. It's probably just a broken build.

Further fiddling with gdb showed that the event_del() routine had either not 
happened or happened incorrectly, so it was still linked in the libevent queue. 
The patch I put in prevented it from causing hangs but it still spun cpu. I 
might leave that in just in case.

I'd been over the code a bunch of times and it seemed pretty impossible... so 
that's probably why.

but it seems like it's not our bug. leaving this open for a few more days just 
in case though.

Original comment by dorma...@rydia.net on 1 May 2014 at 10:05

GoogleCodeExporter commented 9 years ago

Any repeat crashes? I'm going to close this. it looks like remi shipped .19. 
reopen or open a new one if it hangs in the same way somehow...

Well. 19 won't be printing anything, and it won't hang, but if it's actually 
our bug and not libevent it would end up spinning CPU. Keep an eye out I guess.

Original comment by dorma...@rydia.net on 3 May 2014 at 5:20

Changed state: Invalid

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Hi, I have exactly same problem described here. My error message is:

MemcachePool::set(): Server host.domain.net (tcp 11211, udp 0) failed with: 
Network timeout (0)

I'm using latest PHP 5.5.12, memcache 3.0.8 from remi repo on Centos 6.5. We 
have no problems with Memcached version 1.4.4 from repo base, If i want to 
upgrade Memcached from remi repo I got this message since version 1.4.17 till 
1.4.20. It is not related to network issues.

I have two versions of libevent on system:

libevent.x86_64                        1.4.13-4.el6                   @base     
libevent-last.x86_64                   2.0.21-4.el6.remi              @remi    

Regards,

Pavel

Original comment by pavel.hl...@profimedia.cz on 21 May 2014 at 9:58

GoogleCodeExporter commented 9 years ago

This was a race condition fixed with:

https://github.com/memcached/memcached/commit/e73bc2e5c0794cccd6f8ece63bc16433c4
0ed766

Upgrade to 1.4.20. Works for us.

Original comment by i...@nodesocket.com on 21 May 2014 at 2:52

liuqian1990 / memcached

MemcachePool::get(): Server 127.0.0.1 (tcp 11211, udp 0) failed with: Network timeout #363