socketio / socket.io

Realtime application framework (Node.JS server)
https://socket.io
MIT License
60.88k stars 10.09k forks source link

High rate of "client not handshaken should reconnect" #438

Closed squidfunk closed 10 years ago

squidfunk commented 13 years ago

I am running a chat server with node.js / socket.io and have a lot of "client not handshaken" warnings. In the peak time there are around 1.000 to 3.000 open TCP connections.

For debugging purposes I plotted the graph of actions succeeding the server-side "set close timeout" event, because the warnings are always preceded by those, so the format is:

Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 2098080741242069807
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - xhr-polling closed due to exceeded duration
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 330973265416677743
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - setting request GET /socket.io/1/xhr-polling
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 10595896332140683620
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - cleared close timeout for client 10595896332140683620
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 21320636051749821863
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - cleared close timeout for client 21320636051749821863
--
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   debug - set close timeout for client 3331715441803393577
Mon Aug 01 2011 08:16:01 GMT+0200 (CEST)   warn  - client not handshaken client should reconnect

The following plot explained:

Plot Full size

I did not change the default timeouts and intervals provided by socket.io, but I think it is very strange, that there is a peak of handshake errors at around 10 seconds (even surpassing the successful cleared close timeouts!). Did anyone experience a similar situation?

Best regards, Martin

denisu commented 13 years ago

Hi,

were you able to solve it? I am still having this problem with 0.7.8. I am not able to reproduce it on my machine, but i can see it in the debug logs, some clients go really crazy (looks like more than 50 reconnects/second). I have this problem only with jsonp and xhr connections, turning off one of them didn't help though.

   debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393095&i=1
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - jsonppolling writing io.j[1]("7:::1+0");
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - xhr-polling writing 7:::1+0
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393150&i=1
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - jsonppolling writing io.j[1]("7:::1+0");
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   info  - transport end
   debug - cleared close timeout for client 487577450665437510
   debug - discarding transport
   debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
   debug - setting poll timeout
   debug - clearing poll timeout
   debug - xhr-polling writing 7:::1+0
   debug - set close timeout for client 487577450665437510
   warn  - client not handshaken client should reconnect
   ...

I am using custom namespaces btw.

denisu commented 13 years ago

TCP dump of one of those terror-connections, if that helps (this goes on continously).

10:21:13.450665 IP (removed).8433 > (removed).65471: Flags [P.], seq 1973377331:1973377336, ack 537321861, win 54, length 5
@.@....(!....E ...u.Y3 ...P..6L....2::.
10:21:13.480742 IP 77.12.111.190.64720 > 188.40.33.215.8433: Flags [P.], seq 29040:29700, ack 6557, win 4169, length 660
....E...V.@.z...M.o..(!... ..%.)M...P..I^p..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072071&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)

10:21:13.481279 IP (removed).8433 > (removed).64720: Flags [P.], seq 6557:6706, ack 29700, win 9911, length 149
...$!).q..E.....@.@....(!.M.o. ...M....%..P.&..7..HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0

io.j[1]("7:::1+0");
10:21:13.504212 IP (removed).64725 > (removed).8433: Flags [P.], seq 21428:21915, ack 6249, win 4356, length 487
.M.o..(!... ....\W,..P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)

10:21:13.504593 IP 188.40.33.215.8433 > 77.12.111.190.64725: Flags [P.], seq 6249:6391, ack 21915, win 7846, length 142
...$!).q..E.....@.@....(!.M.o. ...W,.....CP.......HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 7
Connection: Keep-Alive
Access-Control-Allow-Origin: *

7:::1+0
10:21:13.542058 IP (removed).64720 > (removed).8433: Flags [P.], seq 29700:30360, ack 6706, win 4132, length 660
....E...V.@.z...M.o..(!... ..%..M.._P..$Ro..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072149&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)

10:21:13.542557 IP (removed).8433 > (removed).64720: Flags [P.], seq 6706:6855, ack 30360, win 9921, length 149
...$!).q..E.....@.@....(!.M.o. ...M.._.%.QP.&.....HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0

io.j[1]("7:::1+0");
10:21:13.567452 IP (removed).64725 >(removed)8433: Flags [P.], seq 21915:22402, ack 6391, win 4320, length 487
.M.o..(!... ....CW,.^P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)
squidfunk commented 13 years ago

Hi denisu,

actually for the last 2 weeks these high rate of handshake warnings was not as important as the crashes of my client's chat server due to memory leaks. These we're hopefully fixed with the latest pull of 3rd-eden into the master. In the next days I will be at my client's site again to check whether the chat server has crashed and investigate those handshaken warnings again.

For now they don't seem to be very severe (at least not as severe as the crashes). I will keep you updated here.

denisu commented 13 years ago

Hi, thank you for the reply!

You are right, its not that critical. The server can handle this high rate of connections easily. I hope it doesn't cause any problems on the client side. I still wasn't able to reproduce it, but until now I had this problem with clients using Firefox 5.0, Opera 11 and IE 8.0 over the xhr and jsonp transports.

I will let you know if I found out more.

gasteve commented 13 years ago

I've been having problems with this too...wondering whether there's something I should be doing on the client side that I'm not.

denisu commented 13 years ago

Yeah, for about 500 connected clients, there are 2 to 5 clients constantly reconnecting (absolutely no delay). Clients with a super fast internet connection were reconnecting so fast that I had to block their IP in the firewall, since it was affecting the server.

@squidfunk can you reopen this issue?

squidfunk commented 13 years ago

I tried to investigate the problem but didn't find anything so far. However It may probably related to another problem, which is caused by the fact that socket.io doesn't enable Flashsocket transport by default. This leads to a problem in IE 6 and 7 when the chat server is not on the same domain as the client script, as IE 6 and 7 seem to prohibit cross-origin long polling (no problem with IE 8 though). Therefore it is not possible with IE 6 or 7 to connect. Maybe this is related.

I reopened it. Actually I closed it because I thought I was the only one experiencing the problem and that it may be too specific. But it doesn't seem to be.

rafaelbrizola commented 13 years ago

This problem happened to me when I'm using Firefox 3.6.18 and change the codification from UTF-8 to ISO-8859-1.

squidfunk commented 13 years ago

@rafaelbrizola: can you give a little more information? Did the "client not handshaken" warnings increase or weren't you be able to connect at all? Actually my chat server is running (has to run) in an ISO-8859-1 environment (sadly enough).

denisu commented 13 years ago

My app runs in an ISO-8859-1 environment too, and the node server is not on the same domain as the client script. Maybe this could be some hint on whats wrong :). But I have seen all common browsers (except the ones that support Websockets) having this problem. Still investigating...

akostibas commented 13 years ago

I've been experiencing the same thing since moving to 0.7.7 (now on 0.7.9). Also with a cross-domain Node/webserver setup. The biggest problem is that I have to set the log level to 0 to avoid filling up the tiny EC2 disk.

Let me know if I can provide anything useful for debugging.

cowboyrushforth commented 12 years ago

Hi there,

Also experiencing this issue intermittently with brand new deployment of socket.io 8.4.

Seems to occur 20-30 minutes after server start. Until then, everything is normal, everyone is happy.

But after about 20-30 minutes in, 90% of the messages in the log are simply 'client not handshaken should reconnect'.

Once node instances are restarted ( we are running 8 of them ), then the issues goes away, until it comes back 20-30 minutes later.

For what its worth, we are load balancing (with ip based session persistence) all 8 of the node processes.

Would be happy to provide any more details on our setup. Cheers, and thanks for Socket.IO! Awesome!

3rd-Eden commented 12 years ago

I suspected that changing https://github.com/LearnBoost/socket.io/blob/master/lib/transport.js#L157 from .end to .close But I havent been able to verify my finds yet. But I also see the same rate of requests coming in.

But it's combination of fail from the server and fail from the client side code.

cowboyrushforth commented 12 years ago

@3rd-Eden - I have put your suggested fix into production for almost 45 minutes now, and rate of 'client not handshaken...' messages has significantly dropped. There is still occasionally these messages coming in, but they are about equal to log messages of real actual message passing activity. I will continue to watch this issue for a bit, but you may have solved it.

Thanks a million! Will keep this thread up to date if my situation changes.

cowboyrushforth commented 12 years ago

I may have spoke a bit too soon. Given enough time, it seems that all the node instances do still fall back into the loop of client not handshaken messages. But for some reason it seems as though it lasts longer because they need to be restarted.

joefaron commented 12 years ago

I'm seeing this issue as well with about 700 connected clients.. How do you re-connect?

My log is flooded with: warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect

tommypowerz commented 12 years ago

having the same poblem... also about 500-1000 connections...

but it is also crashing with this from time to time.. :

node: src/uv-common.c:92: uv_err_name: Assertion `0' failed.

and i dont know how to catch this...

joefaron commented 12 years ago

I cleaned up a lot of my code on quit()'s on disconnects and fixed my client side socket.io code to not poll longer than 20 and now I'm not getting this.. pretty sure its the client side code goin wacky.. reconnecting and starting dozens of connections per client.

xtassin commented 12 years ago

Only got this when using XHR polling or JSONP transports. It stopped when forcing to Sockets and FlashSockets only.

ryanto commented 12 years ago

I'm seeing this happen with IE clients using XHR polling

nickiv commented 12 years ago

XHR polling seems most suitable for me, so I keep digging it. I think the problem is in reconnect method of socket. It does call handhsake method again and again while handhsake requests (both jsonp and xhr) are not being cancelled at all. Under certain network conditions responces for handhake can be delayed and when they eventually arrive fierce reconnect begins.

Now I got the method to reproduce bug. Suppose we have socket.io server running on 8080. Connect a client via xhr-polling from FF. Then add firewall rule on server: iptables -A OUTPUT -p tcp -m tcp --sport 8080 -j DROP You can see in the network section of firebug some handhake requests pending. Then drop rule: iptables -D OUTPUT -p tcp -m tcp --sport 8080 -j DROP After that reconnection begins.

In my opinion we should not call handhake unless previos one fails. And there must be a timer in it to decide failure.

dominiek commented 12 years ago

Any news on this? This is pretty serious

joelanman commented 12 years ago

I have this issue on socket.io 0.85, it takes up 100% CPU, and even if I kill node and restart, it starts straight back up with loads of 'client not handshaken' warnings. Any workarounds?

Update:

Tracked it down to a client running on IE9 - killed it and issue has gone, but surely a single client shouldnt be doing this?

ryanto commented 12 years ago

@joelanman make sure you ie9 client a socket.io client version 0.8.5.

joelanman commented 12 years ago

thanks - must have just been a client/server mismatch

cowboyrushforth commented 12 years ago

Have been having this problem, even with matching client+server on 0.8.5, but this patch in this thread: https://github.com/LearnBoost/socket.io/issues/534 seems to have definitely helped.

3rd-Eden commented 12 years ago

@cowboyrushforth helped or fixed?

cowboyrushforth commented 12 years ago

I deployed the patch about 36 hours ago, and it seems that the handshake errors continue to slowly decrease over time. (As connected clients finally refresh and download new client side code)

I will continue to keep an eye on the rate of these things and report back in another day or two.

cowboyrushforth commented 12 years ago

Ok its now been 72 hours+ since applying the patch in issue 534, things seem much more stable. No out of control clients, and no high rate of handshake errors. Cheers

liamdon commented 12 years ago

In which file should I apply the patch? Is this going to be fixed in 0.8.6?

3rd-Eden commented 12 years ago

@liamdom it's already the master of socket.io-client so yes, it will be available for the next release

denisu commented 12 years ago

I don't think the base of that problem is solved in 0.8.6. The warnings are gone, but the clients are still reconnecting at a really high rate.

io.sockets.on('connection', function(client) {
    console.log('Client ' + client.id + ' connected');
});

... outputs overall at the same rate like before.

liamdon commented 12 years ago

Agreed, neither the above patch nor 0.8.6 have solved this - I'm getting a very high rate of warn - client not handshaken client should reconnect when using xhr-polling. @cowboyrushforth is the fix still working out for you?

pfeiffer commented 12 years ago

Also seeing this with 0.8.6.

cowboyrushforth commented 12 years ago

@liamdon and others - sorry for misleading you, the patch I thought had worked to solve it completely turns out that the issue still comes back, albeit I've seen it much less, seems like its overall an infrequent occurrence now. I have not been extremely scientific in this, just recording/perusing/munging logs, etc, but as I look now they are piling up with these handshake requests again.

cowboyrushforth commented 12 years ago

Also, fwiw, am running socket.io behind load balanced setup. In recent days a theory has formed that it only occurs only on some clients after the node.js/socket.io server they are speaking to may go away (ec2 drama, or node/app crash), and then the load balancer assigns them to another node.js/socket.io server.

In every testing environment failing scenario with every browser from ie8,9-ff3-7,safari,etc, this works fine, the clients realize the session is invalid after a few seconds (sometimes the polling length, 20 or so), but somehow, in production, with some weird browsers, I think this causes the client to go into a fierce reconnection loop. Haven't been able to reproduce this one reliably which is the crux of it. Anyone else having this issue also behind load balancers or high availability setups where the socket.io session may be severed?

pfeiffer commented 12 years ago

I'm seeing this with clients connecting directly to node.js/socket.io - no load balancers.

dominiek commented 12 years ago

Same here

liamdon commented 12 years ago

@cowboyrushforth Yep I'm behind an Amazon Elastic Load Balancer and using the redis store. I haven't been able to reliably reproduce it either, except in production where, after 20 mins, our socket.io servers are effectively DDOS'd by the handshaking clients.

Do you have a current interim workaround? Don't use xhr-polling?

ryanto commented 12 years ago

the hardest thing about this is that no one is able to reproduce it, yet we all see it happening. i'm still getting these, not as much, but they do still happen.

@cowboy i originally thought it was load balancing so i forced all of my traffic to one node for a week, it still continued to happen.

fwiw, the only time i was able to reproduce this was runing ie9 with a client/server version mismatch. i had 0.8.4 on the client, and 0.8.5 on the server. make sure you have your clients request the latest version when you update the server.

thats all i got, sorry...

thekiur commented 12 years ago

A high number of 'client not handshaken client should reconnect' warnings can be produced by restarting the server quickly. It seems that the client will attempt to use old handshake ID. A refresh on the client side is required to get a working connection. Not sure on what browsers that happens, but if i enable logging, i see lots of those warnings if i restart the server quickly enough. They will graudually fade away.

hovu96 commented 12 years ago

Same on my EC2 instances when using 'xhr-polling' transport. I cannot reproduce it myself (with my browsers) but the log file of the production instance contains tons of this warning.

Is there anything I can try that helps finding out what's going wrong pleas let me know!

denisu commented 12 years ago

I also noticed that many clients which produced handshaking warnings in 0.8.5 are in 0.8.6 recognized as connected for about half a second or less, that makes my node instance running at almost 100% cpu. 0.8.5 run at 5% to 10%.

hovu96 commented 12 years ago

I just tried version 0.8.5 but get the same result (as with 0.8.6): after about 20 seconds after launching the node server (about 100 connections) I get a lot of the "client not handshaken client should reconnect" warnings...

crishoj commented 12 years ago

I was seeing a spew of "client not handshaken client should reconnect" with socket.io-client and Node v0.5.10-pre. After downgrading to Node v0.4.12 the issue seems to have disappeared.

denisu commented 12 years ago

I have this problem with v0.4.8.

pfeiffer commented 12 years ago

I'm seeing this with v0.4.8 and v0.4.12.

I've tried reproducing it, but have been unsuccessful so far. It happens after a restart of the node instance, where some clients reconnect like crazy - this can cause the node instance to fail and when restarted the reconnecting-loop starts again.

I'm seeing this issue more frequently in socket.io v0.8.6 than v0.8.5.

rauchg commented 12 years ago

The solution for this will be in v0.8.7

pfeiffer commented 12 years ago

@guille You're the man! :-)

dominiek commented 12 years ago

Does this mean we've actually found the problem and been able to reproduce it?

I'm saying this because there's been a lot of "reappearances" of this problem in the past months :(