Closed squidfunk closed 10 years ago
Hi,
were you able to solve it? I am still having this problem with 0.7.8. I am not able to reproduce it on my machine, but i can see it in the debug logs, some clients go really crazy (looks like more than 50 reconnects/second). I have this problem only with jsonp and xhr connections, turning off one of them didn't help though.
debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393095&i=1
debug - setting poll timeout
debug - clearing poll timeout
debug - jsonppolling writing io.j[1]("7:::1+0");
debug - set close timeout for client 487577450665437510
warn - client not handshaken client should reconnect
info - transport end
debug - cleared close timeout for client 487577450665437510
debug - discarding transport
debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
debug - setting poll timeout
debug - clearing poll timeout
debug - xhr-polling writing 7:::1+0
debug - set close timeout for client 487577450665437510
warn - client not handshaken client should reconnect
info - transport end
debug - cleared close timeout for client 487577450665437510
debug - discarding transport
debug - setting request GET /socket.io/1/jsonp-polling/487577450665437510?t=1312872393150&i=1
debug - setting poll timeout
debug - clearing poll timeout
debug - jsonppolling writing io.j[1]("7:::1+0");
debug - set close timeout for client 487577450665437510
warn - client not handshaken client should reconnect
info - transport end
debug - cleared close timeout for client 487577450665437510
debug - discarding transport
debug - setting request GET /socket.io/1/xhr-polling/487577450665437510
debug - setting poll timeout
debug - clearing poll timeout
debug - xhr-polling writing 7:::1+0
debug - set close timeout for client 487577450665437510
warn - client not handshaken client should reconnect
...
I am using custom namespaces btw.
TCP dump of one of those terror-connections, if that helps (this goes on continously).
10:21:13.450665 IP (removed).8433 > (removed).65471: Flags [P.], seq 1973377331:1973377336, ack 537321861, win 54, length 5
@.@....(!....E ...u.Y3 ...P..6L....2::.
10:21:13.480742 IP 77.12.111.190.64720 > 188.40.33.215.8433: Flags [P.], seq 29040:29700, ack 6557, win 4169, length 660
....E...V.@.z...M.o..(!... ..%.)M...P..I^p..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072071&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)
10:21:13.481279 IP (removed).8433 > (removed).64720: Flags [P.], seq 6557:6706, ack 29700, win 9911, length 149
...$!).q..E.....@.@....(!.M.o. ...M....%..P.&..7..HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0
io.j[1]("7:::1+0");
10:21:13.504212 IP (removed).64725 > (removed).8433: Flags [P.], seq 21428:21915, ack 6249, win 4356, length 487
.M.o..(!... ....\W,..P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)
10:21:13.504593 IP 188.40.33.215.8433 > 77.12.111.190.64725: Flags [P.], seq 6249:6391, ack 21915, win 7846, length 142
...$!).q..E.....@.@....(!.M.o. ...W,.....CP.......HTTP/1.1 200 OK
Content-Type: text/plain; charset=UTF-8
Content-Length: 7
Connection: Keep-Alive
Access-Control-Allow-Origin: *
7:::1+0
10:21:13.542058 IP (removed).64720 > (removed).8433: Flags [P.], seq 29700:30360, ack 6706, win 4132, length 660
....E...V.@.z...M.o..(!... ..%..M.._P..$Ro..GET /socket.io/1/jsonp-polling/12189419471609411629?t=1312878072149&i=1 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: */*
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Cookie: (removed)
10:21:13.542557 IP (removed).8433 > (removed).64720: Flags [P.], seq 6706:6855, ack 30360, win 9921, length 149
...$!).q..E.....@.@....(!.M.o. ...M.._.%.QP.&.....HTTP/1.1 200 OK
Content-Type: text/javascript; charset=UTF-8
Content-Length: 19
Connection: Keep-Alive
X-XSS-Protection: 0
io.j[1]("7:::1+0");
10:21:13.567452 IP (removed).64725 >(removed)8433: Flags [P.], seq 21915:22402, ack 6391, win 4320, length 487
.M.o..(!... ....CW,.^P.......GET /socket.io/1/xhr-polling/12189419471609411629 HTTP/1.1
Host: (removed):8433
User-Agent: Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: de-de,de;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Referer: (removed)
Origin: (removed)
Hi denisu,
actually for the last 2 weeks these high rate of handshake warnings was not as important as the crashes of my client's chat server due to memory leaks. These we're hopefully fixed with the latest pull of 3rd-eden into the master. In the next days I will be at my client's site again to check whether the chat server has crashed and investigate those handshaken warnings again.
For now they don't seem to be very severe (at least not as severe as the crashes). I will keep you updated here.
Hi, thank you for the reply!
You are right, its not that critical. The server can handle this high rate of connections easily. I hope it doesn't cause any problems on the client side. I still wasn't able to reproduce it, but until now I had this problem with clients using Firefox 5.0, Opera 11 and IE 8.0 over the xhr and jsonp transports.
I will let you know if I found out more.
I've been having problems with this too...wondering whether there's something I should be doing on the client side that I'm not.
Yeah, for about 500 connected clients, there are 2 to 5 clients constantly reconnecting (absolutely no delay). Clients with a super fast internet connection were reconnecting so fast that I had to block their IP in the firewall, since it was affecting the server.
@squidfunk can you reopen this issue?
I tried to investigate the problem but didn't find anything so far. However It may probably related to another problem, which is caused by the fact that socket.io doesn't enable Flashsocket transport by default. This leads to a problem in IE 6 and 7 when the chat server is not on the same domain as the client script, as IE 6 and 7 seem to prohibit cross-origin long polling (no problem with IE 8 though). Therefore it is not possible with IE 6 or 7 to connect. Maybe this is related.
I reopened it. Actually I closed it because I thought I was the only one experiencing the problem and that it may be too specific. But it doesn't seem to be.
This problem happened to me when I'm using Firefox 3.6.18 and change the codification from UTF-8 to ISO-8859-1.
@rafaelbrizola: can you give a little more information? Did the "client not handshaken" warnings increase or weren't you be able to connect at all? Actually my chat server is running (has to run) in an ISO-8859-1 environment (sadly enough).
My app runs in an ISO-8859-1 environment too, and the node server is not on the same domain as the client script. Maybe this could be some hint on whats wrong :). But I have seen all common browsers (except the ones that support Websockets) having this problem. Still investigating...
I've been experiencing the same thing since moving to 0.7.7 (now on 0.7.9). Also with a cross-domain Node/webserver setup. The biggest problem is that I have to set the log level to 0 to avoid filling up the tiny EC2 disk.
Let me know if I can provide anything useful for debugging.
Hi there,
Also experiencing this issue intermittently with brand new deployment of socket.io 8.4.
Seems to occur 20-30 minutes after server start. Until then, everything is normal, everyone is happy.
But after about 20-30 minutes in, 90% of the messages in the log are simply 'client not handshaken should reconnect'.
Once node instances are restarted ( we are running 8 of them ), then the issues goes away, until it comes back 20-30 minutes later.
For what its worth, we are load balancing (with ip based session persistence) all 8 of the node processes.
Would be happy to provide any more details on our setup. Cheers, and thanks for Socket.IO! Awesome!
I suspected that changing https://github.com/LearnBoost/socket.io/blob/master/lib/transport.js#L157 from .end
to .close
But I havent been able to verify my finds yet. But I also see the same rate of requests coming in.
But it's combination of fail from the server and fail from the client side code.
@3rd-Eden - I have put your suggested fix into production for almost 45 minutes now, and rate of 'client not handshaken...' messages has significantly dropped. There is still occasionally these messages coming in, but they are about equal to log messages of real actual message passing activity. I will continue to watch this issue for a bit, but you may have solved it.
Thanks a million! Will keep this thread up to date if my situation changes.
I may have spoke a bit too soon. Given enough time, it seems that all the node instances do still fall back into the loop of client not handshaken messages. But for some reason it seems as though it lasts longer because they need to be restarted.
I'm seeing this issue as well with about 700 connected clients.. How do you re-connect?
My log is flooded with: warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect warn - client not handshaken client should reconnect
having the same poblem... also about 500-1000 connections...
but it is also crashing with this from time to time.. :
node: src/uv-common.c:92: uv_err_name: Assertion `0' failed.
and i dont know how to catch this...
I cleaned up a lot of my code on quit()'s on disconnects and fixed my client side socket.io code to not poll longer than 20 and now I'm not getting this.. pretty sure its the client side code goin wacky.. reconnecting and starting dozens of connections per client.
Only got this when using XHR polling or JSONP transports. It stopped when forcing to Sockets and FlashSockets only.
I'm seeing this happen with IE clients using XHR polling
XHR polling seems most suitable for me, so I keep digging it. I think the problem is in reconnect method of socket. It does call handhsake method again and again while handhsake requests (both jsonp and xhr) are not being cancelled at all. Under certain network conditions responces for handhake can be delayed and when they eventually arrive fierce reconnect begins.
Now I got the method to reproduce bug. Suppose we have socket.io server running on 8080. Connect a client via xhr-polling from FF. Then add firewall rule on server: iptables -A OUTPUT -p tcp -m tcp --sport 8080 -j DROP You can see in the network section of firebug some handhake requests pending. Then drop rule: iptables -D OUTPUT -p tcp -m tcp --sport 8080 -j DROP After that reconnection begins.
In my opinion we should not call handhake unless previos one fails. And there must be a timer in it to decide failure.
Any news on this? This is pretty serious
I have this issue on socket.io 0.85, it takes up 100% CPU, and even if I kill node and restart, it starts straight back up with loads of 'client not handshaken' warnings. Any workarounds?
Update:
Tracked it down to a client running on IE9 - killed it and issue has gone, but surely a single client shouldnt be doing this?
@joelanman make sure you ie9 client a socket.io client version 0.8.5.
thanks - must have just been a client/server mismatch
Have been having this problem, even with matching client+server on 0.8.5, but this patch in this thread: https://github.com/LearnBoost/socket.io/issues/534 seems to have definitely helped.
@cowboyrushforth helped or fixed?
I deployed the patch about 36 hours ago, and it seems that the handshake errors continue to slowly decrease over time. (As connected clients finally refresh and download new client side code)
I will continue to keep an eye on the rate of these things and report back in another day or two.
Ok its now been 72 hours+ since applying the patch in issue 534, things seem much more stable. No out of control clients, and no high rate of handshake errors. Cheers
In which file should I apply the patch? Is this going to be fixed in 0.8.6?
@liamdom it's already the master of socket.io-client so yes, it will be available for the next release
I don't think the base of that problem is solved in 0.8.6. The warnings are gone, but the clients are still reconnecting at a really high rate.
io.sockets.on('connection', function(client) {
console.log('Client ' + client.id + ' connected');
});
... outputs overall at the same rate like before.
Agreed, neither the above patch nor 0.8.6 have solved this - I'm getting a very high rate of warn - client not handshaken client should reconnect
when using xhr-polling. @cowboyrushforth is the fix still working out for you?
Also seeing this with 0.8.6.
@liamdon and others - sorry for misleading you, the patch I thought had worked to solve it completely turns out that the issue still comes back, albeit I've seen it much less, seems like its overall an infrequent occurrence now. I have not been extremely scientific in this, just recording/perusing/munging logs, etc, but as I look now they are piling up with these handshake requests again.
Also, fwiw, am running socket.io behind load balanced setup. In recent days a theory has formed that it only occurs only on some clients after the node.js/socket.io server they are speaking to may go away (ec2 drama, or node/app crash), and then the load balancer assigns them to another node.js/socket.io server.
In every testing environment failing scenario with every browser from ie8,9-ff3-7,safari,etc, this works fine, the clients realize the session is invalid after a few seconds (sometimes the polling length, 20 or so), but somehow, in production, with some weird browsers, I think this causes the client to go into a fierce reconnection loop. Haven't been able to reproduce this one reliably which is the crux of it. Anyone else having this issue also behind load balancers or high availability setups where the socket.io session may be severed?
I'm seeing this with clients connecting directly to node.js/socket.io - no load balancers.
Same here
@cowboyrushforth Yep I'm behind an Amazon Elastic Load Balancer and using the redis store. I haven't been able to reliably reproduce it either, except in production where, after 20 mins, our socket.io servers are effectively DDOS'd by the handshaking clients.
Do you have a current interim workaround? Don't use xhr-polling?
the hardest thing about this is that no one is able to reproduce it, yet we all see it happening. i'm still getting these, not as much, but they do still happen.
@cowboy i originally thought it was load balancing so i forced all of my traffic to one node for a week, it still continued to happen.
fwiw, the only time i was able to reproduce this was runing ie9 with a client/server version mismatch. i had 0.8.4 on the client, and 0.8.5 on the server. make sure you have your clients request the latest version when you update the server.
thats all i got, sorry...
A high number of 'client not handshaken client should reconnect' warnings can be produced by restarting the server quickly. It seems that the client will attempt to use old handshake ID. A refresh on the client side is required to get a working connection. Not sure on what browsers that happens, but if i enable logging, i see lots of those warnings if i restart the server quickly enough. They will graudually fade away.
Same on my EC2 instances when using 'xhr-polling' transport. I cannot reproduce it myself (with my browsers) but the log file of the production instance contains tons of this warning.
Is there anything I can try that helps finding out what's going wrong pleas let me know!
I also noticed that many clients which produced handshaking warnings in 0.8.5 are in 0.8.6 recognized as connected for about half a second or less, that makes my node instance running at almost 100% cpu. 0.8.5 run at 5% to 10%.
I just tried version 0.8.5 but get the same result (as with 0.8.6): after about 20 seconds after launching the node server (about 100 connections) I get a lot of the "client not handshaken client should reconnect" warnings...
I was seeing a spew of "client not handshaken client should reconnect" with socket.io-client and Node v0.5.10-pre. After downgrading to Node v0.4.12 the issue seems to have disappeared.
I have this problem with v0.4.8.
I'm seeing this with v0.4.8 and v0.4.12.
I've tried reproducing it, but have been unsuccessful so far. It happens after a restart of the node instance, where some clients reconnect like crazy - this can cause the node instance to fail and when restarted the reconnecting-loop starts again.
I'm seeing this issue more frequently in socket.io v0.8.6 than v0.8.5.
The solution for this will be in v0.8.7
@guille You're the man! :-)
Does this mean we've actually found the problem and been able to reproduce it?
I'm saying this because there's been a lot of "reappearances" of this problem in the past months :(
I am running a chat server with node.js / socket.io and have a lot of "client not handshaken" warnings. In the peak time there are around 1.000 to 3.000 open TCP connections.
For debugging purposes I plotted the graph of actions succeeding the server-side "set close timeout" event, because the warnings are always preceded by those, so the format is:
The following plot explained:
Full size
I did not change the default timeouts and intervals provided by socket.io, but I think it is very strange, that there is a peak of handshake errors at around 10 seconds (even surpassing the successful cleared close timeouts!). Did anyone experience a similar situation?
Best regards, Martin