socketio / socket.io

Realtime application framework (Node.JS server)
https://socket.io
MIT License

Sockets stay established/connected at the operating system level but are disconnected in socket.io #1910

Closed marco-aa closed 8 years ago

marco-aa commented 9 years ago

Hello, my problem: on the connection event I count new connections, and on the disconnect event I decrease the counter. Internally it looks good, but at the operating system level the sockets stay established and their number keeps growing (see my log below).

Please look at my second post below; I think I found out what's going on.

Any ideas on how to monitor this in more detail? Could it be a problem between Node and Socket.IO on one side and the interaction at the OS level on the other? At the moment I have to restart the Node.js process every few hours. Any ideas what to check? (By the way, we use {'sync disconnect on unload': true} on the client side.)

Server: Debian 7.7 fully updated, 4 GB RAM, 4 CPUs, KVM server, express@4.10.6, Socket.IO 1.2.1, Node.js v0.10.31. I could not find bugs or issues related to this kind of problem. Yesterday I patched the host server up to 7.7 and upgraded Node.js to v0.10.33; the number of sockets at the OS level is still growing.

Network settings: I used the Debian defaults plus:

echo "1024 64000" > /proc/sys/net/ipv4/ip_local_port_range
echo "1" > /proc/sys/net/ipv4/tcp_tw_reuse
echo "15" > /proc/sys/net/ipv4/tcp_fin_timeout

Counted on server/OS level:

Fr 12. Dez 18:51:05 CET 2014
      8 CLOSING
      3 FIN_WAIT1
      3 FIN_WAIT2
      9 LAST_ACK
      1 LISTEN
      4 SYN_RECV
   1308 TIME_WAIT
   8009 VERBUNDEN/Established
   9343

Counted internally in Socket.IO/Node.js:

root@aaaaaa:~# tail -f /var/log/push.log
Fri Dec 12 2014 18:46:33 GMT+0100 (CET) 2475
Fri Dec 12 2014 18:47:03 GMT+0100 (CET) 2397
Fri Dec 12 2014 18:47:33 GMT+0100 (CET) 2271
Fri Dec 12 2014 18:48:03 GMT+0100 (CET) 2306
Fri Dec 12 2014 18:48:33 GMT+0100 (CET) 2431
Fri Dec 12 2014 18:49:03 GMT+0100 (CET) 2437
Fri Dec 12 2014 18:49:33 GMT+0100 (CET) 2394
Fri Dec 12 2014 18:50:03 GMT+0100 (CET) 2248
Fri Dec 12 2014 18:50:33 GMT+0100 (CET) 2297
Fri Dec 12 2014 18:51:03 GMT+0100 (CET) 2277

Counted on server/OS level:

Fr 12. Dez 23:41:17 CET 2014
      6 CLOSING
      1 FIN_WAIT1
      8 FIN_WAIT2
     17 LAST_ACK
      1 LISTEN
      4 SYN_RECV
   1081 TIME_WAIT
  10417 VERBUNDEN/Established
  11542

Counted internally in Socket.IO/Node.js:

root@aaaaaa:~# tail -f /var/log/push.log
Fri Dec 12 2014 23:36:31 GMT+0100 (CET) 2034
Fri Dec 12 2014 23:37:01 GMT+0100 (CET) 2021
Fri Dec 12 2014 23:37:31 GMT+0100 (CET) 2004
Fri Dec 12 2014 23:38:01 GMT+0100 (CET) 1977
Fri Dec 12 2014 23:38:31 GMT+0100 (CET) 1989
Fri Dec 12 2014 23:39:01 GMT+0100 (CET) 1972
Fri Dec 12 2014 23:39:31 GMT+0100 (CET) 1999
Fri Dec 12 2014 23:40:01 GMT+0100 (CET) 2021
Fri Dec 12 2014 23:40:31 GMT+0100 (CET) 2005
Fri Dec 12 2014 23:41:01 GMT+0100 (CET) 2013
Fri Dec 12 2014 23:41:31 GMT+0100 (CET) 2027

marco-aa commented 9 years ago

I guess the connected/established sockets are related to the XHR/JSON connection followed by the upgrade to websocket. One of them is not removed by Node/Socket.IO.

I installed the socket.io chat example and used DEBUG=* node index.js. I did one page load and counted the open sockets: 2 sockets open, fine. I closed the page and saw the disconnect event in the console, but 1 socket stayed connected. I repeated this with a second browser at the same time, with the same result.

If I open a second tab in the same browser, the number of sockets grows only by 1. A third browser gives the same result as mentioned before.

You can do the same, but leave the page open and wait a long time: one of the connected sockets is removed, with no log entry in socket.io. If I then close the page later, I get a logfile entry and no sockets remain at the OS level.

Can somebody reproduce it? Is this behaviour correct? (If you have a high number of page reloads, one of the sockets stays in the connected/established state for a relatively long time; this can grow into a real problem because we have a lot of different users with different browsers within a short time.)

dbackowski commented 9 years ago

I got the same problem.

On the client, set:

transports: ['websocket', 'xhr-polling', 'polling', 'htmlfile', 'flashsocket']

This eliminates the problem in Chrome/Firefox/Safari; not sure about IE (IE >= 10 should work too).
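
A minimal client-side sketch of the idea (websocket listed first); this assumes socket.io-client 1.x, where the supported transport names are 'websocket' and 'polling', and https://example.com is a placeholder URL:

// Prefer websocket first so the initial polling connection is never created
// and then left behind after the upgrade.
var socket = io('https://example.com', {
  transports: ['websocket', 'polling']
});

socket.on('connect', function () {
  console.log('connected via ' + socket.io.engine.transport.name);
});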

marco-aa commented 9 years ago

Thanks @dbackowski,

I tried it, and it reduced the number of "hanging" sockets. I guess that's because if websocket is used from the start, socket.io does not need to upgrade the connection in the background from the other polling transports.

I have now upgraded to socket.io 1.3.5 and Node.js 0.10.38, including express 4.12.3; still the same problem. I also checked my Express connections, counting incoming requests and decrementing the counter the moment I send the answer; in the end, all requests are answered and ended with res.end() by me ;)
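
For reference, a sketch of the kind of request counting described above (assuming Express 4.x; the names are illustrative, not the actual code):

var express = require('express');
var app = express();

var inFlight = 0;

// Count requests in flight: increment on arrival, decrement once the
// response has been fully sent ('finish'). Aborted requests are ignored
// here to keep the sketch short.
app.use(function (req, res, next) {
  inFlight++;
  res.on('finish', function () {
    inFlight--;
  });
  next();
});

setInterval(function () {
  console.log(new Date() + ' in-flight requests: ' + inFlight);
}, 30000);

app.listen(3000);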

nicofff commented 9 years ago

Hi,

I've been fighting with this issue on our site for a while now.

First I found issue #2026, which pointed to an error in the ws dependency. Since no release had happened with the fixed dependency, I started using the master branch of this repo in our application.

That change didn't fix the issue.

After that I found this issue and applied the fix suggested by @dbackowski. We did see an improvement with respect to one client opening several connections and keeping one of them open. But after a few days we went back to look at the issue, and the socket count for the process was still increasing.

I went so far as to set "websocket" as the only available transport for both the client (as we have our own custom fallback logic) AND the server (thinking that maybe some clients hadn't refreshed the site since the client changes), and we are still seeing a steady increase in the divergence between the clients using the app and the number of open sockets on the server.

Here is a plot of the last 16 hours or so: [plot: Socket.IO socket count]

The series plotted there are: the output of io.sockets.length, the output of netstat -putona | grep "SERVER_IP:PORT" | grep off | wc -l, and the difference between them.
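
For what it's worth, a sketch of logging both counts side by side from inside the Node process (assuming io is the socket.io server; the port 9000, the interval, and the netstat invocation are placeholders):

var exec = require('child_process').exec;

setInterval(function () {
  var internal = io.engine.clientsCount; // engine.io's own idea of connected clients
  exec("netstat -ant | grep ':9000' | grep ESTABLISHED | wc -l", function (err, stdout) {
    if (err) { return console.error(err); }
    var osLevel = parseInt(stdout.trim(), 10);
    console.log('internal=' + internal + ' os=' + osLevel + ' diff=' + (osLevel - internal));
  });
}, 60000);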

calzoneman commented 9 years ago

@nicofff I was actually wrong about blaming ws for my socket leak in #2026. I still observed an fd leak even with transports forced to polling (initially I did not observe this, but that turned out to be an anomaly from a dataset that was too small). Updating the dependency should not affect anything, since the outdated dependency I mentioned in that issue turned out to be for socket.io-client and not the server, so it shouldn't be causing server-side leaks.

On another note, I am still seeing this issue as well, with both node v0.10 and v0.12 (I haven't tried io.js yet). Currently I'm just stuck restarting the application when it gets close to ulimit -n to prevent it from encountering EMFILE and crashing.

EDIT: An alternative to your netstat command for counting sockets is ls -l /proc/{pid}/fd | grep socket | wc -l, where {pid} is replaced by the PID of the node process.
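
The same number can also be read from inside the process on Linux; a sketch of the idea:

var fs = require('fs');

// Count this process's open socket file descriptors, equivalent to
// `ls -l /proc/self/fd | grep socket | wc -l`.
function countSocketFds() {
  return fs.readdirSync('/proc/self/fd').filter(function (fd) {
    try {
      return fs.readlinkSync('/proc/self/fd/' + fd).indexOf('socket:') === 0;
    } catch (e) {
      return false; // fd was closed between readdir and readlink
    }
  }).length;
}

setInterval(function () {
  console.log('open socket fds: ' + countSocketFds());
}, 60000);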

nicofff commented 9 years ago

New plot after a week:

[plot: Node socket count after a week]

@calzoneman I'm running with only the "websocket" transport (both client and server). If you are seeing the same problem with only polling, then maybe we can rule out the transport as a variable. There IS also an issue with polling causing dual connections from the client that remain open, as said above by @marco-aa, and using websocket as the first transport (as per @dbackowski) does fix that.

I'll probably get a new environment running to see if I can reproduce it in a more controlled way. It would be great if the devs could jump in. I'm willing to provide any data I can to help fix this.

tobiasmuecksch commented 9 years ago

Is anyone working on this issue? I'd love to use socket.io for my project, but this is not acceptable for software with a large number of connections.

apeace commented 9 years ago

I'm also experiencing this issue. I'm willing to help in any way I can.

I'm using: socket.io v1.3.6, node v0.12.7

We have only ~200 clients connected, yet we see our open sockets grow from 200 to 1024 every 24 hours, before an EMFILE exception is raised and the process restarts.

I've been monitoring the sockets that Node has open using lsof -i -n -P | grep node | wc -l. It starts around ~200, as I would expect. Over the course of several hours it grows, eventually reaching ulimit -n (1024 in my case) and causing an EMFILE exception that crashes the process.

I compare this to io.server.engine.clientsCount, which I log on each connect and disconnect event. It is consistently around ~200, which is the number of clients we know we have.

My clients occasionally disconnect, then reconnect. My server and clients both have very detailed application logging, which allows me to confirm that socket.io on both the client and server side thinks the client is disconnecting and reconnecting soon after.

But when inspecting the output of lsof -i -n -P | grep node, I see that there are multiple sockets ESTABLISHED to the same clients, which does not align with io.server.engine.clientsCount on the server side or the application logging on the client side.

I have not been able to reproduce this in a controlled manner. I have tried having a client connect and disconnect repeatedly to a testing server, but have not seen ESTABLISHED sockets grow the way they do on my production deployment.

Regarding the second comment by @marco-aa:

I installed the socket.io chat example and used DEBUG=* node index.js. I did one page load and counted the open sockets: 2 sockets open, fine. I closed the page and saw the disconnect event in the console, but 1 socket stayed connected.

This sounds like Chrome leaving a TCP connection open in case it needs to perform more HTTP requests. Even if you close the tab, Chrome will leave a TCP connection open for a while; it does this for all sites you visit. If you completely quit Chrome it will close that connection, or it will close it after a few minutes. It sounds like @marco-aa also hit on the real issue here, but I think the extra socket he noticed when using the sample application was just Chrome. Just thought I'd mention it because that confused me for a bit, too.

ghost commented 9 years ago

I'm also experiencing this issue (I'm using the same versions of node and socket.io as @apeace). It can be reproduced very easily by cutting the connection between client and server long enough for the client to realize it is disconnected. After the connection is restored, there is another open file descriptor on the server side each time (the previous one is not closed).

This is a very serious problem for me: with the default ulimit parameters, my socket.io server cannot run for more than a few hours in a production environment.

apeace commented 9 years ago

@pist, could you explain in more detail how you reproduced that?

whifflevcq commented 9 years ago

I have reported a problem in #2184. @apeace, can you help me?

apeace commented 9 years ago

@whifflevcq I suggest closing #2184 as it is a duplicate of this issue. Hopefully if we can get enough comments on this thread we can get some attention to it.

If I have time this weekend I am going to search for the bug in the socket.io codebase. I have been holding off because I am very busy at my job and don't have time to dive into an unfamiliar codebase. There has been no response on this thread from the socket.io developers, and no response in the socket.io Slack for the last 3 days.

It would be a huge help if @pist (or anyone) could describe a method to reliably reproduce the issue. I think that would be a great starting point for me to debug.

whifflevcq commented 9 years ago

I will close issue #2184. I have tested with both the polling and websocket transports. With polling, when the client disconnects, the connections that did not end up carrying the socket stay in ESTABLISHED state. With websocket, the connection carrying the socket stays ESTABLISHED (when the client opens a socket, three TCP connections are established, and one of the three ends up carrying the socket connection).

apeace commented 9 years ago

I am able to reproduce the issue consistently now. See this gist for the NodeJS scripts I used.

I ran server.js on one machine, and client.js on another. Using iptables, I periodically cut the connection between the two:

# from the client machine, block connections to destination port 9000
sudo iptables -A OUTPUT -p tcp --dport 9000 -j DROP

# wait 30s for client to say 'disconnect'...

# from the client machine, unblock connections
sudo iptables -F

# wait a few seconds for client to say 'connect'...
# rinse and repeat

Then, using lsof -i -n -P | grep node on the server, I can see the open sockets growing. Most of the time, after a disconnect and reconnect, a socket remains, but not always: sometimes the server cleans up the socket properly. After 15 iterations, I had 7 sockets left open (with 0 clients actually connected from socket.io's perspective).

I did this by running the server on OS X, and running the client inside a VirtualBox Ubuntu machine.

apeace commented 9 years ago

Today I repeated the above experiment, but in server.js I specified ONLY the polling transport:

var options = {
  transports: ['polling']
};
var io = require('socket.io')(options);

I am seeing the same result when blocking/unblocking using iptables: the number of sockets on the server continues to grow. This confirms what others have said (@calzoneman, @whifflevcq).

calzoneman commented 9 years ago

I'm going to try running your demonstration to see if I can reproduce the problem. I had tried doing this without iptables by manually killing my network connection on my laptop, but that was tedious and didn't give any promising results.

whifflevcq commented 9 years ago

This issue has been open since December 2014, about 9 months! As for me, I have been researching this issue for 2 months. Maybe I should stop the project and wait...

calzoneman commented 9 years ago

@apeace Unfortunately I was not able to reproduce your result. I ran the server on my computer and the client in a VM, and left a script running on the VM to block/unblock the connection periodically. After many cycles I still see no leak in the server process's fd list. This is a very difficult bug to track down; I spent hours digging into it a few months ago and never made any progress.

apeace commented 9 years ago

@calzoneman thanks so much! That is a big help.

I'm wondering: did your script wait for the client and server to emit disconnect before unblocking the port again? In my case, I had to wait roughly 30 seconds to be sure I saw the disconnect happen each time. Then, after doing iptables -F again, the client would quickly connect again, and I would see a new socket open (while the old socket was still open).

If you did in fact do it this way, can you provide more detail on your setup? I'd be very interested if there are any important differences between our setups that may trigger the problem for me but not for you. This is how I was doing it:
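
Roughly, a loop along these lines (a sketch, not the actual script; it assumes root on the client machine and the same port 9000 as above):

var exec = require('child_process').exec;

// Block outgoing traffic to the server port, wait long enough for the
// client to report 'disconnect', unblock, wait for it to reconnect, repeat.
function blockThenUnblock() {
  exec('iptables -A OUTPUT -p tcp --dport 9000 -j DROP', function () {
    setTimeout(function () {
      exec('iptables -F', function () {
        setTimeout(blockThenUnblock, 10000); // give the client time to reconnect
      });
    }, 40000); // give the client time to report 'disconnect'
  });
}

blockThenUnblock();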

@whifflevcq Yes, it is frustrating. But thankfully the community is helping out. I think it was really helpful that you closed your duplicate issue; now we can focus our efforts here. The socket.io developers are very busy, but I've gotten in touch with one of them, and I think that together we should be able to solve this soon :)

One workaround you can use (so you don't have to stop your project) is to listen for EMFILE exceptions and restart the process when they occur. That is what my company is doing in production, and it is okay. Luckily the socket.io client reconnects very quickly when the server restarts, so there is no real disruption to our users.
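
A sketch of that workaround (assuming the process runs under a supervisor such as forever, pm2 or systemd that restarts it on exit; the handling is illustrative):

process.on('uncaughtException', function (err) {
  if (err.code === 'EMFILE') {
    console.error('Too many open files; exiting so the supervisor can restart the process');
  } else {
    console.error(err.stack || err);
  }
  process.exit(1); // don't try to keep running after an uncaught exception
});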

If you want help doing that for now, ping me in the socket.io Slack (I'm apeaceq).

nkzawa commented 9 years ago

I think I could reproduce the issue on my laptop; both the server and the client were running on it. I blocked the connection using pfctl, similar to what @apeace did with iptables.

I'm still not sure, but I suspect the cause is that engine.io doesn't close the TCP socket when it detects a ping timeout.

https://github.com/socketio/engine.io/blob/master/lib/socket.js#L130

As far as I can tell, the dead connection is removed if I call this.transport.close() there (you have to wait an additional 30 seconds until the socket is actually cleaned up).

apeace commented 9 years ago

Hey all, thanks to @nkzawa pointing me in the right direction, I think I've found a fix! Check out the above PR.

I'll be testing it in my production environment and reporting back on that PR. I'd encourage you all to help me test however you are able! @whifflevcq @calzoneman @pist @nicofff

apeace commented 9 years ago

Sorry @nkzawa, I just realized you gave me the solution right there. Yes, calling transport.close() was all that was needed :)

Thank you again for the fix!

calzoneman commented 9 years ago

I'll give it a shot

whifflevcq commented 9 years ago

Thanks @apeace @calzoneman @nkzawa. I'll stop researching this issue now. I'm still developing my project, and thanks to the community for helping me solve this issue. I'm very happy when more people participate in solving an issue. I will test and give feedback.

nicofff commented 9 years ago

Just deployed this fix to prod. So far it looks good with ~1000 users. I'll report back in a couple of days.

ghost commented 9 years ago

I confirm that after applying the fix, the problem ceased to exist. Thank you very much!

apeace commented 9 years ago

Update on this, the PR has been merged!

I'll close this issue once I see a new socket.io version that includes this change.

barisusakli commented 9 years ago

@apeace Nice looking forward to the new version.

marco-aa commented 8 years ago

Following nkzawa's comment from 6 Aug, I added the "patch" this.transport.close() to engine.io, and Node.js and Socket.IO are now running stably (no restart every 6 hours ;) ). In the meantime we reinstalled our service on a new server with socket.io@1.3.7, express@4.13.3 and Node v0.10.36, including the patch from nkzawa; it has been running fine for 2.5 weeks now.

I hope my feedback helps :) Thanks for your help and work! From my point of view, my request is closed ;)

nkzawa commented 8 years ago

Should be fixed now, thanks for your feedback!!!

dmr07 commented 7 years ago

Hey guys, I'm using v1.7.2. I added self.transport.close(); (see below) to engine.io, but I'm still not getting above 1000 connections. I'm doing a console.log() on every connection, and it gets stuck at 259 connections every time.

Does anyone know what might be causing this?

Socket.prototype.setPingTimeout = function () {
    var self = this;
    clearTimeout(self.pingTimeoutTimer);
    self.pingTimeoutTimer = setTimeout(function () {
        self.onClose('ping timeout');
        self.transport.close(); // This is the added line
    }, self.server.pingInterval + self.server.pingTimeout);
};

marco-aa commented 7 years ago

Maybe have a look here: https://github.com/socketio/socket.io/issues/1393, and check the ulimit of the shell that is running your node/socket.io server.

dmr07 commented 7 years ago

Thanks. I added ulimit -n 8192. It's no longer getting stuck at 259, but something else is happening:

I have a counter that gets incremented every time a new connection is detected and decremented when a connection disconnects. As I ran a loop to create 1,000 new connections, the server logged the counter as high as 4,100, and it is in constant flux, ranging roughly between 300 and 2,000.

Am I counting incorrectly or doing something else wrong? Is there a more advisable way to count the number of concurrent connections?

Front-end Loop:

var i = 1;
function myLoop () {
   setTimeout(function () {
      socket.params = {
        'path'      : '/api/socket'
      , 'forceNew'  : true
      , 'transport' : ['websocket']
      , 'secure'    : true
      };
      socket.url.protocol = 'https';
      socket.url.domain   = 'ws.domain.com';
      socket.url.full     = socket.url.protocol + '://' + socket.url.domain;
      socket.connection  = io.connect(socket.url.full, socket.params);
      console.log(i)
      i++;
      if (i < 1000) {
         myLoop();
      }
   }, 80);
}
myLoop();

Socket Counter:

var socket = require('socket.io'); // assumed import; the require is not shown in the original snippet
var connectCounter = 0;
api.listen = function(server) {
  const io = socket.listen(server);
  api.io = io;

  io.on('connection', function(socket) {
    connectCounter++;
    console.log('Sockets counter: ' + connectCounter);

    // code omitted

    // disconnect
    socket.on('disconnect', function(){
      connectCounter--;
      console.log('Sockets counter: ' + connectCounter);
    }); // end - disconnect
  }) // end - socket.io connection

  return this.io;
}; // end - listen
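
For comparison, a sketch of reading the count directly from engine.io instead of keeping a manual counter (assuming socket.io 1.x and the io returned by socket.listen(server) above):

io.on('connection', function (socket) {
  console.log('clients connected: ' + io.engine.clientsCount);
  socket.on('disconnect', function () {
    // clientsCount is maintained by engine.io itself
    console.log('clients connected: ' + io.engine.clientsCount);
  });
});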

Node 7.4.0, Socket.IO 1.7.2, System: Ubuntu 16.04, AWS EC2 t2.micro, single Node instance

dmr07 commented 7 years ago

I just ran the test a few more times. It's still dropping connections at around 300. I wasn't able to replicate my issue above. Should I revert to an older version?

For people able to get over 1,000 concurrent connections, what version are you using?

I'm currently running this on AWS EC2 T2 Micro.