socketio / socket.io

Realtime application framework (Node.JS server)
https://socket.io
MIT License

Socket.io connects and disconnects only for some clients, resulting in packet loss / unreliability #4298

Open aaronkchsu opened 2 years ago

aaronkchsu commented 2 years ago

Describe the bug: We have a Node.js WebSocket server with 3k+ concurrent connections. A small segment of clients disconnects and reconnects every few seconds or minutes.

On connection, each socket joins a room:

        roomsOnline.add(socket.handshake.query.roomId);
        socket.join(socket.handshake.query.roomId);

We use AWS ELB and their support says the load balancer has no problem.

image

On my machine I stay connected with the exact same roomId.

However, an affected client's connection looks like this on the server end, where it repeatedly disconnects and reconnects. When we try to send that client a message using io.to(roomId).emit(), it receives only a few of the messages instead of every message emitted.

image

We tested one of the affected clients with a different browser source from another WebSocket app, and the messages reached his computer every time. The same client also had high-speed Spectrum internet.

To Reproduce

Our server settings - version 4.4.1

    allowEIO3: true, // false by default
    pingInterval: 9000,
    pingTimeout: 15000,

Server


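const fs = require("fs"); // used to read the TLS certificate and key

// `app` is assumed to be the request handler (for example an Express app) defined elsewhere
const server = process.env.USE_HTTPS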
    ? require("https").createServer(
          {
              cert: fs.readFileSync(process.env.TLS_CERT),
              key: fs.readFileSync(process.env.TLS_KEY),
          },
          app,
      )
    : require("http").createServer(app);

const { Server } = require("socket.io");

// https://stackoverflow.com/questions/48648555/understanding-socket-io-ping-interval-timeout-settings
const io = new Server(server, {
    // transports: ["polling", "websocket"],
    allowEIO3: true, // false by default
    pingInterval: 9000,
    pingTimeout: 15000,

    cors: {
        origin: "*",
        methods: [
            "GET",
            "POST",
            "OPTIONS",
            "HEAD",
            "PUT",
            "PATCH",
            "DELETE",
        ],
    },
});

Socket.IO client version: x.y.z

Client

    <script src="/socket.io/socket.io.js"></script>

    let socket = io({
        query: {
            roomId: last_part
        }
    });

Expected behavior: All clients stay connected.

Platform:

Additional context: I added a disconnect protocol (a heartbeat check), which helped some international clients, but a few clients still see this behavior.

Server

    let disconnectCheck = setInterval(() => {
        socket
            .timeout(16000)
            .emit(
                "heart_beat_check_server",
                "beating",
                { running: new Date() },
                (err, payload) => {
                    console.log(
                        "HEART_BEAT_PAYLOAD",
                        payload,
                        socket.handshake.query.roomId,
                    );
                    // Note: when the acknowledgement times out, `err` is set and
                    // `payload` is undefined, so this condition is never true as written.
                    if (
                        err &&
                        payload &&
                        payload.version &&
                        payload.version >= 3
                    ) {
                        // the client did not acknowledge the event in the given delay
                        console.error(
                            "HEARTBEAT_ERROR_TIMEOUT",
                            socket.handshake.query.roomId,
                        );
                        socket.disconnect(true);
                    }
                },
            );
    }, 17000);
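
For the server-side check above, it may help to separate the timeout case from the payload checks. With `socket.timeout(ms).emit(...)`, the callback receives an `Error` and no payload when the acknowledgement does not arrive in time. The following is only a sketch: `classifyHeartbeatAck` is a hypothetical helper, and reading `version >= 3` as "client supports the heartbeat" is an assumption based on the original condition.

```javascript
// Hypothetical helper: classify the result passed to the callback of
// socket.timeout(ms).emit(event, ..., (err, payload) => { ... }).
function classifyHeartbeatAck(err, payload) {
    if (err) {
        // No acknowledgement arrived in time; `payload` is undefined here.
        return "timeout";
    }
    if (payload && payload.version >= 3) {
        // Acknowledged by a client that (by assumption) supports the check.
        return "ok";
    }
    // Acknowledged, but without a usable version field.
    return "unsupported";
}

// Possible use inside the interval callback (not executed here):
// socket.timeout(16000).emit("heart_beat_check_server", "beating",
//     { running: new Date() },
//     (err, payload) => {
//         if (classifyHeartbeatAck(err, payload) === "timeout") {
//             socket.disconnect(true);
//         }
//     });
```

Keeping the decision in a plain function makes it testable without a live socket.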

Client

    socket.on("heart_beat_check_server", (_, __, callback) => {
        console.error("heart_beat_check_server_client");
        callback({ roomdId: last_part, version: browserVersion });
    });

    socket.on("disconnect", (reason) => {
        console.error("DISCONNECTED ", reason);
        if (
            reason === "io server disconnect" ||
            reason === "io client disconnect" ||
            reason === "ping timeout"
        ) {
            // the disconnection was initiated by the server, you need to reconnect manually
            socket.connect();
        }
        // else the socket will automatically try to reconnect
    });

darrachequesne commented 2 years ago

That sounds weird indeed.

Do you know the reason of the disconnection on the client side? Any error thrown?

Is this specific to a given browser?

aaronkchsu commented 2 years ago

It's not specific to one browser; it happens on all of their browsers, which leads me to believe it's a network issue?

We get a ping timeout, but it disappears after a few occurrences, which makes me think the client might believe it's connected when it's not?

For other clients we have seen a transport close error after a few hours, with no attempt to reconnect. Is that normal, or is there a way to fix or debug it?

Thanks so much @darrachequesne

matiaslopezd commented 2 years ago

+1! We're dealing with the same thing, but we don't know whether the cause is socket.io and/or the tab throttling feature in browsers. (There may be a bug in Chromium-based browsers that drops WebSocket connections in long-running sessions.)

@aaronkchsu ping me, maybe we can gather info around this and share with @darrachequesne.

The most common error we detect is transport close, and the socket.io-client does not reconnect automatically.
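
To narrow this down, it can help to log every disconnect reason together with the manager's reconnection events. As far as the socket.io-client v4 documentation describes it, only "io server disconnect" and "io client disconnect" skip automatic reconnection; "transport close" should normally trigger it. The sketch below assumes a v4 client; `needsManualReconnect` and `attachReconnectLogging` are hypothetical names.

```javascript
// Disconnect reasons after which the v4 client does not reconnect by itself.
const MANUAL_RECONNECT_REASONS = new Set([
    "io server disconnect",
    "io client disconnect",
]);

function needsManualReconnect(reason) {
    return MANUAL_RECONNECT_REASONS.has(reason);
}

// Hypothetical wiring: log what happens around a disconnect.
function attachReconnectLogging(socket, log = console.log) {
    socket.on("disconnect", (reason) => {
        log("disconnect", reason, "manual reconnect needed:", needsManualReconnect(reason));
        // Only reconnect when the server closed the socket; after
        // "io client disconnect" the app itself asked to disconnect.
        if (reason === "io server disconnect") {
            socket.connect();
        }
    });
    socket.on("connect_error", (err) => log("connect_error", err.message));
    // Reconnection events are emitted by the Manager (socket.io).
    socket.io.on("reconnect_attempt", (attempt) => log("reconnect_attempt", attempt));
    socket.io.on("reconnect_failed", () => log("reconnect_failed"));
}
```

If no `reconnect_attempt` is ever logged after a "transport close", the client options (for example `reconnection: false`) would be a reasonable next thing to check.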

image

hmeerlo commented 2 years ago

I'm also experiencing a similar issue where clients occasionally report a 'ping timeout' and the server side reports the corresponding transport close error. I debugged this by running tcpdump on the client side. What I observed is that the server normally sends the PING packet right on schedule every 30s (my pingInterval; pingTimeout is 10s). But when it fails, the server has sent the PING packet too late (after 42s, more than pingInterval + pingTimeout). So it smells like a server-side issue to me.
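
The timing above can be checked with a little arithmetic. This is a minimal sketch, assuming the Engine.IO v4 heartbeat, where the server sends PING and the client closes with 'ping timeout' if none arrives within pingInterval + pingTimeout. The function names are hypothetical.

```javascript
// Longest gap between two PING packets that the client tolerates.
function clientPingDeadlineMs(pingIntervalMs, pingTimeoutMs) {
    return pingIntervalMs + pingTimeoutMs;
}

// True when the observed gap between PINGs exceeds the client's deadline.
function wouldTimeOut(gapMs, pingIntervalMs, pingTimeoutMs) {
    return gapMs > clientPingDeadlineMs(pingIntervalMs, pingTimeoutMs);
}

console.log(clientPingDeadlineMs(30000, 10000)); // 40000
console.log(wouldTimeOut(42000, 30000, 10000)); // true
console.log(wouldTimeOut(30000, 30000, 10000)); // false
```

With the values from the tcpdump, a 42s gap is past the 40s deadline, which matches the 'ping timeout' seen on the client.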

johnfilo-kmart commented 2 years ago

I have a similar problem with a Flask-SocketIO server app hosted behind an AWS ALB, and the client socket session seems to disconnect consistently every 26 seconds.

    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/site-packages/eventlet/wsgi.py", line 573, in handle_one_response
        result = self.application(self.environ, start_response)
      File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2464, in __call__
        return self.wsgi_app(environ, start_response)
      File "/usr/local/lib/python3.8/site-packages/flask_socketio/__init__.py", line 45, in __call__
        return super(_SocketIOMiddleware, self).__call__(environ,
      File "/usr/local/lib/python3.8/site-packages/engineio/middleware.py", line 60, in __call__
        return self.engineio_app.handle_request(environ, start_response)
      File "/usr/local/lib/python3.8/site-packages/socketio/server.py", line 560, in handle_request
        return self.eio.handle_request(environ, start_response)
      File "/usr/local/lib/python3.8/site-packages/engineio/server.py", line 374, in handle_request
        socket = self._get_socket(sid)
      File "/usr/local/lib/python3.8/site-packages/engineio/server.py", line 565, in _get_socket
        raise KeyError('Session is disconnected')
    KeyError: 'Session is disconnected'

Client requests use /socket.io/?EIO=3&transport=polling&t=O9hzOSn&sid=xxxx

Server version of SocketIO is 4.6.0

matiaslopezd commented 2 years ago

This could be related to the timeout of the TCP connections in the proxy servers or balancers in front of the app. We increased the TCP timeout and the stability of the connection was improved, but not 100%.

In fact, the ping/pong heartbeat is designed not only to check the connection between server and client but also to keep the connection alive and avoid proxy timeouts, which default to 60 seconds on many proxies. So, check that the ping interval is lower than the proxy's timeout value.

Nginx timeout, AWS Load Balancer.

Also, check this post: https://blog.martinfjordvald.com/websockets-in-nginx.
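
The check described above is a simple comparison. A minimal sketch, assuming an idle timeout of 60 seconds on the proxy; the function name and the 5-second safety margin are hypothetical choices, not values from any proxy's documentation:

```javascript
// True when a heartbeat is sent comfortably before the proxy's idle timeout.
// marginMs is an assumed safety buffer for network and scheduling delays.
function heartbeatBeatsIdleTimeout(pingIntervalMs, idleTimeoutMs, marginMs = 5000) {
    return pingIntervalMs + marginMs <= idleTimeoutMs;
}

console.log(heartbeatBeatsIdleTimeout(9000, 60000)); // true
console.log(heartbeatBeatsIdleTimeout(30000, 60000)); // true
console.log(heartbeatBeatsIdleTimeout(30000, 30000)); // false
```

The first two calls use the ping intervals reported earlier in this thread (9s and 30s) against a 60s idle timeout. The last shows a hypothetical proxy whose idle timeout equals the ping interval, which leaves no headroom.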

johnfilo-kmart commented 2 years ago

The idle timeout was set to 30 seconds on the load balancer in AWS. I increased it to 90 seconds this morning to see whether it played a role in any way, and the consistent ~26 second session disconnects remained unchanged.

johnfilo-kmart commented 2 years ago

As far as how SocketIO is configured in this server app, here is the code (it's quite vanilla):

    #!/usr/bin/python
    # coding: utf-8

    from flask import Flask
    from flask_socketio import SocketIO
    import os

    from Config import ServerConfig

    from main.Utils import ServerLogger

    socketio = SocketIO()

    gLogger = ServerLogger()

    from dbinterface.DatabaseInterface import DatabaseInterface
    gDatabaseInterface = DatabaseInterface()

    from settings.Settings import SettingsManager
    gSettingsManager = SettingsManager()

    from robots.RobotsManager import RobotsManager
    gRobotsManager = RobotsManager()

    from users.UserManager import UserManager
    gUserManager = UserManager()

    def create_app(config_class=ServerConfig):
        app = Flask(__name__)
        app.config.from_object(config_class)
        app.config.from_envvar("SERVER_CONFIG_FILE")
        if not os.path.exists(app.config["LOG_PATH"]):
            os.makedirs(app.config["LOG_PATH"])
        gLogger.setLogPath(app.config["LOG_PATH"])
        gLogger.log("Creating App")
        socketio.init_app(app)
        registerBlueprints(app)
        gDatabaseInterface.init_interface(app)
        gRobotsManager.loadRobots()
        gUserManager.loadUsers()
        gUserManager.setSecretKey(app.config["SECRET_KEY"])
        gSettingsManager.loadSettings()
        gLogger.log("Init done")
        return app

johnfilo-kmart commented 2 years ago

And for the client:

    let ServiceModule = angular.module('ServiceModule', []);
    ServiceModule.service('Socket', function($rootScope){
        var socket = io.connect();
        return {
            on: function(eventName, callback) {
                socket.on(eventName, function() {
                    var args = arguments;
                    $rootScope.$apply(function() {
                        callback.apply(socket, args);
                    });
                });
            },
            emit: function(eventName, data, callback) {
                if(typeof data == 'function') {
                    callback = data;
                    data = {};
                }
                socket.emit(eventName, data, function() {
                    var args = arguments;
                    $rootScope.$apply(function() {
                        if(callback) {
                            callback.apply(socket, args);
                        }
                    });
                });
            },
            emitAndListen: function(eventName, data, callback) {
                this.emit(eventName, data, callback);
                this.on(eventName, callback);
            }
        };
    });

matiaslopezd commented 2 years ago

@johnfilo-kmart Have you considered the wake-up throttling in Chrome? In fact, there is an active bug for it; see: https://bugs.chromium.org/p/chromium/issues/detail?id=1224672&q=websocket&can=2

You need to apply a technique to avoid this.
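
Before picking a workaround, it may be worth confirming that throttling is actually happening on the affected clients. The following is only a diagnostic sketch, not the technique referred to above: it measures how late a repeating timer fires. All names and the 1-second tolerance are hypothetical.

```javascript
// How much later than expected a timer tick arrived, in milliseconds.
function timerDriftMs(expectedIntervalMs, previousTickMs, currentTickMs) {
    return (currentTickMs - previousTickMs) - expectedIntervalMs;
}

// Assumed threshold: treat more than toleranceMs of lateness as throttling.
function looksThrottled(driftMs, toleranceMs = 1000) {
    return driftMs > toleranceMs;
}

// Starts a repeating timer and reports ticks that arrive late.
// Returns the interval id so the caller can clearInterval() it.
function startDriftMonitor(intervalMs, onDrift, now = Date.now) {
    let last = now();
    return setInterval(() => {
        const current = now();
        const drift = timerDriftMs(intervalMs, last, current);
        last = current;
        if (looksThrottled(drift)) {
            onDrift(drift);
        }
    }, intervalMs);
}
```

A client could send the reported drift to the server when it reconnects, so disconnects can be correlated with background tabs.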

johnfilo-kmart commented 2 years ago

Ooh, no I didn't know about that. I'll look into it.