socketio / socket.io

Realtime application framework (Node.JS server)
https://socket.io
MIT License
61.17k stars, 10.11k forks

Server-Side Event "connection" not fired/called on Client-Connection (using wild-card namespaces) #4677

Open s3cc0 opened 1 year ago

s3cc0 commented 1 year ago

Describe the bug We use socket.io on Google Cloud Run with the redis-adapter to exchange data across a Node cluster and multiple containers. The challenge is that Google Cloud Run kills connections every 60 min. We do not use the sticky extension because we use WebSockets only and do not support polling. When the containers scale, in either direction, it happens from time to time that the client gets a socket connection but the "connection" event is never fired on the server. As a result, none of the other event listeners on the socket are registered, so they do not work.

Important: We use Namespaces and Channels

To Reproduce

  1. create a system with multiple servers and node-cluster
  2. use redis-adapter for exchange between all nodes
  3. just connect with the clients, simulate disconnects, or wait for normal disconnect from server (tcp)
  4. sometimes it works, sometimes it does not: the "connection" event is not fired
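
Step 3 ("simulate disconnects") could be sketched on the client roughly as follows. This is only an illustration, not the reporter's actual test harness: `dropOnce` and `simulateDrops` are hypothetical helper names, and the 10-second interval is an arbitrary assumption. It relies on `socket.io.engine.close()`, which closes the underlying Engine.IO connection so that the client goes through its automatic reconnect path.

```javascript
// Hypothetical helper: close the low-level connection once, if connected.
// Unlike socket.disconnect(), closing the engine leaves automatic
// reconnection enabled, so the client will try to reconnect on its own.
function dropOnce(socket) {
  if (!socket.connected) {
    return false; // nothing to drop right now
  }
  socket.io.engine.close();
  return true;
}

// Hypothetical helper: drop the connection every `intervalMs` milliseconds.
// Returns a function that stops the simulation.
function simulateDrops(socket, intervalMs = 10000) {
  const timer = setInterval(() => dropOnce(socket), intervalMs);
  return () => clearInterval(timer);
}

// Usage (assumes a client created as in the "Client" example):
//   const stop = simulateDrops(socket, 10000);
//   ...later: stop();
```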

Versions (frontend and backend): "@socket.io/admin-ui": "^0.5.1", "@socket.io/redis-adapter": "^8.1.0", "redis": "^4.6.5", "socket.io": "^4.6.1", "socket.io-client": "^4.6.1"

Redis Server 6.x+ (issue also with redis server 4.x+)

Server Code Example

import { Server } from "socket.io";
import { createAdapter } from "@socket.io/redis-adapter";
/* ... */
// initialize socket io
const io = new Server(app.http, {
    noServer: true,
    cors: {
        origin: "*",
        methods: ["GET", "POST"],
        credentials: true
    }
});
const redisPub = activeRedisClient.duplicate();
await redisPub.connect();
await redisPub.ping();
const redisSub = activeRedisClient.duplicate();
await redisSub.connect();
await redisSub.ping();
io.adapter(createAdapter(redisPub, redisSub, {
    requestsTimeout: 3000
}));

const dynamicNamespace = io.of(async (name, auth, next) => {
    next(null, true);
});
dynamicNamespace.use(async (socket, next) => { next(null); });

dynamicNamespace.on('connection', async (socket) => {
    // this handler is sometimes never called
    socket.on('disconnect', async () => {
        // this listener is sometimes never registered / called
    });

    socket.on('message', async () => {
        // this listener is sometimes never registered / called
    });
});

Client

import { io } from "socket.io-client";

const socket = io("ws://localhost:3000/", {
    transports: ['websocket'],
    auth: {
        token: 'jwt-token',
    },
});

socket.on("connect", () => {
  console.log(`connect ${socket.id}`);
});

socket.on("disconnect", () => {
  console.log("disconnect");
});

Expected behavior The "connection" event should always be fired on the server. That is not the case, so whatever the clients do, they cannot enter the normal program flow. The socket connection nevertheless remains open, so there is a live socket connection for which "connection" was never fired on the server.

Platform:

Additional context Important: the socket.io server runs on Google Cloud Run (Docker container) and scales up/down with the traffic. We had 250 connections on one container; scaling is triggered at 150 open requests.

darrachequesne commented 1 year ago

Thanks for the detailed write-up :+1:

I'm not sure how this could happen though. The multi-node setup with Redis does not seem related, as the adapter is not called during the connection.

How do you detect this kind of issue? From the client side?

Related: https://github.com/socketio/socket.io/issues/4015

s3cc0 commented 1 year ago

Hello, thanks for the quick feedback and the related bug. I didn't see it in my research, sorry!

We found the bug when scaling containers up and then down again. It also shows up after Google Cloud Run forcibly drops the TCP connection after 15-60 min. After that, the reconnect from the socket.io client takes effect and a new WebSocket connection is established. However, although this connection is established, the "connection" event is not fired, so no further listeners such as "disconnect", "message" ... are registered on the socket. Because of this, the socket can't join rooms in the namespace. This became visible because the ws(s) connection was successfully created but no messages arrived (only the ping/pong heartbeat). We could confirm this with the socket.io admin UI: the socket was there, but without rooms. I was able to add a room using the admin tool, and then the socket received some messages. So the socket connection itself was correct; only the "connection" event on the server was missing.

Regarding the other bug: it could be something to do with the wildcard namespace.

Conjecture: could it be due to the speed of the reconnect? We have also "throttled" this, without any success.

How did we get it to happen? Here is a simple example:

  1. create server nodes with node cluster and the redis adapter.
  2. put everything in a self-scaling Docker system (Google Cloud Run)
  3. also add the socket.io admin UI to get more details
  4. start a node with three processes (1 master, 2 client nodes) in a container
  5. connect multiple clients to the nodes so that Google Cloud Run has to scale up
  6. wait for the forced disconnect after 15-60 min (can be set up in the env; the TCP socket will be killed) and/or wait for the down-scale of Google Cloud Run.
  7. randomly, a client will get a socket connection, but the "connection" event will not be fired and no channel will be joined anymore.

For everyone with the same issue, here is a current workaround:

  1. Implement a client message that is sent as soon as possible after the WebSocket connection is established
  2. Implement the socket.io callback/acknowledgement for it
  3. if the acknowledgement does not arrive within a defined duration, e.g. 5 seconds, try to reconnect.
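
A minimal client-side sketch of this workaround, under some assumptions: the event name "ready-check" and the helper `installReadyCheck` are hypothetical, and the server is assumed to register a matching listener inside its "connection" handler that simply calls the acknowledgement callback. It uses `socket.timeout(ms).emit(...)`, which is available in recent socket.io 4.x clients and calls back with an error if no acknowledgement arrives in time.

```javascript
// Sketch of the workaround: probe the server right after "connect" and
// force a reconnect if no acknowledgement arrives within `timeoutMs`.
// "ready-check" is a hypothetical event name, not part of socket.io.
function installReadyCheck(socket, { event = "ready-check", timeoutMs = 5000 } = {}) {
  socket.on("connect", () => {
    // The callback receives an Error as first argument when the
    // acknowledgement does not arrive before the timeout.
    socket.timeout(timeoutMs).emit(event, (err) => {
      if (err) {
        // No ack: the server-side "connection" handler probably never ran,
        // so drop this connection and open a fresh one.
        socket.disconnect();
        socket.connect();
      }
    });
  });
}

// Assumed server-side counterpart, inside the "connection" handler:
//   socket.on("ready-check", (ack) => ack());
//
// Usage on the client (socket created as in the "Client" example):
//   installReadyCheck(socket, { timeoutMs: 5000 });
```

Because the server-side listener is only registered inside the "connection" handler, a missing acknowledgement is a direct sign that the handler did not run for this socket.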