socketio / socket.io-sticky

A simple and performant way to use Socket.IO within a cluster.
https://socket.io

HTTP 1.1 connection reuse breaking long-polling #8

Open JoniVanderheijden opened 2 years ago

JoniVanderheijden commented 2 years ago

Hello! We've been trying to integrate this library into our code base. We use Socket.IO quite heavily, and balancing traffic based on the sid, combined with not using the master node process as a proxy, sounded like a very appealing idea.

We've hit a couple of roadblocks along the way, and I'm wondering whether it's just down to how we've integrated it. I wanted to share the issues I struggled with and seek some guidance from others who have looked at this before. I'm on the verge of going back to proxying requests in the master and balancing based on the sid that way.

  1. I couldn't use the setupWorker function out of the box. We have multiple Socket.IO servers attached to our HTTP server (to cater for multiple paths), and if I run setupWorker multiple times it attaches multiple process message listeners. I had to split the function into two parts: the Engine.IO connection listener (which I attached to each Socket.IO server) and the process listener, which I made sure was attached only once (see the sketch after this list).

  2. We have some serious issues with long-polling. From what I understand, browsers reuse TCP connections under the hood with HTTP/1.1, and that breaks long-polling here. For example, the handshake can come in on Worker 1, which 'claims' the sid because its Engine.IO connection event fires first. But subsequent requests may be sent over another connection and routed to a completely different worker, which causes the dreaded "Session ID unknown" error. The mechanism of transferring connections to a worker doesn't align well with how connections are used in an HTTP/1.1 context: polling requests can end up on any open connection, independent of which worker that connection was handed to. An important consequence is that sticky sessions also persist longer than they have to (longer than the lifetime of the Socket.IO socket): a new socket often gets "claimed" by the same worker, because its handshake tends to reuse the open connection of the old socket and hits the worker directly, without passing through the master. That essentially means all requests for a user are very likely to end up on a single worker.

  3. Given the above, I'm a bit skeptical about the least-connections load-balancing method: it counts Engine.IO connections, which don't seem to correspond directly to the actual open connections on the HTTP server. Should that be switched to counting connections through event listeners on the HTTP server instead, or are Engine.IO connections a good approximation of the actual connections to workers?
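
Regarding the first point, the split looks roughly like this (a simplified sketch mirroring the library's internal "sticky:connection" protocol; the two server instances and their paths are made up for illustration):

```js
const { createServer } = require("http");
const { Server } = require("socket.io");

const httpServer = createServer();
// Illustrative: two Socket.IO servers on distinct paths.
const chatIo = new Server(httpServer, { path: "/chat/socket.io/" });
const adminIo = new Server(httpServer, { path: "/admin/socket.io/" });

// Part 1: attached exactly once per worker process. Accept the TCP
// sockets the master transfers over IPC, inject them into the HTTP
// server, and replay the first chunk the master already read.
process.on("message", ({ type, data }, socket) => {
  if (type !== "sticky:connection" || !socket) return;
  httpServer.emit("connection", socket);
  if (socket._handle.onread.length === 1) {
    socket._handle.onread(Buffer.from(data));
  } else {
    socket._handle.onread(1, Buffer.from(data));
  }
  socket.resume();
});

// Part 2: attached once per Socket.IO server. Report the claimed sid
// to the master so it routes subsequent requests for it to this worker.
for (const io of [chatIo, adminIo]) {
  io.engine.on("connection", (socket) => {
    process.send({ type: "sticky:connection", data: socket.id });
  });
}

// Note: the worker never calls httpServer.listen(); the master owns the
// listening socket and transfers accepted connections to the workers.
```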

darrachequesne commented 2 years ago

Hi! Sorry for the delay.

we have multiple Socket IO servers attached to our HTTP server

This is indeed a use case that is not currently covered by the library. We intercept the connection and its first frame, but the request path is not taken into account:

https://github.com/socketio/socket.io-sticky/blob/56fc57281479ee2895de33eb30c4e5d8cf1722df/index.js#L59-L74

That being said, I guess we could check if the path of the Socket.IO server matches the request path (in the setupWorker() method). What do you think?
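
Something along these lines, perhaps (a rough sketch of the idea; matchesServerPath is a hypothetical helper, and the first chunk forwarded by the master is assumed to start with the HTTP request line):

```js
// Hypothetical helper: only claim the transferred connection if the
// request targets this Socket.IO server's path (io.path(), which
// defaults to "/socket.io/").
function matchesServerPath(io, firstChunk) {
  // e.g. "GET /chat/socket.io/?EIO=4&transport=polling HTTP/1.1"
  const requestLine = firstChunk.toString("utf8").split("\r\n")[0];
  const requestPath = requestLine.split(" ")[1] || "";
  return requestPath.startsWith(io.path());
}

// Inside setupWorker(io), each process listener would then skip
// connections meant for another Socket.IO server:
process.on("message", ({ type, data }, socket) => {
  if (type !== "sticky:connection" || !socket) return;
  if (!matchesServerPath(io, Buffer.from(data))) return; // not ours
  // ... inject the connection as today
});
```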

But it could be that requests are sent over another connection and are routed to a completely different worker.

Hmm, that should not happen. Are you able to reproduce this behavior?

The worker should send the sid to the master, which stores the relationship between the worker IDs and the session IDs:

https://github.com/socketio/socket.io-sticky/blob/56fc57281479ee2895de33eb30c4e5d8cf1722df/index.js#L76-L91
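
In simplified form, the bookkeeping looks like this (a sketch with illustrative names, not the exact internals):

```js
const cluster = require("cluster");

// Sketch of the master-side bookkeeping: map each session id to the
// worker that claimed it, and route later requests accordingly.
const sidToWorker = new Map();

cluster.on("message", (worker, { type, data }) => {
  if (type === "sticky:connection") {
    sidToWorker.set(data, worker); // data is the Engine.IO session id
  }
});

// The master peeks at the first HTTP chunk of each new connection,
// extracts the ?sid=... query parameter, and picks the target worker:
function selectWorker(sid) {
  // falls back to the configured load-balancing method when there is
  // no sid yet (i.e. on the handshake request)
  return sidToWorker.get(sid);
}
```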

JoniVanderheijden commented 2 years ago

That being said, I guess we could check if the path of the Socket.IO server matches the request path (in the setupWorker() method). What do you think?

That's likely not going to work: we really want to attach the process listener only once, and the Engine.IO listeners once per Socket.IO server. I split setupWorker into two parts for our integration.

Hmm, that should not happen. Are you able to reproduce this behavior? The worker should send the sid to the master, which stores the relationship between the worker IDs and the session IDs

I can. Just to be clear, the worker is sending the sid to the master, and the master is storing that relationship. The issue is browser behaviour: TCP connections are reused, and because the master transfers the TCP socket to the worker, requests don't go through the master at all; they go straight to a worker. That's a problem when sids change often. You get sporadic requests ending up on the wrong worker, and those are pretty common when you have a lot of long-polling.
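
The sequence can be sketched with a plain Node.js client (a hypothetical illustration, not our actual setup) that forces two handshakes onto one keep-alive connection pointed at the sticky master on port 3000:

```js
const http = require("http");

// A single keep-alive socket, like the ones browsers reuse under HTTP/1.1.
const agent = new http.Agent({ keepAlive: true, maxSockets: 1 });

function poll(path) {
  return new Promise((resolve, reject) => {
    http
      .get({ port: 3000, path, agent }, (res) => {
        res.resume();
        res.on("end", resolve);
      })
      .on("error", reject);
  });
}

(async () => {
  // Handshake 1: the master peeks at this connection, picks a worker
  // (say Worker 1), and transfers the TCP socket to it.
  await poll("/socket.io/?EIO=4&transport=polling");
  // Handshake 2 reuses the same socket, so it reaches Worker 1 directly
  // and never passes through the master's balancing. Conversely, a poll
  // for a sid owned by another worker that goes out over this socket
  // also lands on Worker 1 and fails with "Session ID unknown".
  await poll("/socket.io/?EIO=4&transport=polling");
})();
```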

I'll try to create a minimal reproducible repo if I get some time. I was thinking about how I could change this library to make it work, but I don't see a solution that doesn't come down to either going back to proxying all requests in the master node or dropping support for long-polling entirely.