dhardtke opened this issue 1 year ago
Hi! I think the problem is that during the rolling upgrade, the server that sends a request sees more servers than expected, so the request eventually times out. See here.
You could play with the values of heartbeatInterval (5s by default) and heartbeatTimeout (10s) to reduce the size of the window, so that the deleted pods are removed more quickly. The failure window will still exist though, so you will certainly need to retry the request.
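For illustration, a minimal sketch of passing these options to the adapter (the connection string, database/collection names and the concrete values below are placeholders, not a recommendation):

```ts
import { MongoClient } from "mongodb";
import { Server } from "socket.io";
import { createAdapter } from "@socket.io/mongo-adapter";

// Placeholder connection string, database and collection names.
const mongoClient = new MongoClient("mongodb://localhost:27017/?replicaSet=rs0");
await mongoClient.connect();
const collection = mongoClient.db("mydb").collection("socket.io-adapter-events");

const io = new Server(3000);

io.adapter(
  createAdapter(collection, {
    heartbeatInterval: 2000, // default: 5000 ms
    heartbeatTimeout: 4000,  // default: 10000 ms
  })
);
```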
Now, should the retry mechanism be implemented in the library itself? I'm open to discussing that.
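As a rough idea of what an application-level retry could look like in the meantime (withRetry is a hypothetical helper, not part of socket.io or the adapter; fetchSockets() is just one example of a call that needs answers from the other servers):

```ts
import type { Server } from "socket.io";

// "io" is assumed to be the Server instance configured with the Mongo adapter.
declare const io: Server;

// Hypothetical helper, not part of socket.io or the adapter.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off a little before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
  throw lastError;
}

// Example usage: fetchSockets() fans out to the other servers and is the kind
// of call that times out during the failure window described above.
const sockets = await withRetry(() => io.fetchSockets());
```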
Thank you. We are already running the backends with a heartbeatInterval of 2.5s and a heartbeatTimeout of 5s. It still happens, though, and our primary concern is that it also happens during regular operation of the backends (i.e., even when no rolling upgrade occurs).
And what's even stranger is that the backends do not recover from this, so they are unable to hold any socket connection open; we then have to restart them manually.
In my investigation, when such a timeout happens it seems that the other backends did answer correctly, but the requesting backend did not receive the inserted document in time (even with a capped collection size of 1 GB).
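(For reference, a capped collection of that kind would typically be created along these lines; the connection string and collection name are placeholders, and the 1 GB size only mirrors the figure mentioned above:)

```ts
import { MongoClient } from "mongodb";

// Placeholder connection string and collection name.
const client = new MongoClient("mongodb://localhost:27017/?replicaSet=rs0");
await client.connect();

// The adapter stores its events in a capped collection; "size" is the
// maximum size of the collection in bytes (~1 GB here).
await client.db("mydb").createCollection("socket.io-adapter-events", {
  capped: true,
  size: 1e9,
});

await client.close();
```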
> And what's even stranger is that the backends do not recover from this, so they are unable to hold any socket connection open; we then have to restart them manually.
That's weird indeed, as the adapter part should be rather independent from the connection handling. I will try to reproduce the issue locally.
Do you know how many documents are inserted per second? What is the state of the change stream, according to the logs?
Hi!
We are very grateful that we were able to migrate from the Redis adapter to this adapter, since Mongo is our DB anyway and it lets us get rid of one dependency.
However, we are currently investigating an issue where our backends behave very strangely after a so-called rolling update in our K8s cluster (new backend instances are spawned, and the old ones are shut down once the new ones are ready): some backends are then unable to deliver any socket messages to clients connected to other backends. Unfortunately, for a couple of days now it has also been happening without any deployments.
What we observed
It seems like the backends are sending heartbeat signals (though we could not see them in the DB collection because our collection size was quite small: 1 MB), so with 6 backends every socket message requires 5 other backends to respond. However, even after tweaking the requestsTimeout, we still see timeout reached: only 0 responses received out of 5 (or 4/5) in our logs. And today we noticed the message kept showing up even though the backends themselves were all running perfectly fine.
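A sketch of the kind of tweaking mentioned above (assuming requestsTimeout is passed as an adapter option; collection setup omitted, values purely illustrative):

```ts
import type { Collection } from "mongodb";
import { Server } from "socket.io";
import { createAdapter } from "@socket.io/mongo-adapter";

// "collection" is assumed to be the capped adapter collection shown earlier.
declare const collection: Collection;

const io = new Server(3000);

io.adapter(
  createAdapter(collection, {
    // How long a server waits for answers from the other servers before
    // logging "timeout reached: only X responses received out of Y".
    requestsTimeout: 10_000, // illustrative value, in milliseconds
  })
);
```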
Questions
I am not sure there is an issue in the adapter, so this is not just a potential bug report but also a request for a pointer in the right direction. We (3 senior devs) have been speculating and pondering about this all day and have not come up with a solution (yet). Thank you :)