dhardtke opened this issue 1 year ago
Hi! I think the problem is that during the rolling upgrade, the server that sends a request sees more servers than expected, so the request eventually times out. See here.
You could play with the values of heartbeatInterval (5s by default) and heartbeatTimeout (10s) to reduce the size of the window, so that the deleted pods are removed more quickly. The failure window will still exist though, so you will certainly need to retry the request.
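For illustration, a minimal sketch of passing these options to the adapter (the connection string, database/collection names and the concrete values below are placeholders, not a recommendation):

```ts
import { MongoClient } from "mongodb";
import { Server } from "socket.io";
import { createAdapter } from "@socket.io/mongo-adapter";

// Placeholder connection string, database and collection names.
const mongoClient = new MongoClient("mongodb://localhost:27017/?replicaSet=rs0");
await mongoClient.connect();
const collection = mongoClient.db("mydb").collection("socket.io-adapter-events");

const io = new Server(3000);

io.adapter(
  createAdapter(collection, {
    heartbeatInterval: 2000, // default: 5000 ms
    heartbeatTimeout: 4000,  // default: 10000 ms
  })
);
```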
Now, should the retry mechanism be implemented in the library itself? I'm open to discussing that.
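As a rough idea of what an application-level retry could look like in the meantime (withRetry is a hypothetical helper, not part of socket.io or the adapter; fetchSockets() is just one example of a call that needs answers from the other servers):

```ts
import type { Server } from "socket.io";

// "io" is assumed to be the Server instance configured with the Mongo adapter.
declare const io: Server;

// Hypothetical helper, not part of socket.io or the adapter.
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 500
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Back off a little before the next attempt.
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
  throw lastError;
}

// Example usage: fetchSockets() fans out to the other servers and is the kind
// of call that times out during the failure window described above.
const sockets = await withRetry(() => io.fetchSockets());
```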
Thank you. We are already running the backends with a heartbeatInterval of 2.5s and a heartbeatTimeout of 5s. It still happens, though, and our primary concern is that it also happens during regular operation of the backends (i.e., even when no rolling upgrade occurs).
And what's even stranger is that the backends do not recover from this, so they are unable to hold any socket connection open; we then have to restart them manually.
In my investigation, when such a timeout happens it seems that the other backends did answer correctly, but the requesting backend did not receive the inserted document in time (even with a capped collection size of 1 GB).
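(For reference, a capped collection of that kind would typically be created along these lines; the connection string and collection name are placeholders, and the 1 GB size only mirrors the figure mentioned above:)

```ts
import { MongoClient } from "mongodb";

// Placeholder connection string and collection name.
const client = new MongoClient("mongodb://localhost:27017/?replicaSet=rs0");
await client.connect();

// The adapter stores its events in a capped collection; "size" is the
// maximum size of the collection in bytes (~1 GB here).
await client.db("mydb").createCollection("socket.io-adapter-events", {
  capped: true,
  size: 1e9,
});

await client.close();
```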
> And what's even stranger is that the backends do not recover from this, so they are unable to hold any socket connection open; we then have to restart them manually.
That's weird indeed, as the adapter part should be rather independent from the connection handling. I will try to reproduce the issue locally.
Do you know how many documents are inserted per second? What is the state of the change stream, according to the logs?
Hi!
We are very grateful that we were able to migrate from the Redis adapter to this adapter, since Mongo is our DB anyway and it lets us get rid of one dependency.
However, we are currently investigating an issue where our backends behave very strangely after a so-called rolling update in our K8s cluster (new backend instances are spawned, and the old ones are shut down once the new ones are ready): some backends are then unable to deliver any socket messages to clients connected to other backends. Unfortunately, for a couple of days now it has also been happening without any deployments.
What we observed
It seems like the backends are sending heartbeat signals (though we could not see them in the DB collection because our collection size was quite small: 1 MB), so with 6 backends every socket message requires 5 other backends to respond. However, even after tweaking the requestsTimeout, we still see timeout reached: only 0 responses received out of 5 (or 4/5) in our logs. And today we noticed the message kept showing up even though the backends themselves were all running perfectly fine.
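A sketch of the kind of tweaking mentioned above (assuming requestsTimeout is passed as an adapter option; collection setup omitted, values purely illustrative):

```ts
import type { Collection } from "mongodb";
import { Server } from "socket.io";
import { createAdapter } from "@socket.io/mongo-adapter";

// "collection" is assumed to be the capped adapter collection shown earlier.
declare const collection: Collection;

const io = new Server(3000);

io.adapter(
  createAdapter(collection, {
    // How long a server waits for answers from the other servers before
    // logging "timeout reached: only X responses received out of Y".
    requestsTimeout: 10_000, // illustrative value, in milliseconds
  })
);
```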
Questions
I am not sure there is an issue in the adapter, so this is not just a potential bug report but also a request for a pointer in the right direction. We (3 senior devs) have been speculating and pondering about this all day and have not come up with a solution (yet). Thank you :)