Noticed on the affected nodes:
rmb was running version v1.0.2-rc2 (this version had a buggy subtx client)
rmb logs show that it tried to reconnect to rmb for a while.
on the affected nodes, it looks like rmb had an established connection to the relay (hence it wasn't trying to reconnect)
we have a keepalive mechanism, so why did it not detect that the connection was stalled?
sending a message to the affected node does not show any bytes on the socket send/recv queues
killing the stuck rmb makes it actually receive all the messages that have been queuing on the relay side! This means the relay itself did not have the rmb connected before; only when rmb is killed and reconnects does the relay push the waiting messages to it
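
The send/recv queue check mentioned above can be repeated on Linux by reading `/proc/net/tcp`, whose `tx_queue:rx_queue` column is roughly what `ss` reports as Send-Q and Recv-Q. The following is a minimal sketch, not part of rmb: the names `TcpQueues` and `parse_proc_net_tcp_line` are hypothetical, the sample line in the usage note is made up for illustration, and the address decoding assumes IPv4 on a little-endian host.

```python
from typing import NamedTuple


class TcpQueues(NamedTuple):
    local: str      # local "ip:port"
    remote: str     # remote "ip:port"
    state: int      # numeric TCP state (1 means ESTABLISHED)
    tx_queue: int   # bytes queued for sending / not yet acknowledged
    rx_queue: int   # bytes received but not yet read by the application


def _decode_addr(hex_addr: str) -> str:
    """Decode an IPv4 'AABBCCDD:PPPP' field (assumes a little-endian host)."""
    ip_hex, port_hex = hex_addr.split(":")
    # The address is printed as one 32-bit hex word, so the octets are reversed.
    octets = [str(int(ip_hex[i:i + 2], 16)) for i in range(6, -1, -2)]
    return ".".join(octets) + ":" + str(int(port_hex, 16))


def parse_proc_net_tcp_line(line: str) -> TcpQueues:
    """Parse one data line (not the header line) of /proc/net/tcp."""
    fields = line.split()
    tx_hex, rx_hex = fields[4].split(":")
    return TcpQueues(
        local=_decode_addr(fields[1]),
        remote=_decode_addr(fields[2]),
        state=int(fields[3], 16),
        tx_queue=int(tx_hex, 16),
        rx_queue=int(rx_hex, 16),
    )
```

For example, feeding it an illustrative line such as `1: 0100007F:1F90 0100007F:C350 01 00000000:00000000 ...` yields an ESTABLISHED socket with both queues at zero, which matches the symptom described here: the connection looks healthy locally while nothing is in flight.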
The questions are:
Why did the TCP connection not time out after this long?
Why did the keepalive mechanism not kick in and detect the stalled connection?
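
As background for both questions: a TCP connection that is idle, with no unacknowledged data and no keepalive probes enabled, does not time out on its own, so a peer that only waits for incoming data can keep a dead connection open indefinitely. The sketch below shows two generic ways to detect this, and is not a description of how rmb does it. `enable_tcp_keepalive` and `HeartbeatWatchdog` are hypothetical names, the timing values are arbitrary, and the `TCP_KEEP*` and `TCP_USER_TIMEOUT` options are platform-specific (available on Linux), which is why they are guarded.

```python
import socket
import time


def enable_tcp_keepalive(sock, idle_s=30, interval_s=10, probes=3,
                         user_timeout_ms=60_000):
    """Turn on kernel-level keepalive probes for an existing TCP socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The options below are not available on every platform, so guard each one.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of idleness before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between successive probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is dropped.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    if hasattr(socket, "TCP_USER_TIMEOUT"):
        # Milliseconds that sent data may stay unacknowledged.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                        user_timeout_ms)


class HeartbeatWatchdog:
    """Application-level check: has anything arrived from the peer recently?"""

    def __init__(self, timeout_s, now=time.monotonic):
        self._timeout_s = timeout_s
        self._now = now              # injectable clock, useful for testing
        self._last_seen = now()

    def record_activity(self):
        """Call whenever a message or pong is received from the peer."""
        self._last_seen = self._now()

    def is_stalled(self):
        """True when nothing was received for longer than the timeout."""
        return self._now() - self._last_seen > self._timeout_s
```

A client would call `record_activity()` on every received frame (including replies to its own pings) and force a reconnect once `is_stalled()` returns true. The application-level check is the stricter of the two, since it only passes when the remote application actually answers, not just its kernel.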