Closed muhamadazmy closed 6 months ago
Are there any panics in the logs?
I looked at this for a couple of days, and here is my update:
What the CLOSE_WAIT state is: when one side closes the connection, the socket on the other side transitions to the CLOSE_WAIT state. CLOSE_WAIT indicates that the connection is closed on the remote side (the relay side) and the system is waiting for the local process (the RMB peer) to close its end. Typically, when the remote side closes the connection (sends a FIN), the process becomes aware of it and closes its side as well (ACK + FIN). However, if the process is unresponsive when the remote side closes the connection, the socket is never closed and stays in the CLOSE_WAIT state.
What I tried so far:
How we could fix this: I’m uncertain about the root cause of the issue, and without access to the production node experiencing this problem, it would be challenging to pinpoint potential causes.
I can offer a couple of additional suggestions:
So what might be the reason behind this?
Other, less likely guesses:
Guess 2: It’s possible that the sockets are not being closed because the client believes the connection is still alive, as if the code is waiting for an event that never occurs. This is unlikely unless there is a tokio-tungstenite bug. In the retainer code, handling the end of the stream while reading messages should be enough, but I think we could also match against a close frame (Message::Close) in the retainer, just in case. We should also filter control frames in the retainer; currently we filter only Message::Pong. But why would the server side initiate a close in the first place if the client is functioning properly?
Guess 3: Although unlikely, there could be a deadlock. I noticed a couple of places where we finish using a lock and could release it early instead of holding it across an .await. This is an optimization but unlikely to help with this issue. Also, the current_thread tokio runtime flavor should be an option here, unless proven otherwise, since we only spawn a few tasks and open a few sockets; current_thread is a lightweight, single-threaded runtime in which a mutex will never be contended.
Guess 4: It’s possible that a server-side TCP accept queue overflow could cause the server to cut the connection and leave it half-open. However, I believe the retainer should detect this and drop the connection as well.
your thoughts @muhamadazmy ?
Adding to your thought:
This means that, according to the code here https://github.com/threefoldtech/rmb-rs/blob/main/src/peer/socket.rs#L135, this branch must get "unblocked" by the ping send, which then recalculates when the last pong was received. If the last pong was received too long ago, the connection is dropped and a new connection is created. But for some reason this never happens, as if the select! itself is blocked.
Another possibility is that the logic that handles a received message blocks at some point, preventing the loop from calling select! again, which stalls all reading and sending. Maybe we need to look into this.
I guess this can be closed and re-opened only if there are new reports that this is still happening with the new release @muhamadazmy
Reopening this issue, as we have new reports on devnet that this is still happening with the latest release.
After checking node 29, I can see that the connection status is ESTAB
```
ip netns exec ndmz ss -npt | grep rmb
ESTAB 0 0 10.4.0.135:51384 185.206.122.7:443 users:(("rmb",pid=30446,fd=9))
```
This is different from the previous reports and should be tracked in a different issue. Closing this one.
We found that some nodes' rmb connections are stuck in the CLOSE_WAIT state. These should be closed, but they never terminate.
The recv queue also shows that some messages have been received on that socket but never read by the application (rmb). This means rmb got stuck reading data from the socket and is no longer reading from the low-level connection.
This can be: