threefoldtech / rmb-rs

RMB implementation in rust
Apache License 2.0

Some rmb-peer connections are stalling and showing CLOSE_WAIT state #158

Closed muhamadazmy closed 6 months ago

muhamadazmy commented 9 months ago

We found that some nodes' rmb connections are stuck in the CLOSE_WAIT state. These connections should be closed, but they never terminate.


The recv queue also shows that some messages have been received on that socket but never read by the application (rmb). This means rmb got stuck reading from the socket and is no longer reading data from the low-level connection.

This can be:

sameh-farouk commented 9 months ago

Are there any panics in the logs?

I looked at this for a couple of days, and here is my update:

What is the CLOSE_WAIT state: When one side closes the connection, the socket on the other side transitions to the CLOSE_WAIT state. CLOSE_WAIT indicates that the socket has been closed on the remote side (the relay side) and that the system is waiting for the local process (the RMB peer) to close it. Typically, when the remote side closes the connection (sends FIN), the process becomes aware of it and closes the socket as well (sends ACK + FIN). However, if the process has become unresponsive when the remote side closes the connection, the socket is never closed locally and stays in the CLOSE_WAIT state.
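To make the state transition concrete, here is a minimal std-only sketch (not taken from rmb-rs; the function name `read_after_remote_close` is made up for illustration). A local listener plays the relay and closes first. The peer side then sees end-of-file as a read of 0 bytes, and its socket stays in CLOSE_WAIT until the process drops it.

```rust
use std::io::Read;
use std::net::{TcpListener, TcpStream};

/// Hypothetical helper: lets the "relay" end close first and returns how
/// many bytes the peer's next read yields.
pub fn read_after_remote_close() -> std::io::Result<usize> {
    // Port 0 asks the OS for any free port on loopback.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut peer = TcpStream::connect(listener.local_addr()?)?;
    let (relay_side, _addr) = listener.accept()?;

    // The remote end closes: it sends FIN, and from this point the
    // peer's socket is in CLOSE_WAIT.
    drop(relay_side);

    // A read on the peer side reports end-of-file as Ok(0). This is the
    // signal a responsive process uses to know it should close too.
    let mut buf = [0u8; 64];
    let n = peer.read(&mut buf)?;

    // `peer` is dropped when the function returns, which closes the
    // socket and lets it leave CLOSE_WAIT. A process that never reaches
    // this point (for example, a stalled loop) leaves the socket stuck.
    Ok(n)
}
```

If the process never performs that read, or never drops the stream afterwards, the kernel has no way to finish the close on its own, which matches what the stuck nodes show.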

What I tried so far:

How we could fix this: I'm uncertain about the root cause of the issue, and without access to a production node experiencing this problem, it is hard to pinpoint potential causes.

I can offer a couple of additional suggestions:

So what might be the reason behind this?

Other, less likely guesses:

Your thoughts, @muhamadazmy?

muhamadazmy commented 9 months ago

Adding to your thoughts:

This means that, according to the code here https://github.com/threefoldtech/rmb-rs/blob/main/src/peer/socket.rs#L135, this branch must get "unblocked" by the ping send, which then recalculates how long ago the last pong was received. If the last pong was received too long ago, the connection is dropped and a new connection is created. But for some reason this never happened, as if the select! itself is blocked.

Another possibility is that the logic that handles a received message blocks at some point, which prevents the loop from calling select! again and brings all reading and sending to a stall. Maybe we need to look into this.
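As a rough illustration of the ping/pong watchdog described above, here is a std-only sketch. It is not the code in socket.rs: `recv_timeout` stands in for the `select!` over the socket and the ping timer, and `Event`, `Outcome`, and `run_loop` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::{Duration, Instant};

/// Hypothetical events the reader side hands to the loop.
pub enum Event {
    Pong,
    Message(Vec<u8>),
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// No pong arrived within `pong_timeout`; the caller should reconnect.
    Reconnect,
    /// The event source went away.
    Closed,
}

pub fn run_loop(
    events: Receiver<Event>,
    ping_interval: Duration,
    pong_timeout: Duration,
) -> Outcome {
    let mut last_pong = Instant::now();
    loop {
        // Wake up on the next event or when the ping interval elapses,
        // whichever comes first.
        match events.recv_timeout(ping_interval) {
            Ok(Event::Pong) => last_pong = Instant::now(),
            Ok(Event::Message(_payload)) => {
                // Handle the message here. If this step blocks, the
                // staleness check below is never reached again.
            }
            Err(RecvTimeoutError::Timeout) => {
                // Ping tick: a real loop would send a ping frame here.
            }
            Err(RecvTimeoutError::Disconnected) => return Outcome::Closed,
        }
        // Runs after every wake-up, so a quiet connection is still detected.
        if last_pong.elapsed() > pong_timeout {
            return Outcome::Reconnect;
        }
    }
}
```

The point of the sketch is that the staleness check only runs if the loop keeps coming back around. Anything that blocks inside one of the branches silently disables the watchdog.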
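One way to rule out a blocking handler, sketched here under the assumption that message handling can be moved off the socket loop (nothing in this thread confirms rmb-rs works this way), is to pass received messages to a worker over a bounded channel with a non-blocking send. The names `Dispatch` and `dispatch` are hypothetical.

```rust
use std::sync::mpsc::{SyncSender, TrySendError};

#[derive(Debug, PartialEq)]
pub enum Dispatch {
    /// The worker's queue accepted the message.
    Queued,
    /// The queue was full; the message was dropped instead of waiting.
    Dropped,
    /// The worker side has shut down.
    WorkerGone,
}

/// Hands a received message to a worker without ever blocking the caller.
pub fn dispatch(tx: &SyncSender<Vec<u8>>, msg: Vec<u8>) -> Dispatch {
    match tx.try_send(msg) {
        Ok(()) => Dispatch::Queued,
        // try_send returns immediately when the bounded queue is full.
        Err(TrySendError::Full(_)) => Dispatch::Dropped,
        Err(TrySendError::Disconnected(_)) => Dispatch::WorkerGone,
    }
}
```

Dropping on a full queue is only one possible policy. The property that matters for this issue is that the socket loop always returns to its select, so pings and the pong check keep running even when the handler is slow.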

sameh-farouk commented 8 months ago

I guess this can be closed, and re-opened only if there are new reports that this is still happening with the new release, @muhamadazmy.

sameh-farouk commented 6 months ago

Reopening this issue, as we have new reports on devnet that this is still happening with the latest release.

sameh-farouk commented 6 months ago

After checking node 29, I can see that the connection status is ESTAB:

ip netns exec ndmz ss -npt | grep rmb
ESTAB     0          0                                     10.4.0.135:51384                      185.206.122.7:443      users:(("rmb",pid=30446,fd=9))          

This is different from the previous reports and should be tracked in a different issue. Closing this one.