Closed muhamadazmy closed 6 months ago
Are there any panics in the logs?
I looked at this for a couple of days, and here is my update:
What the CLOSE_WAIT state is: when one side closes the connection, the socket on the other side transitions to the CLOSE_WAIT state. CLOSE_WAIT indicates that the connection is closed on the remote side (the relay side) and the system is waiting for the local process (the RMB peer) to close its end. Typically, when the remote side closes the connection (sends a FIN), the process becomes aware of it and closes its side as well (ACK + FIN). However, if the process is unresponsive when the remote side closes the connection, the socket is never closed and stays in the CLOSE_WAIT state.
What I tried so far:
How we could fix this: I’m uncertain about the root cause of the issue, and without access to the production node experiencing this problem, it would be challenging to pinpoint potential causes.
I can offer a couple of additional suggestions:
So what might be the reason behind this?
Other, less likely guesses:
Guess 2: It’s possible that the sockets are not being closed because the client believes the connection is still alive, as if the code is waiting for an event that never occurs. This is unlikely unless there is a tokio-tungstenite bug. In the retainer code, handling the end of the stream while reading messages should be enough, but I think we could also match against a close frame (Message::Close) in the retainer, just in case. We should also filter control frames in the retainer; currently we filter only Message::Pong. But why would the server side initiate a close in the first place if the client is functioning properly?
Guess 3: Although unlikely, there could be a deadlock. I noticed a couple of places where we finish using a lock and could release it early instead of holding it across an .await. This is an optimization but unlikely to help with this issue. Also, the current_thread tokio runtime flavor should be an option here, unless proven otherwise, since we only spawn a few tasks and open a few sockets; current_thread is a lightweight, single-threaded runtime in which a mutex will never be contended.
Guess 4: It’s possible that a server-side TCP accept queue overflow could cause the server to cut the connection and leave it half-open. However, I believe the retainer should detect this and drop the connection as well.
your thoughts @muhamadazmy ?
Adding to your thought:
This means that, according to the code here https://github.com/threefoldtech/rmb-rs/blob/main/src/peer/socket.rs#L135, this branch must get "unblocked" by the ping send, which then recalculates when the last pong was received. If the last pong was received too long ago, the connection is dropped and a new connection is created. But for some reason this never happens, as if the select! itself is blocked.
Another possibility is that the logic that handles a received message blocks at some point, preventing the loop from calling select! again, which stalls all reading and sending. Maybe we need to look into this.
I guess this can be closed and re-opened only if there are new reports that this is still happening with the new release @muhamadazmy
Reopening this issue, as we have new reports on devnet that this is still happening with the latest release.
After checking node 29, I can see that the connection status is ESTAB
```
ip netns exec ndmz ss -npt | grep rmb
ESTAB 0 0 10.4.0.135:51384 185.206.122.7:443 users:(("rmb",pid=30446,fd=9))
```
This is different from the previous reports and should be tracked in a different issue. Closing this one.
We found that some nodes' rmb connections are stuck in the CLOSE_WAIT state. These should be closed, but they never terminate.
The recv queue also shows that some messages have been received on that socket but never read by the application (rmb). This means rmb got stuck reading data from the socket and is no longer reading from the low-level connection.
This can be: