htrefil / rkvm

Virtual KVM switch for Linux machines
MIT License
386 stars 50 forks source link

Connectivity issues after the server reboots, goes on sleep, etc. #49

Closed JPenuchot closed 9 months ago

JPenuchot commented 9 months ago

Hi,

I noticed quite a few times that the client often fails to reconnect when a server comes back up after a reboot or being suspended. I often have to restart either or both the client and server for the system to work again. Could be a dangling socket or something like that, I'm not sure yet.

I don't have logs extracts yet but I'll produce them soon.

Regards, Jules

htrefil commented 9 months ago

Hi, I'm having the same issue too.

@ckiee - this started happenning after I got your MR for TCP timeouts in. Given your code I don't see any reason why would that happen. Do you have any idea why could that be?

ckiee commented 9 months ago

No idea, but I also get this, I hadn't made the link yet either.

Perhaps it's better to replace it with application-level heartbeats, but I've since fixed the underlying hardware problem I was having so I'm okay with a revert if you aren't looking forward to more work.

ckiee commented 9 months ago

I'm trying to reproduce this but it's not happening with some naive attempts. It takes a bit but on my system eventually everything restarts and it reconnects, which means there's something unaccounted for or it just takes a few hours to show up.

After suspending the server everything comes back relatively gracefully: (wireshark on the client) wireshark cap

JPenuchot commented 9 months ago

Maybe the issue emerges when the reboots are too short for the TCP timeout? I'll try to experiment today.

JPenuchot commented 9 months ago

So I was able to replicate the issue. Here's what happened

Sep 29 17:27:08 jpenuchot-thinkpad rkvm-server[634]: [INFO  rkvm_server::server] 192.168.1.32:33972: Connected
Sep 29 17:27:08 jpenuchot-thinkpad rkvm-server[634]: [INFO  rkvm_server::server] 192.168.1.32:33972: TLS connected
Sep 29 17:27:08 jpenuchot-thinkpad rkvm-server[634]: [INFO  rkvm_server::server] 192.168.1.32:33972: Authenticated succe
ssfully
Sep 29 17:27:14 jpenuchot-thinkpad rkvm-server[634]: [ERROR rkvm_server::server] 192.168.1.32:38446: Disconnected: Conne
ction timed out (os error 110)
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Authenticated successfully
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Created new device 0 (name "Video Bus", vendor 0, product 6, version 0)
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Created new device 1 (name "SynPS/2 Synaptics TouchPad", vendor 2, product 7, version 433)
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Created new device 2 (name "PC Speaker", vendor 31, product 1, version 256)
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Created new device 3 (name "Logitech USB Receiver Keyboard", vendor 1133, product 50503, version 273)
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Created new device 4 (name "Logitech USB Receiver", vendor 1133, product 50503, version 273)
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Created new device 5 (name "AT Translated Set 2 keyboard", vendor 1, product 1, version 43860)
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Created new device 6 (name "Power Button", vendor 0, product 1, version 0)
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Created new device 7 (name "Sleep Button", vendor 0, product 3, version 0)
Sep 29 17:27:08 jpenuchot-nzxt rkvm-client[1254]: [INFO  rkvm_client::client] Created new device 8 (name "solaar-keyboard", vendor 1, product 1, version 1)

The server notified a disconnection from the previous socket (before rebooting) after the client connected successfully on a new socket. It could be that the client wrongfully gets delisted by the server.

EDIT: I was able to replicate that order but the system kept working. Also side note: the port from the previous connection had a higher number than the new connection, which wasn't the case when it failed. I'm thinking maybe the client only gets delisted when the first socket is on a lower port number because the socket list is sorted alphabetically..?

ckiee commented 9 months ago

Could you RUST_LOG=debug systemctl import-environment RUST_LOG and restart both rkvms?

JPenuchot commented 9 months ago

Client side, after this reboot the bug has been occurring:

Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Authenticated successfully
Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Created new device 0 (name "solaar-keyboard", vendor 1, product 1, version 1)
Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Created new device 1 (name "Video Bus", vendor 0, product 6, version 0)
Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Created new device 2 (name "SynPS/2 Synaptics TouchPad", vendor 2, product 7, version 433)
Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Created new device 3 (name "PC Speaker", vendor 31, product 1, version 256)
Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Created new device 4 (name "Logitech USB Receiver Keyboard", vendor 1133, product 50503, version 273)
Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Created new device 5 (name "Logitech USB Receiver", vendor 1133, product 50503, version 273)
Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Created new device 6 (name "AT Translated Set 2 keyboard", vendor 1, product 1, version 43860)
Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Created new device 7 (name "Power Button", vendor 0, product 1, version 0)
Sep 29 17:46:58 jpenuchot-nzxt rkvm-client[1259]: [INFO  rkvm_client::client] Created new device 8 (name "Sleep Button", vendor 0, product 3, version 0)

Server side:

Sep 29 17:43:57 jpenuchot-thinkpad rkvm-server[44941]: [INFO  rkvm_server::server] 192.168.1.32:56548: Connected
Sep 29 17:43:57 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rustls::server::hs] decided upon suite TLS13_AES_256_GCM_SHA384
Sep 29 17:43:57 jpenuchot-thinkpad rkvm-server[44941]: [INFO  rkvm_server::server] 192.168.1.32:56548: TLS connected
Sep 29 17:43:57 jpenuchot-thinkpad rkvm-server[44941]: [INFO  rkvm_server::server] 192.168.1.32:56548: Authenticated successfully
Sep 29 17:44:17 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 1
Sep 29 17:44:33 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 2
Sep 29 17:44:33 jpenuchot-thinkpad rkvm-server[44941]: [ERROR rkvm_server::server] 192.168.1.32:44588: Disconnected: Connection timed out (os error 110)
Sep 29 17:44:35 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 1
Sep 29 17:44:37 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 0
Sep 29 17:46:17 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 1
Sep 29 17:46:18 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 0
Sep 29 17:46:58 jpenuchot-thinkpad rkvm-server[44941]: [INFO  rkvm_server::server] 192.168.1.32:38522: Connected
Sep 29 17:46:58 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rustls::server::hs] decided upon suite TLS13_AES_256_GCM_SHA384
Sep 29 17:46:58 jpenuchot-thinkpad rkvm-server[44941]: [INFO  rkvm_server::server] 192.168.1.32:38522: TLS connected
Sep 29 17:46:58 jpenuchot-thinkpad rkvm-server[44941]: [INFO  rkvm_server::server] 192.168.1.32:38522: Authenticated successfully
Sep 29 17:47:08 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 1
Sep 29 17:47:08 jpenuchot-thinkpad rkvm-server[44941]: [ERROR rkvm_server::server] 192.168.1.32:56548: Disconnected: Connection timed out (os error 110)
Sep 29 17:47:10 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 0
Sep 29 17:47:13 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 0
Sep 29 17:47:15 jpenuchot-thinkpad rkvm-server[44941]: [DEBUG rkvm_server::server] Switched to client 0

The bug doesn't occur on the first reconnect, but it does after the reconnect at 17:46:58

JPenuchot commented 9 months ago

Again it happens when the port of the old socket is higher than the port of the new socket. Might be a coincidence but worth noting.

ckiee commented 9 months ago

Can you try this?

diff --git a/rkvm-server/src/server.rs b/rkvm-server/src/server.rs
index 99fa531..d242c7d 100644
--- a/rkvm-server/src/server.rs
+++ b/rkvm-server/src/server.rs
@@ -63,9 +63,9 @@ pub async fn run(

                 // Remove dead clients.
                 clients.retain(|_, client: &mut Sender<_>| !client.is_closed());
-                if !clients.contains(current) {
-                    current = 0;
-                }
+                // if !clients.contains(current) {
+                //     current = 0;
+                // }

                 let init_updates = devices
                     .iter()
@@ -82,7 +82,7 @@ pub async fn run(
                     .collect();

                 let (sender, receiver) = mpsc::channel(1);
-                clients.insert(sender);
+                current = clients.insert(sender);

                 let span = tracing::info_span!("connection", addr = %addr);
                 tokio::spawn(
JPenuchot commented 9 months ago

Sure! Could you apply this patch in a separate branch? I'm using rkvm through the AUR package but I'm not very familiar with patching, it would be quicker and easier for me to modify the pkgfile to simply track another branch.

ckiee commented 9 months ago

sure, https://github.com/ckiee/rkvm/tree/debug-2023-09-29

JPenuchot commented 9 months ago

Looks to me like it's still happening. Same behavior as before. I double-checked the sources after running makepkg on a PKGBUILD file that tracks your branch and the patch was applied as it should be.

NB: I installed it on both machines

htrefil commented 9 months ago

Hey everyone, thanks for the help and sorry for the delay. I reverted @ckiee's changes and looking into adding application level heartbeat so that I can release it for 0.5.0.

htrefil commented 9 months ago

Fixed in 0.5.0.