[Bug]: The connection between the servers is interrupted for a second if the connection to the server (Netmaker) is lost

josifpeev commented 5 months ago

Contact Details

djimbo84@gmail.com

What happened?

Hello, I'm using Netmaker with a few machines connected in a mesh. We also run databases (Percona XtraDB Cluster) inside the VPN mesh, and they communicate through it. There's a strange issue: when the internet connection to the Netmaker server is interrupted, the server is intentionally shut down, or Docker is stopped (during backup), connectivity between the nodes is lost momentarily (for about a second). The databases disconnect from each other, reconnect, and begin to sync again; during this time the cluster is in a non-primary state, which is bad. Nagios catches the event too: many nodes, not just the DBs, fail their PING checks and pass again on the second try. The connections between the nodes in the VPN mesh rebuild even while the Netmaker server is still off. We don't see this interruption when we start the Netmaker server, only when it goes down, and if we stop it again, the situation repeats. How can we avoid this interruption? Any ideas?

Version

v0.23.0

What OS are you using?

Linux

Relevant log output

No response

Contributing guidelines

[X] Yes, I did.

abhishek9686 commented 5 months ago

When you say the server is shut down, I assume all containers, including MQ, are shut down? That triggers a connection-lost handler on the client, which restarts the client to re-initiate its connection to the broker. That's why you see connection loss for a few seconds.
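
A toy model of that sequence, purely for illustration: the class and method names below are hypothetical, and this is not Netmaker's client code. It only sketches why a handler that restarts the whole client would also produce a short gap in peer traffic, assuming the restart briefly tears down the peer links.

```python
from dataclasses import dataclass, field


@dataclass
class ToyClient:
    """Hypothetical stand-in for a mesh client; not the real client."""
    broker_up: bool = True
    peers_up: bool = True
    events: list = field(default_factory=list)

    def on_connection_lost(self) -> None:
        # Assumed behaviour, taken from the explanation above: losing the
        # broker triggers a full client restart, not a broker-only retry.
        self.events.append("broker connection lost")
        self.restart()

    def restart(self) -> None:
        # Assumption: restarting drops the peer links for a moment ...
        self.peers_up = False
        self.events.append("peer links down")
        # ... and they come back without the server, as the thread reports
        # that the mesh rebuilds while the server is still off.
        self.peers_up = True
        self.events.append("peer links up")


client = ToyClient()
client.broker_up = False      # the broker container was stopped
client.on_connection_lost()
print(client.events)
```

In this model the gap sits between "peer links down" and "peer links up"; the broker stays unreachable the whole time, which matches the report that the mesh recovers while the server is still off.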

josifpeev commented 5 months ago

We tried stopping the containers one by one (stop, wait and monitor, then start). The issue arises with MQ and Caddy. Either one of the two triggers the connection disruption, so it's not just the MQ. Is there any way to create a backup without interrupting the connection (we presume the database/Caddy should be stopped for a proper backup)?
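
The one-at-a-time test described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the container names and peer address in the usage note are placeholders, the `docker` CLI is assumed to be on the path, and the `ping` flags assume Linux iputils. The helper names (`probe`, `docker`, `peer_reachable`) are made up for this example.

```python
import subprocess
import time


def peer_reachable(peer: str) -> bool:
    """Send one ICMP echo; True if the peer answered (Linux iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", peer],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def docker(action: str, container: str) -> None:
    """Run `docker stop <name>` or `docker start <name>`."""
    subprocess.run(["docker", action, container], check=True)


def probe(container, peer, checks=10, interval=1.0,
          control=docker, reachable=peer_reachable, sleep=time.sleep):
    """Stop one container, count failed pings while it is down, restart it."""
    control("stop", container)
    failures = 0
    try:
        for _ in range(checks):
            if not reachable(peer):
                failures += 1
            sleep(interval)
    finally:
        # Always bring the container back, even if monitoring raised.
        control("start", container)
    return failures
```

For example, `probe("mq", "10.0.0.2")` followed by `probe("caddy", "10.0.0.2")` would report how many pings to a mesh peer failed while each container was down. A one-second ping interval can miss an interruption that lasts under a second, so a shorter `interval` may be needed.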

abhishek9686 commented 5 months ago

Yes, both MQ and Caddy can cause the connection to be interrupted. A workaround would be to host MQ on a separate machine that is not interrupted.

josifpeev commented 5 months ago

We have no problem skipping the MQ backup (so we will not stop it); it doesn't retain any important data anyway. But we must back up the Caddy database, so we must stop it, and stopping Caddy interrupts the connection. So how are we supposed to back up the container data without stopping it?

abhishek9686 commented 5 months ago

Why is the Caddy backup needed?

josifpeev commented 5 months ago

Yes, you are right. It's working now. Thank you!

gravitl / netmaker