Closed lopesjp closed 1 month ago
As an update to this, it is clearly related to the SSH keep-alive. It happens when performing the first backup of a high volume of data, or a restore.
Both actions establish an SSH connection that must stay open until the end. This has happened on a cluster of 10 nodes with more than 1 TB of data.
Is there a recommended approach for maintaining the Medusa SSH connection without resorting to constantly adjusting the SSH keep-alive interval settings on the node from which the command is executed?
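For context, one workaround I can think of is to start the per-node backup detached on each node, so that nothing depends on a single long-lived SSH session. This is only a sketch under assumptions, not something confirmed by the Medusa maintainers: the `build_backup_cmd` helper, the `NODES` variable and the log path are hypothetical, and I am assuming that running `medusa backup-node` with the same backup name on every node gives a result equivalent to `backup-cluster`. The `medusa backup-node` flags are the ones shown in this issue, and `ServerAliveInterval` / `ServerAliveCountMax` are standard OpenSSH client options.

```shell
#!/bin/sh
# Hypothetical helper: print the command to run on each node.
# nohup plus the redirections let the backup outlive the SSH session
# that started it.
build_backup_cmd() {
    backup_name="$1"
    printf 'nohup sudo medusa backup-node --backup-name %s --mode differential > /tmp/medusa-%s.log 2>&1 < /dev/null &\n' \
        "$backup_name" "$backup_name"
}

# Hypothetical, space-separated host list, e.g. NODES="node01 node02".
# Empty by default, so the loop below does nothing until it is set.
NODES="${NODES:-}"

for node in $NODES; do
    # Keep-alives only need to cover the short session that launches the job.
    ssh -o ServerAliveInterval=60 -o ServerAliveCountMax=10 \
        "$node" "$(build_backup_cmd test-all-cluster-drk)"
done
```

With this approach the progress of each node would have to be checked separately, for example by reading the per-node log file.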
Hi :wave:
We have been trying Medusa on a Cassandra cluster with 10 nodes and around 1 TB of data.
To perform the first backup of the cluster, we are running:
medusa backup-cluster --backup-name <name> --mode differential
These backups go to an S3 bucket that does not contain any backups yet, so I assume that, although the mode is differential, this effectively performs a full backup?
Nonetheless, the curious thing we found was that the backup was incomplete for the last nodes. We ran the command on node 01 at 17:28, and it ran until 19:15. Checking the logs from the process, we noticed an error.
This error causes it to crash and trigger the clean-up of the snapshots.
Meanwhile, one of the nodes with an incomplete backup shows that the upload was still ongoing, and that it stopped only because the snapshot was no longer there.
I assume this happens because backing up all nodes and all data takes a long time, and the SSH connection might time out.
But what is this SSH connection? Is it the first node that opens this connection to all nodes in order to execute the following command?
mkdir -p /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef; cd /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef && medusa-wrapper sudo medusa -vvv backup-node --backup-name test-all-cluster-drk --mode differential
Is the connection opened at the beginning and kept open until all nodes have finished the backup?
┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: MED-18