thelastpickle / cassandra-medusa

Apache Cassandra Backup and Restore Tool
Apache License 2.0
266 stars 143 forks source link

SSH Timeout Error on first backup-cluster or restore #689

Closed lopesjp closed 1 month ago

lopesjp commented 1 year ago

Project board link

Hi :wave:

We have been trying Medusa on a Cassandra cluster with 10 nodes and around 1 TB of data.

To perform the first backup of the cluster, we are running the medusa backup-cluster --backup-name <name> --mode differential.

These backups are going to a S3 bucket, which do not have any backups, so I assume that although is differential, basically does a full backup?!

Nonetheless, the curious thing we found was that for the last nodes, the backup was incomplete. We ran the command on node 01 at 17:28, which ran until 19:15, checking the logs from the process we noticed an error

[2023-11-27 19:15:38,899] ERROR: This error happened during the cluster backup: (-1, b'Socket error: disconnected')
Traceback (most recent call last):
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 64, in orchestrate
    backup.execute(cql_session_provider)
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 150, in execute
    self._upload_backup()
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/backup_cluster.py", line 172, in _upload_backup
    hosts_variables={})
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/orchestration.py", line 83, in pssh_run
    display_output(error)
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/medusa/orchestration.py", line 26, in display_output
    for line in host_out.stdout:
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/base/single.py", line 246, in read_output_buffer
    for line in output_buffer:
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/ssh/single.py", line 301, in _read_output
    _reader.get(timeout=timeout)
  File "src/gevent/greenlet.py", line 805, in gevent._gevent_cgreenlet.Greenlet.get
  File "src/gevent/greenlet.py", line 373, in gevent._gevent_cgreenlet.Greenlet._raise_exception
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/gevent/_compat.py", line 66, in reraise
    raise value.with_traceback(tb)
  File "src/gevent/greenlet.py", line 908, in gevent._gevent_cgreenlet.Greenlet.run
  File "/opt/cassandra-medusa/lib64/python3.6/site-packages/pssh/clients/ssh/single.py", line 324, in _read_output_to_buffer
    size, data = channel.read_nonblocking(is_stderr=is_stderr)
  File "ssh/channel.pyx", line 189, in ssh.channel.Channel.read_nonblocking
  File "ssh/utils.pyx", line 94, in ssh.utils.handle_error_codes
ssh.exceptions.SSHError: (-1, b'Socket error: disconnected')

This error causes it to crash and trigger the clean-up of the snapshots.

[2023-11-27 19:15:38,898] ERROR: Job executing "mkdir -p /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef; cd /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef && medusa-wrapper sudo medusa  -vvv backup-node --backup-name test-all-cluster-drk   --mode differential" ran and finished with errors on following nodes...
[2023-11-27 19:15:38,898] DEBUG: Waiting for stdout reader
[2023-11-27 19:15:38,899] ERROR: This error happened during the cluster backup: (-1, b'Socket error: disconnected')
...
[2023-11-27 19:15:38,907] ERROR: Something went wrong! Attempting to clean snapshots and exit.
...
[2023-11-27 19:15:54,756] INFO: Job executing "nodetool -Dcom.sun.jndi.rmiURLParsing=legacy -u <user> -pw <password> clearsnapshot -t medusa-test-all-cluster-drk" ran and finished Successfully on all nodes.
[2023-11-27 19:15:54,757] INFO: All nodes successfully cleared their snapshot.

Meanwhile, on one of the nodes that was incomplete, it shows that the upload was still ongoing, and it stopped due to the fact that the snapshot was no longer there.

[2023-11-27 19:15:11,897] DEBUG: [https://s3-eu-west-1.amazonaws.com:443](https://s3-eu-west-1.amazonaws.com/) "PUT /sbk-medusa-prd/data/fbp_markets_journal_store/messages-fd3f4700de1a11eabb225d971d2361
fe/mc-455-big-TOC.txt HTTP/1.1" 200 0
...
[2023-11-27 19:20:21,347] ERROR: Error occurred during backup: awscli cp failed. Max attempts exceeded. Check /tmp/awscli_50571839-2d02-4c4d-8c52-afdad2e56732.output for more informations.
...
The user-provided path /var/lib/cassandra/data/fbp_markets_journal_store/messages-fd3f4700de1a11eabb225d971d2361fe/snapshots/medusa-test-all-cluster-drk/mc-6305-big-Data.db does not exist.

I assume this happens because it takes much time to do the backup of all nodes and all data, and the SSH connection might time out.

But what is this SSH connection? Is the first node that opens this connection to all nodes to execute this command mkdir -p /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef; cd /tmp/medusa-job-40c0e132-2a9c-4822-b179-4b1150b0b7ef && medusa-wrapper sudo medusa -vvv backup-node --backup-name test-all-cluster-drk --mode differential? Is the connection open at the beginning and kept until all nodes have finished the backup?

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: MED-18

lopesjp commented 11 months ago

As an update to this, it is clearly related with the SSH keep-alive. Happens when performing the first backup of high volume data or a restore.

On both actions, an SSH connection is established that requires to be open until the end. This has happened for a cluster of 10 nodes and more than 1Tb.

Is there a recommended approach for maintaining the Medusa SSH connection without resorting to constantly adjusting the SSH keep-alive interval settings on the node from which the command is executed?