thelastpickle / cassandra-medusa

Apache Cassandra Backup and Restore Tool
Apache License 2.0

restore-cluster error starting cassandra #822

Closed chrisjmiller1 closed 9 hours ago

chrisjmiller1 commented 3 days ago


Hi folks,

Testing restore-cluster using medusa 0.22.3.

The restore itself works perfectly, but the run fails at the last step, starting Cassandra.

stdout has the following:

[2024-11-26 10:27:22,776] INFO: Executing "mkdir -p /tmp/medusa-job-66ff7a96-9771-41f5-972a-2cbdaed9086f; cd /tmp/medusa-job-66ff7a96-9771-41f5-972a-2cbdaed9086f && medusa-wrapper sudo medusa --fqdn=%s -vvv restore-node --in-place %s --no-verify --backup-name backup6 --temp-dir /tmp " on following nodes ['mxiad-tfdevmet01', 'mxiad-tfdevmet02', 'mxiad-tfdevmet03'] with a parallelism/pool size of 3
[2024-11-26 10:28:01,975] ERROR: Job executing "mkdir -p /tmp/medusa-job-66ff7a96-9771-41f5-972a-2cbdaed9086f; cd /tmp/medusa-job-66ff7a96-9771-41f5-972a-2cbdaed9086f && medusa-wrapper sudo medusa --fqdn=%s -vvv restore-node --in-place %s --no-verify --backup-name backup6 --temp-dir /tmp " ran and finished with errors on following nodes: ['mxiad-tfdevmet01', 'mxiad-tfdevmet02', 'mxiad-tfdevmet03']
[2024-11-26 10:28:01,976] ERROR: Some nodes failed to restore. Exiting
[2024-11-26 10:28:01,976] ERROR: This error happened during the cluster restore: Some nodes failed to restore. Exiting
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/medusa/restore_cluster.py", line 72, in orchestrate
    restore.execute()
  File "/usr/local/lib/python3.9/site-packages/medusa/restore_cluster.py", line 155, in execute
    self._restore_data()
  File "/usr/local/lib/python3.9/site-packages/medusa/restore_cluster.py", line 410, in _restore_data
    raise RuntimeError(err_msg)
RuntimeError: Some nodes failed to restore. Exiting

medusa.log has the following:

[2024-11-26 09:28:49,101] INFO: Starting Cassandra
[2024-11-26 09:28:49,101] DEBUG: Starting Cassandra with ['cassandra']
[2024-11-26 09:28:49,397] DEBUG: Disconnecting from S3...

whereas stderr has the following:

[2024-11-26 09:28:59,102] INFO: Starting Cassandra
[2024-11-26 09:28:59,102] DEBUG: Starting Cassandra with ['/opt/imail1/cassandra/bin/cassandra']
[2024-11-26 09:28:59,411] DEBUG: Disconnecting from S3...
Traceback (most recent call last):
  File "/usr/local/bin/medusa", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 92, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/medusa/medusacli.py", line 275, in restore_node
    medusa.restore_node.restore_node(medusaconfig, Path(temp_dir), backup_name, in_place, keep_auth, seeds,
  File "/usr/local/lib/python3.9/site-packages/medusa/restore_node.py", line 50, in restore_node
    restore_node_locally(config, temp_dir, backup_name, in_place, keep_auth, seeds, storage,
  File "/usr/local/lib/python3.9/site-packages/medusa/restore_node.py", line 137, in restore_node_locally
    cassandra.start_with_implicit_token()
  File "/usr/local/lib/python3.9/site-packages/medusa/cassandra_utils.py", line 650, in start_with_implicit_token
    subprocess.check_output(cmd)
  File "/usr/local/lib64/python3.9/site-packages/gevent/subprocess.py", line 418, in check_output
    raise CalledProcessError(retcode, process.args, output=output)
subprocess.CalledProcessError: Command '['/opt/imail1/cassandra/bin/cassandra']' returned non-zero exit status 1.
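The traceback shows only that the start command returned exit status 1, not why it failed. As a rough diagnostic sketch, the same failure mode can be reproduced outside Medusa with the standard-library `subprocess` module (the traceback above goes through gevent's variant, which I am assuming behaves the same way here). The helper name `run_and_report` and the stand-in failing command are hypothetical and only for illustration; on a real node you would pass the actual start command instead.

```python
import subprocess
import sys


def run_and_report(cmd):
    """Run cmd; on failure return (exit status, stderr text) instead of raising."""
    try:
        # check_output raises CalledProcessError on any non-zero exit status,
        # which is the exception seen at the bottom of the traceback.
        subprocess.check_output(cmd, stderr=subprocess.PIPE)
        return 0, ""
    except subprocess.CalledProcessError as exc:
        # exc.stderr holds whatever the command wrote before exiting.
        return exc.returncode, exc.stderr.decode(errors="replace")


# Hypothetical stand-in for a start command that refuses to run.
failing_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('refusing to start\\n'); sys.exit(1)",
]

status, err = run_and_report(failing_cmd)
print(status, err.strip())
```

Running the real start command this way, under the same user that Medusa uses, should surface the message that explains the non-zero exit.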

I believe this is because Cassandra is being started using sudo. Is there a workaround for this?

Also, is it possible to run the restore in parallel but perform the startup in a rolling fashion?

Thanks,

Chris.

┆Issue is synchronized with this Jira Story by Unito ┆Issue Number: MED-107

chrisjmiller1 commented 9 hours ago

I used sudo -u as a workaround, and it allows Medusa to start Cassandra successfully.
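For anyone landing here later, a minimal sketch of what the `sudo -u` form of a start command looks like when built as an argument list. The comment above does not say where the `sudo -u` was applied, so the helper `as_user` and the account name `cassandra` are assumptions for illustration only; the binary path is the one from the stderr output above.

```python
def as_user(cmd, user):
    """Prefix an argument list so it runs as `user` via `sudo -u`."""
    return ["sudo", "-u", user, *cmd]


# "cassandra" is an assumed service account name; use whichever
# non-root user owns the Cassandra installation on your nodes.
start_cmd = as_user(["/opt/imail1/cassandra/bin/cassandra"], "cassandra")
print(start_cmd)
```

The point of the prefix is that the process ends up running as the named user rather than as root, which plain `sudo` would give you.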