thelastpickle / cassandra-medusa

Apache Cassandra Backup and Restore Tool
Apache License 2.0

medusa backup-cluster failed with Snapshot ... already exists #676

Open th2zz opened 1 year ago

th2zz commented 1 year ago


Hello, I am experimenting with cassandra-medusa using the local storage provider, and I plan to use it in our Cassandra cluster for database backup and recovery.

My environment is:

When I run the medusa backup-cluster command, it first reports:

[2023-10-27 05:39:23,997] INFO: Creating snapshots on all nodes
[2023-10-27 05:39:23,997] INFO: Executing "nodetool "-Dcom.sun.jndi.rmiURLParsing=legacy" snapshot -t medusa-202310270539" on following nodes ['cassandranode3', 'cassandranode2', 'cassandranode1'] with a parallelism/pool size of 500
[2023-10-27 05:39:25,827] INFO: Job executing "nodetool "-Dcom.sun.jndi.rmiURLParsing=legacy" snapshot -t medusa-202310270539" ran and finished Successfully on all nodes.
[2023-10-27 05:39:25,828] INFO: A snapshot medusa-202310270539 was created on all nodes.
[2023-10-27 05:39:25,828] INFO: Uploading snapshots from nodes to external storage

and then it consistently fails with:

[2023-10-27 05:47:48,955] ERROR: Job executing "mkdir -p /tmp/medusa-job-c092a5ec-c2af-4ce9-8cec-fa72aa152f2e; cd /tmp/medusa-job-c092a5ec-c2af-4ce9-8cec-fa72aa152f2e && medusa-wrapper  medusa  -vvv backup-node --backup-name 202310270539   --mode differential" ran and finished with errors on following nodes: ['cassandranode1', 'cassandranode2', 'cassandranode3']
[2023-10-27 05:47:48,956] ERROR: Some nodes failed to upload the backup.
[2023-10-27 05:47:48,957] ERROR: This error happened during the cluster backup: Some nodes failed to upload the backup.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/medusa/backup_cluster.py", line 64, in orchestrate
    backup.execute(cql_session_provider)
  File "/usr/local/lib/python3.6/site-packages/medusa/backup_cluster.py", line 150, in execute
    self._upload_backup()
  File "/usr/local/lib/python3.6/site-packages/medusa/backup_cluster.py", line 177, in _upload_backup
    raise Exception(err_msg)
Exception: Some nodes failed to upload the backup.

Here is the output from the stderr file in the temporary directory Medusa created. It says "Snapshot medusa-202310270129 already exists.", but this is in fact the first time I triggered the backup, and all I did was run medusa backup-cluster.

[2023-10-27 01:29:10,093] DEBUG: Closed socket to cassandranode1:9042
[2023-10-27 01:29:10,094] DEBUG: Blob cassandra-backup-test-/cassandranode1/202310270129/meta/tokenmap.json was not found in cache.
[2023-10-27 01:29:10,094] DEBUG: [Storage] Getting object cassandra-backup-test-/cassandranode1/202310270129/meta/tokenmap.json
...
[2023-10-27 01:29:10,095] DEBUG: [Storage] Getting object cassandra-backup-test-/cassandranode1/202310270129/meta/schema.cql
[2023-10-27 01:29:10,096] DEBUG: [Storage] Getting object cassandra-backup-test-/index/latest_backup/cassandranode1/backup_name.txt
[2023-10-27 01:29:10,096] DEBUG: [Storage] Getting object cassandra-backup-test-/cassandranode1/None/meta/differential
[2023-10-27 01:29:10,096] DEBUG: [Storage] Getting object cassandra-backup-test-/cassandranode1/None/meta/incremental
[2023-10-27 01:29:10,096] INFO: Node cassandranode1 does not have latest backup
[2023-10-27 01:29:10,096] INFO: Creating snapshot
[2023-10-27 01:29:10,096] DEBUG: Executing: nodetool -Dcom.sun.jndi.rmiURLParsing=legacy snapshot -t medusa-202310270129
error: Snapshot medusa-202310270129 already exists.
-- StackTrace --
java.io.IOException: Snapshot medusa-202310270129 already exists.
    at org.apache.cassandra.service.StorageService.takeSnapshot(StorageService.java:4104)
    at org.apache.cassandra.service.StorageService.takeSnapshot(StorageService.java:3993)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at sun.reflect.misc.Trampoline.invoke(MethodUtil.java:71)
    at jdk.internal.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at java.base/sun.reflect.misc.MethodUtil.invoke(MethodUtil.java:260)
    at java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:112)
    at java.management/com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:46)
    at java.management/com.sun.jmx.mbeanserver.MBeanIntrospector.invokeM(MBeanIntrospector.java:237)
    at java.management/com.sun.jmx.mbeanserver.PerInterface.invoke(PerInterface.java:138)
    at java.management/com.sun.jmx.mbeanserver.MBeanSupport.invoke(MBeanSupport.java:252)
    at java.management/com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.invoke(DefaultMBeanServerInterceptor.java:809)
    at java.management/com.sun.jmx.mbeanserver.JmxMBeanServer.invoke(JmxMBeanServer.java:801)
    at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doOperation(RMIConnectionImpl.java:1466)
    at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl$PrivilegedOperation.run(RMIConnectionImpl.java:1307)
    at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.doPrivilegedOperation(RMIConnectionImpl.java:1399)
    at java.management.rmi/javax.management.remote.rmi.RMIConnectionImpl.invoke(RMIConnectionImpl.java:827)
    at java.base/jdk.internal.reflect.GeneratedMethodAccessor80.invoke(Unknown Source)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at java.rmi/sun.rmi.server.UnicastServerRef.dispatch(UnicastServerRef.java:359)
    at java.rmi/sun.rmi.transport.Transport$1.run(Transport.java:200)
    at java.rmi/sun.rmi.transport.Transport$1.run(Transport.java:197)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.rmi/sun.rmi.transport.Transport.serviceCall(Transport.java:196)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport.handleMessages(TCPTransport.java:562)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run0(TCPTransport.java:796)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.lambda$run$0(TCPTransport.java:677)
    at java.base/java.security.AccessController.doPrivileged(Native Method)
    at java.rmi/sun.rmi.transport.tcp.TCPTransport$ConnectionHandler.run(TCPTransport.java:676)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)

[2023-10-27 01:29:10,762] ERROR: Issue occurred inside handle_backup Name: 202310270129 Error: Command '['nodetool', '-Dcom.sun.jndi.rmiURLParsing=legacy', 'snapshot', '-t', 'medusa-202310270129']' returned non-zero exit status 2.
[2023-10-27 01:29:10,762] INFO: Updated from existing status: 0 to new status: 2 for backup id: 202310270129 
[2023-10-27 01:29:10,762] ERROR: Error occurred during backup: Command '['nodetool', '-Dcom.sun.jndi.rmiURLParsing=legacy', 'snapshot', '-t', 'medusa-202310270129']' returned non-zero exit status 2.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/medusa/backup_node.py", line 199, in handle_backup
    enable_md5_checks_flag, backup_name, config, monitoring)
  File "/usr/local/lib/python3.6/site-packages/medusa/backup_node.py", line 252, in start_backup
    cassandra, node_backup, storage, differential_mode, enable_md5, config, backup_name)
  File "/usr/local/lib/python3.6/site-packages/medusa/backup_node.py", line 296, in do_backup
    with cassandra.create_snapshot(backup_name) as snapshot:
  File "/usr/local/lib/python3.6/site-packages/medusa/cassandra_utils.py", line 450, in create_snapshot
    self.snapshot_service.create_snapshot(tag=tag)
  File "/usr/local/lib/python3.6/site-packages/medusa/service/snapshot/nodetool_snapshot_service.py", line 32, in create_snapshot
    subprocess.check_call(cmd, stdout=subprocess.DEVNULL, universal_newlines=True)
  File "/usr/local/lib64/python3.6/site-packages/gevent/subprocess.py", line 350, in check_call
    raise CalledProcessError(retcode, cmd) # pylint:disable=undefined-variable
subprocess.CalledProcessError: Command '['nodetool', '-Dcom.sun.jndi.rmiURLParsing=legacy', 'snapshot', '-t', 'medusa-202310270129']' returned non-zero exit status 2.

Here is the complete config I am using:

[cassandra]
stop_cmd = systemctl stop cassandra
start_cmd = systemctl start cassandra
config_file = /home/cassandra/cassandra/conf/cassandra.yaml
cql_username = cassandra
cql_password = password

certfile= /home/cassandra/cassandra/certs/cert.pem
usercert= /home/cassandra/cassandra/certs/cert.pem
userkey= /home/cassandra/cassandra/certs/key.pem
sstableloader_ts = /home/cassandra/cassandra/conf/.truststore
sstableloader_tspw = password
sstableloader_ks = /home/cassandra/cassandra/conf/.keystore
sstableloader_kspw = password
sstableloader_bin = /home/cassandra/cassandra/bin/sstableloader
check_running = /home/cassandra/cassandra/bin/nodetool version
resolve_ip_addresses = true
use_sudo = true

[storage]

use_sudo_for_restore = true
host_file_separator = ","
bucket_name = cassandra_backups
storage_provider = local
fqdn = cassandranode1
base_path = /tmp
prefix = cassandra-backup-test-
backup_grace_period_in_days = 10
max_backup_age = 5
max_backup_count = 2

[monitoring]
monitoring_provider = local

[ssh]
username = cassandra
key_file = /home/cassandra/my_private.key
port = 22

[checks]
enable_md5_checks = True

[logging]
enabled = 1
file = medusa.log
level = INFO
format = [%(asctime)s] %(levelname)s: %(message)s
maxBytes = 20000000
backupCount = 5

[grpc]
enabled = False

[kubernetes]
enabled = False

Any suggestions are appreciated, thanks!


rzvoncek commented 1 year ago

Hi @th2zz !

This seems really odd. There is code that skips creating a snapshot if it already exists, so the bug is not trivial.

It's also weird that the node-local log complains about a different backup than the cluster-wide one.

But I should have enough info, so I'll try to reproduce this next week.

rzvoncek commented 1 year ago

Hi again @th2zz !

I've now spent some time trying to reproduce this issue. Sadly (or luckily), I did not manage to.

The only time I've managed to get a "backup already exists" error was when I ctrl+c'ed an execution and immediately started a new one. This was when I did not supply a --backup-name option to the backup-cluster command, which made Medusa generate one. The format for this is backup-YYYYMMDDHHMM, so if this happens twice in one minute, there's going to be a clash.
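
To illustrate the clash (a minimal sketch, not Medusa's actual code; the minute-resolution format comes from the paragraph above, and default_backup_name is a hypothetical helper):

from datetime import datetime

# Hypothetical helper: a default backup name derived from the current
# time at minute resolution, as described above.
def default_backup_name(now=None):
    now = now or datetime.now()
    return now.strftime("%Y%m%d%H%M")

# Two runs started within the same minute get the same name, so the
# second run asks nodetool for a snapshot tag ("medusa-" + name) that
# the interrupted first run already created.
first = default_backup_name(datetime(2023, 10, 27, 1, 29, 5))
second = default_backup_name(datetime(2023, 10, 27, 1, 29, 50))
assert first == second == "202310270129"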

This is still different from the reported issue of already-existing snapshots. The snapshot clearing code we have seems to be fairly robust and clears snapshots even if the backup fails. Having an existing snapshot but not an existing backup is possible, but it requires a really peculiar sequence of events, and as such is really unlikely to happen.
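
For context, the traceback above shows the snapshot being created via a context manager (with cassandra.create_snapshot(backup_name) as snapshot:). A minimal sketch of that cleanup pattern, assuming (without having checked Medusa's actual implementation) that the snapshot is cleared on exit even when the upload raises:

import subprocess
from contextlib import contextmanager

# Hypothetical sketch, not Medusa's actual code.
@contextmanager
def create_snapshot(tag):
    subprocess.check_call(["nodetool", "snapshot", "-t", tag])
    try:
        yield tag
    finally:
        # Runs even if the backup/upload raises, so a leftover snapshot
        # should only survive a hard kill between these two calls.
        subprocess.check_call(["nodetool", "clearsnapshot", "-t", tag])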

I can't tell from the original report what that sequence is. To be frank, I think the log traces come from two separate occasions.

To wrap up, please make sure that:

If you're still seeing the issue, please re-share the logs from the medusa-job-* folder and the trace the backup-cluster command prints to the console.
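
To rule out a leftover snapshot from an earlier interrupted run, you could also check each node with the standard nodetool commands (the tag below is just the one from your logs):

nodetool listsnapshots
nodetool clearsnapshot -t medusa-202310270129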

ap1und1 commented 10 months ago

Hi, I have a very similar issue:

VenkataNaladala94 commented 5 months ago

Hi @ap1und1

From the log entry below, it seems you have provided a hostname for nodetool_host in the medusa.ini file. Because of this, nodetool -Dcom.sun.jndi.rmiURLParsing=legacy --ssl -u USER -pw PASSWORD -h server0101.server.lan -p 7199 snapshot -t medusa-202312132025 is run on all the nodes against server0101.server.lan. That is like running nodetool snapshot on server0101.server.lan three times, which is why it fails with Snapshot medusa-202312132025 already exists.

[2023-12-13 20:25:13,619] INFO: Executing "nodetool -Dcom.sun.jndi.rmiURLParsing=legacy --ssl -u USER -pw PASSWORD -h server0101.server.lan -p 7199 snapshot -t medusa-202312132025" on following nodes ['server0102.server.lan', 'server0101.server.lan', 'server0103.server.lan'] with a parallelism/pool size of 500
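
If that's the cause, leaving nodetool_host unset should make each node run nodetool against itself. A sketch of the relevant [cassandra] section (assuming your other nodetool_* settings stay as they are):

[cassandra]
; with nodetool_host unset, each node snapshots itself
; nodetool_host = server0101.server.lan
nodetool_port = 7199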