@ValkyrieOps, do you get the same error when you ssh into your containers and invoke nodetool snapshot by hand?
Negative. Neither the console output nor the debug logs show any errors when executing nodetool snapshot by hand:
[root@hostname]# ssh cassandra01
root@77f1fb5a78e9:/# nodetool snapshot
Requested creating snapshot(s) for [all keyspaces] with snapshot name [1648560489965] and options {skipFlush=false}
Snapshot directory: 1648560489965
debug.log
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,396 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in parent_repair_history
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,397 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in repair_history
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,397 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in view_build_status
DEBUG [RMI TCP Connection(6)-172.17.0.3] 2022-03-29 13:28:10,398 ColumnFamilyStore.java:931 - Enqueuing flush of compaction_history: 1.119KiB (0%) on-heap, 0.000KiB (0%) off-heap
DEBUG [PerDiskMemtableFlushWriter_0:385] 2022-03-29 13:28:10,400 Memtable.java:456 - Writing Memtable-compaction_history@518640818(0.226KiB serialized bytes, 1 ops, 0%/0% of on/off-heap limit), flushed range = (min(-9223372036854775808), max(9223372036854775807)]
DEBUG [PerDiskMemtableFlushWriter_0:385] 2022-03-29 13:28:10,400 Memtable.java:485 - Completed flushing /var/lib/cassandra/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/md-190-big-Data.db (0.122KiB) for commitlog position CommitLogPosition(segmentId=1647874673958, position=8592235)
DEBUG [MemtableFlushWriter:385] 2022-03-29 13:28:10,403 ColumnFamilyStore.java:1229 - Flushed to [BigTableReader(path='/var/lib/cassandra/data/system/compaction_history-b4dbb7b4dc493fb5b3bfce6e434832ca/md-190-big-Data.db')] (1 sstables, 5.272KiB), biggest 5.272KiB, smallest 5.272KiB
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,405 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in schema_aggregates
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,405 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in schema_triggers
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,406 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in size_estimates
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,407 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in paxos
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,407 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in views_builds_in_progress
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,408 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in batches
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,408 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in schema_keyspaces
DEBUG [RMI TCP Connection(6)-172.17.0.3] 2022-03-29 13:28:10,408 ColumnFamilyStore.java:931 - Enqueuing flush of sstable_activity: 0.590KiB (0%) on-heap, 0.000KiB (0%) off-heap
DEBUG [PerDiskMemtableFlushWriter_0:386] 2022-03-29 13:28:10,410 Memtable.java:456 - Writing Memtable-sstable_activity@827336462(0.031KiB serialized bytes, 4 ops, 0%/0% of on/off-heap limit), flushed range = (min(-9223372036854775808), max(9223372036854775807)]
DEBUG [PerDiskMemtableFlushWriter_0:386] 2022-03-29 13:28:10,410 Memtable.java:485 - Completed flushing /var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/md-302-big-Data.db (0.195KiB) for commitlog position CommitLogPosition(segmentId=1647874673958, position=8592235)
DEBUG [MemtableFlushWriter:386] 2022-03-29 13:28:10,413 ColumnFamilyStore.java:1229 - Flushed to [BigTableReader(path='/var/lib/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/md-302-big-Data.db')] (1 sstables, 5.156KiB), biggest 5.156KiB, smallest 5.156KiB
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,414 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in batchlog
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,414 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in schema_columns
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,415 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in hints
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,415 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in IndexInfo
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,415 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in schema_columnfamilies
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,415 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in schema_functions
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,416 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in built_views
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,416 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in peer_events
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,416 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in range_xfers
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,416 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in peers
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,418 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in transferred_ranges
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,418 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in schema_usertypes
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,418 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in local
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,419 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in available_ranges
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,419 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in prepared_statements
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,420 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in columns
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,421 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in types
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,421 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in indexes
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,421 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in keyspaces
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,422 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in dropped_columns
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,422 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in aggregates
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,422 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in triggers
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,423 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in tables
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,423 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in views
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,424 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in functions
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,424 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in roles
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,425 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in role_members
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,425 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in resource_role_permissons_index
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,425 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in role_permissions
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,426 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in sessions
DEBUG [MemtablePostFlush:193] 2022-03-29 13:28:10,426 ColumnFamilyStore.java:986 - forceFlush requested but everything is clean in events
The problem could come from the usage of -h 10.10.10.10, which makes nodetool access JMX on the external interface instead of going straight to localhost. Could you remove the nodetool_host entry from the config file? Since we're sshing into the containers, I'm pretty sure you shouldn't need it.
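For reference, a minimal [cassandra] section without nodetool_host might look like this. This is a sketch only: the nodetool_username/nodetool_password keys are assumed from the -u/-pw flags visible in the logs below, and the credentials are placeholders.

```ini
[cassandra]
; Leaving nodetool_host unset makes nodetool connect to JMX on localhost.
; nodetool_host = 10.10.10.10
nodetool_username = cassandra
nodetool_password = cassandra
```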
Hmmm, potential progress? I removed the nodetool_host entry and was initially greeted with the same error. I then added a seed target and now get the following error, which seems slightly more promising:
[root@hostname]# medusa backup-cluster --seed-target 10.10.10.10
[2022-03-29 10:00:34,511] INFO: Monitoring provider is noop
[2022-03-29 10:00:34,513] INFO: Starting backup 202203291000
[2022-03-29 10:00:34,819] INFO: Creating snapshots on all nodes
[2022-03-29 10:00:34,820] INFO: Executing "nodetool -u cassandra -pw cassandra snapshot -t medusa-202203291000" on following nodes ['10.10.10.10', '10.10.10.11'] with a parallelism/pool size of 500
[2022-03-29 10:00:36,186] INFO: Job executing "nodetool -u cassandra -pw cassandra snapshot -t medusa-202203291000" ran and finished Successfully on all nodes.
[2022-03-29 10:00:36,186] INFO: A snapshot medusa-202203291000 was created on all nodes.
[2022-03-29 10:00:36,187] INFO: Uploading snapshots from nodes to external storage
[2022-03-29 10:00:36,187] INFO: Executing "mkdir -p /tmp/medusa-job-8d7019d5-cd25-4ddb-a492-419da3ec6bca; cd /tmp/medusa-job-8d7019d5-cd25-4ddb-a492-419da3ec6bca && medusa-wrapper medusa -vvv backup-node --backup-name 202203291000 --mode differential" on following nodes ['10.10.10.10', '10.10.10.11'] with a parallelism/pool size of 1
[2022-03-29 10:00:37,314] ERROR: Job executing "mkdir -p /tmp/medusa-job-8d7019d5-cd25-4ddb-a492-419da3ec6bca; cd /tmp/medusa-job-8d7019d5-cd25-4ddb-a492-419da3ec6bca && medusa-wrapper medusa -vvv backup-node --backup-name 202203291000 --mode differential" ran and finished with errors on following nodes: ['10.10.10.10', '10.10.10.11']
[2022-03-29 10:00:37,315] INFO: [10.10.10.10] [err] mkdir: unrecognized option '--backup-name'
[2022-03-29 10:00:37,315] INFO: 10.10.10.10-stderr: mkdir: unrecognized option '--backup-name'
[2022-03-29 10:00:37,315] INFO: [10.10.10.10] [err] Try 'mkdir --help' for more information.
[2022-03-29 10:00:37,315] INFO: 10.10.10.10-stderr: Try 'mkdir --help' for more information.
[2022-03-29 10:00:37,315] INFO: [10.10.10.11] [err] mkdir: unrecognized option '--backup-name'
[2022-03-29 10:00:37,315] INFO: 10.10.10.11-stderr: mkdir: unrecognized option '--backup-name'
[2022-03-29 10:00:37,316] INFO: [10.10.10.11] [err] Try 'mkdir --help' for more information.
[2022-03-29 10:00:37,316] INFO: 10.10.10.11-stderr: Try 'mkdir --help' for more information.
[2022-03-29 10:00:37,316] ERROR: Some nodes failed to upload the backup.
[2022-03-29 10:00:37,316] ERROR: This error happened during the cluster backup: Some nodes failed to upload the backup.
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/site-packages/medusa/backup_cluster.py", line 64, in orchestrate
    backup.execute(cql_session_provider)
  File "/usr/local/lib/python3.9/site-packages/medusa/backup_cluster.py", line 149, in execute
    self._upload_backup()
  File "/usr/local/lib/python3.9/site-packages/medusa/backup_cluster.py", line 176, in _upload_backup
    raise Exception(err_msg)
Exception: Some nodes failed to upload the backup.
[2022-03-29 10:00:37,317] ERROR: Something went wrong! Attempting to clean snapshots and exit.
[2022-03-29 10:00:37,317] INFO: Executing "nodetool -u cassandra -pw cassandra clearsnapshot -t medusa-202203291000" on following nodes ['10.10.10.10', '10.10.10.11'] with a parallelism/pool size of 1
[2022-03-29 10:00:39,955] INFO: Job executing "nodetool -u cassandra -pw cassandra clearsnapshot -t medusa-202203291000" ran and finished Successfully on all nodes.
[2022-03-29 10:00:39,955] INFO: All nodes successfully cleared their snapshot.
It is progress :) Now you need to check /tmp/medusa-job-8d7019d5-cd25-4ddb-a492-419da3ec6bca/stderr to see what went wrong with the uploads.
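To check whether those work directories were ever created on a node, a small helper like the following can scan for them. This is a hypothetical convenience script, not part of Medusa; the /tmp/medusa-job-* pattern comes from the commands shown in the log above.

```python
import glob
import os


def find_job_stderr(base="/tmp"):
    """Return paths of stderr files under medusa-job work directories, if any."""
    pattern = os.path.join(base, "medusa-job-*", "stderr")
    return sorted(glob.glob(pattern))


if __name__ == "__main__":
    paths = find_job_stderr()
    if not paths:
        print("no medusa-job directories found")
    for path in paths:
        print(path)
```

Running it on each node (e.g. over ssh) quickly shows whether the job directories exist at all or whether the remote command is failing before they are created.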
It does not appear that any /tmp/medusa-job* directories are being created, so there is no stderr to verify (and that is probably the source of the problem altogether). I've confirmed both Docker containers have write permissions to the backup directories and the underlying volume, so I'm not entirely sure where it's getting stuck. I've also confirmed that neither container has any reference to a medusa job anywhere, in case paths were an issue. 🤔
Please add your planning poker estimate with ZenHub @adejanovski
@jsanda, I think we need to investigate more before we can decide whether there's something to estimate here. I'm not sure what's going on exactly, and it's quite an unusual setup for Medusa. @ValkyrieOps, were you able to come up with more observations that could help troubleshooting?
@adejanovski, unfortunately I have not been allocated any time to pursue this further. I think our requirements might be out of scope here until support for something like a jump host could be considered. I'd be happy to provide any other details about our configuration if you think that would be relevant.
@adejanovski I think I found the initial issue ValkyrieOps identified. When the nodetool_host parameter is set, Medusa executes the nodetool snapshot command on every node with the -h parameter pointing at the host that was used to run medusa backup-cluster. Every node therefore tries to snapshot the same host, which is why nodetool complains that the snapshot already exists.
[2022-03-17 08:49:03,381] INFO: Executing "nodetool -u cassandra -pw cassandra -h 10.10.10.10 snapshot -t medusa-202203170849" on following nodes ['10.10.10.10', '10.10.10.11'] with a parallelism/pool size of 500
It should set the -h parameter to the hostname of the node being snapshotted. Not using the nodetool_host parameter would be fine in most use cases. However, in my setup JMX only listens on a remote interface: I run multiple Cassandra Docker containers on one system with host network_mode, so JMX is not listening on 127.0.0.1. Any idea how I could fix this?
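The fix described above could be sketched as follows. This is an illustration, not Medusa's actual code: the command template is rendered once per node, with -h pointing at that node's own address rather than at the single host the orchestrator was given.

```python
def build_snapshot_cmd(node_host, tag, username="cassandra", password="cassandra"):
    """Build a per-node nodetool snapshot command (hypothetical helper).

    Using each node's own address for -h avoids every node snapshotting
    the same host, which is what produced the 'snapshot already exists' error.
    """
    return (f"nodetool -u {username} -pw {password} "
            f"-h {node_host} snapshot -t {tag}")


# One command per node, each pointed at itself:
cmds = [build_snapshot_cmd(h, "medusa-202203170849")
        for h in ["10.10.10.10", "10.10.10.11"]]
```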
[2022-09-20 15:05:54,107] ERROR: Job executing "nodetool -Dcom.sun.jndi.rmiURLParsing=legacy clearsnapshot -t medusa-202209201505" ran and finished with errors on following nodes: ['10.68.171.154', '10.68.171.155', '10.68.171.156', '10.68.171.157', '10.68.171.158', '10.68.171.159']
[2022-09-20 15:05:54,107] INFO: [10.68.171.154] [err] nodetool: Failed to connect to '127.0.0.1:7199' - ConnectException: 'Connection refused (Connection refused)'.
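When JMX is bound only to an external interface, a generic TCP reachability probe (not part of Medusa) can quickly confirm which address port 7199 actually answers on:

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# e.g. compare port_open("127.0.0.1", 7199) with port_open("10.68.171.154", 7199)
# to see whether JMX is listening on localhost or only the external interface.
```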
This indeed seems like a duplicate of https://github.com/thelastpickle/cassandra-medusa/issues/375
OS: Rocky Linux 8.4
Python Version: 3.9.6
Hello, I am trying to stand up a small POC in a potentially unsupported configuration and am running into an error when running medusa backup-cluster on a fresh two-node cluster running in Docker on a single physical host. I did see this issue already created, but as I'm running everything as root during the testing phase, I'm not sure why permissions would come into play here. I've also confirmed that I am able to successfully SSH into each container as if it were a standard VM. Below is the output of the command along with my medusa.ini file. Happy to provide more information about setup details or logs. Cheers

/etc/medusa/medusa.ini