Hi, I have two clusters that have been running Medusa successfully for some time. Recently, one of the clusters began encountering issues where it appears to hang for a long time and then reports Exception: Some nodes failed to upload the backup.
At first it appeared that retrying would eventually get through each node, with the next node failing in turn; however, it is now stuck on the second-to-last node (11 of 12), failing in the same manner.
On the node itself, I see the following error in the Medusa logs:
[2024-11-17 03:33:08,972] ERROR: Issue occurred inside handle_backup Name: main-cluster-11-17-2024 Error: Unable to get current placement
[2024-11-17 03:33:08,972] INFO: Updated from existing status: 0 to new status: 2 for backup id: main-cluster-11-17-2024
[2024-11-17 03:33:08,972] ERROR: Error occurred during backup: Unable to get current placement
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/medusa/backup_node.py", line 195, in handle_backup
info = start_backup(storage, node_backup, cassandra, differential_mode, stagger_time, start, mode,
File "/usr/local/lib/python3.10/dist-packages/medusa/backup_node.py", line 226, in start_backup
schema, tokenmap = get_schema_and_tokenmap(cassandra)
File "/usr/local/lib/python3.10/dist-packages/retrying.py", line 56, in wrapped_f
return Retrying(*dargs, **dkw).call(f, *args, **kw)
File "/usr/local/lib/python3.10/dist-packages/retrying.py", line 266, in call
raise attempt.get()
File "/usr/local/lib/python3.10/dist-packages/retrying.py", line 301, in get
six.reraise(self.value[0], self.value[1], self.value[2])
File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
raise value
File "/usr/local/lib/python3.10/dist-packages/retrying.py", line 251, in call
attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
File "/usr/local/lib/python3.10/dist-packages/medusa/backup_node.py", line 278, in get_schema_and_tokenmap
tokenmap = cql_session.tokenmap()
File "/usr/local/lib/python3.10/dist-packages/medusa/cassandra_utils.py", line 178, in tokenmap
dc_rack_pair = self.placement()
File "/usr/local/lib/python3.10/dist-packages/medusa/cassandra_utils.py", line 174, in placement
raise RuntimeError('Unable to get current placement')
RuntimeError: Unable to get current placement
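For what it's worth, the placement error above appears to be raised when Medusa can't resolve the local node's datacenter/rack from the hosts the Python driver discovers. A quick way to see exactly which hosts the driver's control connection finds from this node is the diagnostic sketch below (this is not Medusa code; the 127.0.0.1 contact point and the lack of authentication are assumptions about the setup):
# Diagnostic sketch: list every host the DataStax Python driver discovers
# from this node, with its DC/rack and up/down state. Assumes the local
# node accepts CQL on 127.0.0.1:9042 without authentication.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'], port=9042)
session = cluster.connect()

for host in cluster.metadata.all_hosts():
    print(host.address, host.datacenter, host.rack,
          'up' if host.is_up else 'down')

cluster.shutdown()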
and a warning shortly preceding it:
Traceback (most recent call last):
File "cassandra/cluster.py", line 3234, in cassandra.cluster.Session.add_or_renew_pool.run_add_or_renew_pool
File "cassandra/pool.py", line 406, in cassandra.pool.HostConnection.__init__
File "cassandra/cluster.py", line 1670, in cassandra.cluster.Cluster.connection_factory
File "cassandra/connection.py", line 846, in cassandra.connection.Connection.factory
File "/usr/local/lib/python3.10/dist-packages/cassandra/io/geventreactor.py", line 76, in __init__
self._connect_socket()
File "cassandra/connection.py", line 951, in cassandra.connection.Connection._connect_socket
OSError: [Errno 113] Tried connecting to [('10.0.3.210', 9042)]. Last error: No route to host
What's particularly interesting is that we don't appear to have a node with the address 10.0.3.210 at all, yet when I checked another node's logs, it reported connecting to a host at this address successfully:
[2024-11-17 02:48:16,910] DEBUG: [control connection] Found new host to connect to: 10.0.3.210:9042
[2024-11-17 02:48:16,910] INFO: New Cassandra host <Host: 10.0.3.210:9042 us-east> discovered
[2024-11-17 02:48:16,910] DEBUG: Handling new host <Host: 10.0.3.210:9042 us-east> and notifying listeners
[2024-11-17 02:48:16,910] DEBUG: Done preparing queries for new host <Host: 10.0.3.210:9042 us-east>
[2024-11-17 02:48:16,910] DEBUG: Host 10.0.3.210:9042 is now marked up
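Phantom addresses like this usually come from the cluster's own peer metadata rather than from Medusa itself, since the driver learns hosts from system.peers over the control connection. A hedged way to check whether 10.0.3.210 is a stale peer entry (again just a diagnostic sketch, with the same contact-point and auth assumptions as above):
# Diagnostic sketch: dump the peer table the driver learns hosts from, so a
# stale 10.0.3.210 entry (e.g. from a replaced or decommissioned node) would
# show up here. Worth running per node, since peer tables are local to the
# node serving the control connection.
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

for row in session.execute(
        "SELECT peer, data_center, rack, host_id FROM system.peers"):
    print(row.peer, row.data_center, row.rack, row.host_id)

cluster.shutdown()
If 10.0.3.210 shows up there but not in nodetool status, a stale peer entry could explain both the successful "discovery" on one node and the "No route to host" on this one.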
Another thing I'll note: one of the ways I've been keeping tabs on a backup's progress is by watching network bytes sent in our Datadog dashboard. The reason I previously thought nodes were getting stuck but then succeeding on a subsequent run is that they would spike to ~64 MiB/s for about 12 hours before dying, and then, once they appeared to complete, would only spike for an hour at most. Without really digging in, I assumed this was incremental progress being made thanks to differential mode.
The error logs I posted above are from around 2024-11-17 03:33:08,972, which was actually the very beginning of the network spike on the node that failed, so I wonder what the sustained activity for the following 12 hours actually was.
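As a rough sanity check on that spike (assuming the ~64 MiB/s read off Datadog really is sustained for the full ~12 hours), the arithmetic implies roughly 2.6 TiB pushed off the node before the backup died:
# Back-of-the-envelope: upload volume implied by the Datadog spike. Both
# the rate and the duration are eyeballed from the dashboard, so treat the
# result as an order-of-magnitude figure only.
rate_mib_per_s = 64
duration_hours = 12
total_tib = rate_mib_per_s * 3600 * duration_hours / 1024 ** 2  # MiB -> TiB
print(f"~{total_tib:.1f} TiB")  # ≈ 2.6 TiB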
Appreciate any guidance you can give here!
┆Issue is synchronized with this Jira Story by Unito
┆Issue Number: MED-106