thelastpickle / cassandra-medusa

Apache Cassandra Backup and Restore Tool

Node fails S3 upload with RuntimeError: Unable to get current placement #820

Open pdpol opened 1 week ago

pdpol commented 1 week ago


Hi, I have two clusters that have been running Medusa successfully for some time. Recently, one of the clusters began encountering issues where it would appear to hang for a long time and then report Exception: Some nodes failed to upload the backup.

At first it appeared that retrying would eventually get each node through, with the next node then failing on the subsequent run. However, it now appears to be stuck on the second-to-last node (11 of 12), failing in the same manner every time.

On the node itself, I see the following error in the medusa logs:

[2024-11-17 03:33:08,972] ERROR: Issue occurred inside handle_backup Name: main-cluster-11-17-2024 Error: Unable to get current placement
[2024-11-17 03:33:08,972] INFO: Updated from existing status: 0 to new status: 2 for backup id: main-cluster-11-17-2024
[2024-11-17 03:33:08,972] ERROR: Error occurred during backup: Unable to get current placement
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/medusa/backup_node.py", line 195, in handle_backup
    info = start_backup(storage, node_backup, cassandra, differential_mode, stagger_time, start, mode,
  File "/usr/local/lib/python3.10/dist-packages/medusa/backup_node.py", line 226, in start_backup
    schema, tokenmap = get_schema_and_tokenmap(cassandra)
  File "/usr/local/lib/python3.10/dist-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.10/dist-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/usr/local/lib/python3.10/dist-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/lib/python3/dist-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.10/dist-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.10/dist-packages/medusa/backup_node.py", line 278, in get_schema_and_tokenmap
    tokenmap = cql_session.tokenmap()
File "/usr/local/lib/python3.10/dist-packages/medusa/cassandra_utils.py", line 178, in tokenmap
    dc_rack_pair = self.placement()
  File "/usr/local/lib/python3.10/dist-packages/medusa/cassandra_utils.py", line 174, in placement
    raise RuntimeError('Unable to get current placement')
RuntimeError: Unable to get current placement
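
For context on what that traceback is doing (this is a rough sketch under my own assumptions, not Medusa's actual code): the backup first asks the CQL session for the schema and token map, and the token map lookup needs the local node's datacenter/rack placement. My understanding is that the placement lookup boils down to matching the local node's address against the hosts the driver has discovered, roughly like this:

from cassandra.cluster import Cluster  # standard Python driver for Cassandra

# Hypothetical contact point; in Medusa this would come from the node's own config.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()

# The local node reports its own address and placement in system.local.
local = session.execute(
    "SELECT broadcast_address, data_center, rack FROM system.local").one()

placement = None
for host in cluster.metadata.all_hosts():
    # all_hosts() is whatever the driver's control connection discovered,
    # largely from system.peers on the contact node.
    if host.address == str(local.broadcast_address):
        placement = (host.datacenter, host.rack)
        break

if placement is None:
    # If no discovered host matches the local address, there is nothing to
    # return, which is the situation that surfaces as the RuntimeError above.
    raise RuntimeError('Unable to get current placement')

So if the driver's view of the cluster is off (for example, it has discovered hosts that don't exist, or couldn't refresh its metadata), I could see the lookup ending up with nothing to match and failing this way.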

There is also a warning shortly preceding that error:

Traceback (most recent call last):
  File "cassandra/cluster.py", line 3234, in cassandra.cluster.Session.add_or_renew_pool.run_add_or_renew_pool
  File "cassandra/pool.py", line 406, in cassandra.pool.HostConnection.__init__
  File "cassandra/cluster.py", line 1670, in cassandra.cluster.Cluster.connection_factory
  File "cassandra/connection.py", line 846, in cassandra.connection.Connection.factory
  File "/usr/local/lib/python3.10/dist-packages/cassandra/io/geventreactor.py", line 76, in __init__
    self._connect_socket()
  File "cassandra/connection.py", line 951, in cassandra.connection.Connection._connect_socket
OSError: [Errno 113] Tried connecting to [('10.0.3.210', 9042)]. Last error: No route to host

What's particularly interesting is that we don't actually appear to have a node with the address 10.0.3.210, yet when I checked another node's logs, it reported connecting to a node with this address successfully:

[2024-11-17 02:48:16,910] DEBUG: [control connection] Found new host to connect to: 10.0.3.210:9042
[2024-11-17 02:48:16,910] INFO: New Cassandra host <Host: 10.0.3.210:9042 us-east> discovered
[2024-11-17 02:48:16,910] DEBUG: Handling new host <Host: 10.0.3.210:9042 us-east> and notifying listeners
[2024-11-17 02:48:16,910] DEBUG: Done preparing queries for new host <Host: 10.0.3.210:9042 us-east>
[2024-11-17 02:48:16,910] DEBUG: Host 10.0.3.210:9042 is now marked up
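
One way I can think of to track down where 10.0.3.210 comes from is to dump system.peers on each node, since that is where the driver's control connection learns about other hosts (rough sketch; the contact point below is a placeholder):

from cassandra.cluster import Cluster

# Placeholder contact point: point this at each live node in turn.
cluster = Cluster(['<node-address>'])
session = cluster.connect()

# A stale row here would make the driver "discover" a node that no longer exists.
for row in session.execute("SELECT peer, data_center, rack FROM system.peers"):
    print(row.peer, row.data_center, row.rack)

Comparing that output against nodetool status should show whether 10.0.3.210 is a leftover entry from a decommissioned or replaced node.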

Another thing I'll note: one of the ways I've tried to keep tabs on a backup's progress is by watching network bytes sent in our Datadog dashboard. The reason I previously thought that nodes were getting stuck but then succeeding on a subsequent run is that they'd spike up to ~64 MiB/s for about 12 hours before dying, and then, once they appeared to complete, would only spike for an hour at most. Without really digging in, I assumed this was incremental progress being made thanks to differential mode.

The error logs I posted above are from around 2024-11-17 03:33:08, which was actually the very beginning of the network spike on the node that failed, so I'm left wondering what the sustained activity over the following 12 hours actually was.

Appreciate any guidance you can give here!

Issue is synchronized with this Jira Story by Unito. Issue Number: MED-106