thelastpickle / cassandra-medusa

Apache Cassandra Backup and Restore Tool
Apache License 2.0
257 stars 140 forks

backup-cluster backing up SSTables that have not changed. #775

Open chrisjmiller1 opened 2 months ago

chrisjmiller1 commented 2 months ago


Hi, I'm currently testing Medusa. I've created a 3-node cluster with 160GB of data on each instance, spread across 7 keyspaces with 1 table in each. I took a differential backup using backup-cluster, then ran a compaction on one of the keyspaces on one instance, expecting that only the "new" SSTables would be uploaded on the next backup. Instead, all the SSTables were re-uploaded.

See the screenshot below from the OCI storage bucket, which shows the same SSTables being re-uploaded.

If no keyspaces are compacted, then Medusa recognises that the files are already uploaded and doesn't re-upload them.
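The behaviour I expected can be sketched roughly like this (a hypothetical helper, not Medusa's actual code): a differential backup should compare the local SSTables against the files already recorded in the remote index and upload only the ones that aren't there.

```python
# Hypothetical sketch of differential-backup selection, NOT Medusa's actual code:
# re-upload only SSTables whose (name, size) pair is not already in the remote index.
def files_to_upload(local_files, remote_index):
    """local_files / remote_index: dicts mapping SSTable name -> size in bytes."""
    return {
        name: size
        for name, size in local_files.items()
        if remote_index.get(name) != size
    }

remote = {"nb-1-big-Data.db": 100, "nb-2-big-Data.db": 200}
local = {"nb-2-big-Data.db": 200, "nb-3-big-Data.db": 300}  # nb-1 compacted away

print(files_to_upload(local, remote))  # only the new SSTable should be uploaded
```

With unchanged files the result is empty, which matches what I see when no compaction happens.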

Here are the backup summaries from each instance:

Instance #1
[2024-06-05 10:45:07,441] INFO: Backup done
[2024-06-05 10:45:07,441] INFO: - Started: 2024-06-05 09:30:32

Instance #2
[2024-06-05 09:31:41,530] INFO: Backup done
[2024-06-05 09:31:41,530] INFO: - Started: 2024-06-05 09:30:32

Instance #3
[2024-06-05 09:31:36,955] INFO: Backup done
[2024-06-05 09:31:36,955] INFO: - Started: 2024-06-05 09:30:32

Thanks in advance,

Chris.

rzvoncek commented 2 months ago

Hi @chrisjmiller1 !

This looks very similar to something we recently fixed in https://github.com/thelastpickle/cassandra-medusa/issues/754. I don't think that fix made it into any release yet, so the quickest way to work around this for you is to start using the prefix config option (or CLI argument).
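For reference, the prefix lives in the [storage] section of medusa.ini; the bucket name and prefix value below are placeholders:

```ini
[storage]
storage_provider = s3_compatible
# placeholders -- use your own bucket and a prefix unique to each node/use case
bucket_name = my-medusa-bucket
prefix = my-unique-prefix
```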

But, if you're already using it and this is still happening, then I'm wrong and you're probably hitting some other issue.

MarcLangsdorf commented 2 months ago

I had the same problem and I can confirm that using a prefix fixed it.

chrisjmiller1 commented 2 months ago

Thanks @rzvoncek and @MarcLangsdorf - I will try the prefix option and will let you know how it goes.

chrisjmiller1 commented 2 months ago

Thanks folks, that worked. Initially I added the same prefix for each instance (in the medusa.ini config file) and had the same issue. After adding an instance-specific prefix for each instance, it worked.

chrisjmiller1 commented 2 months ago

Hi @rzvoncek and @MarcLangsdorf, I'm now trying to restore a backup using this configuration but am getting the following errors.

Command used: `nohup sudo medusa restore-node --backup-name 20240612-instance-6 --temp-dir /data/metadata/medusa_tmp &`

Contents of nohup.out:

Traceback (most recent call last):
  File "/usr/local/bin/medusa", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 92, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/medusa/medusacli.py", line 275, in restore_node
    medusa.restore_node.restore_node(medusaconfig, Path(temp_dir), backup_name, in_place, keep_auth, seeds,
  File "/usr/local/lib/python3.9/site-packages/medusa/restore_node.py", line 50, in restore_node
    restore_node_locally(config, temp_dir, backup_name, in_place, keep_auth, seeds, storage,
  File "/usr/local/lib/python3.9/site-packages/medusa/restore_node.py", line 90, in restore_node_locally
    download_data(config.storage, node_backup, fqtns_to_restore, destination=download_dir)
  File "/usr/local/lib/python3.9/site-packages/medusa/download.py", line 56, in download_data
    storage.storage_driver.download_blobs(srcs, dst)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/abstract_storage.py", line 150, in download_blobs
    loop.run_until_complete(self._download_blobs(srcs, dest))
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/abstract_storage.py", line 158, in _download_blobs
    await asyncio.gather(*coros)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 294, in _download_blob
    await future
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 297, in __download_blob
    blob = self.__stat_blob(src)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 339, in __stat_blob
    resp = self.s3_client.head_object(Bucket=self.bucket_name, Key=key)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

Contents of medusa.log

[2024-06-13 09:48:46,571] DEBUG: Loading storage_provider: s3_compatible
[2024-06-13 09:48:46,603] DEBUG: Setting AWS credentials file to /etc/medusa/credentials
[2024-06-13 09:48:46,660] INFO: Using credentials CensoredCredentials(access_key_id=8..5, secret_access_key=*****, region=us-ashburn-1)
[2024-06-13 09:48:46,660] INFO: Connecting to s3_compatible with args {'endpoint_url': 'https://xxxxxxxx.compat.objectstorage.us-ashburn-1.oraclecloud.com:443', 'verify': False}
[2024-06-13 09:48:47,251] DEBUG: [Storage] Listing objects in xxxxxxxx/index/backup_index
[2024-06-13 09:48:47,252] DEBUG: Using selector: GeventSelector
[2024-06-13 09:48:47,428] DEBUG: Blob xxxxxxxx/xxxxxxxx/20240612-instance-1/meta/schema.cql was not found in cache.
[2024-06-13 09:48:47,429] DEBUG: [Storage] Getting object xxxxxxxx/xxxxxxxx/20240612-instance-1/meta/schema.cql
[2024-06-13 09:48:47,443] DEBUG: [Storage] Reading blob xxxxxxxx/index/backup_index/20240612-instance-1/tokenmap_xxxxxxxx.json...
[2024-06-13 09:48:47,542] DEBUG: [Storage] Reading blob xxxxxxxx/index/backup_index/20240612-instance-2/tokenmap_xxxxxxxx.json...
[2024-06-13 09:48:47,580] DEBUG: [Storage] Reading blob xxxxxxxx/index/backup_index/20240612-instance-3/tokenmap_xxxxxxxx.json...
[2024-06-13 09:48:47,615] DEBUG: [Storage] Reading blob xxxxxxxx/index/backup_index/20240612-instance-5/tokenmap_xxxxxxxx.json...
[2024-06-13 09:48:47,666] DEBUG: [Storage] Reading blob xxxxxxxx/index/backup_index/20240612-instance-6/tokenmap_xxxxxxxx.json...
[2024-06-13 09:48:47,695] DEBUG: Disconnecting from S3...
[2024-06-13 09:49:33,757] DEBUG: Loading storage_provider: s3_compatible
[2024-06-13 09:49:33,773] DEBUG: Setting AWS credentials file to /etc/medusa/credentials
[2024-06-13 09:49:33,789] INFO: Using credentials CensoredCredentials(access_key_id=8..5, secret_access_key=*****, region=us-ashburn-1)
[2024-06-13 09:49:33,789] INFO: Connecting to s3_compatible with args {'endpoint_url': 'xxxxxxxx3', 'verify': False}
[2024-06-13 09:49:33,985] DEBUG: [Storage] Getting object xxxxxxxx/20240612-instance-6/meta/differential
[2024-06-13 09:49:33,986] DEBUG: Using selector: GeventSelector
[2024-06-13 09:49:34,102] DEBUG: [S3 Storage] Object xxxxxxxx/20240612-instance-6/meta/differential not found
[2024-06-13 09:49:34,102] DEBUG: Blob xxxxxxxx/xxxxxxxx/20240612-instance-6/meta/schema.cql was not found in cache.
[2024-06-13 09:49:34,103] DEBUG: [Storage] Getting object xxxxxxxx/xxxxxxxx/20240612-instance-6/meta/schema.cql
[2024-06-13 09:49:34,128] DEBUG: [Storage] Getting object xxxxxxxx/xxxxxxxx/20240612-instance-6/meta/manifest.json
[2024-06-13 09:49:34,141] DEBUG: [Storage] Reading blob xxxxxxxx/xxxxxxxx/20240612-instance-6/meta/manifest.json...
[2024-06-13 09:49:34,579] DEBUG: This server has systemd: True
[2024-06-13 09:49:35,140] INFO: Downloading data from backup to /data/metadata/medusa_tmp/medusa-restore-703f25f5-bf55-4ac5-a9db-59f6643fb91e
[2024-06-13 09:49:35,142] DEBUG: Download size: 170170351296, available space: 362635673600
[2024-06-13 09:49:35,142] DEBUG: Loading storage_provider: s3_compatible
[2024-06-13 09:49:35,153] DEBUG: Setting AWS credentials file to /etc/medusa/credentials
[2024-06-13 09:49:35,156] INFO: Using credentials CensoredCredentials(access_key_id=8..5, secret_access_key=*****, region=us-ashburn-1)
[2024-06-13 09:49:35,156] INFO: Connecting to s3_compatible with args {'endpoint_url': 'xxxxxxxx', 'verify': False}
[2024-06-13 09:49:35,169] DEBUG: There is nothing to download for system.IndexInfo-9f5c6374d48532299a0a5094af9ad1e3
[2024-06-13 09:49:35,169] DEBUG: There is nothing to download for system.batches-919a4bc57a333573b03e13fc3f68b465
[2024-06-13 09:49:35,169] DEBUG: There is nothing to download for system.paxos-b7b7f0c2fd0a34108c053ef614bb7c2d
[2024-06-13 09:49:35,169] DEBUG: There is nothing to download for system.paxos_repair_history-ecb8666740b23316bb91e612c8047457
[2024-06-13 09:49:35,169] DEBUG: Downloading  8 files to /data/metadata/medusa_tmp/medusa-restore-703f25f5-bf55-4ac5-a9db-59f6643fb91e/system/local-7ad54392bcdd35a684174e047860b377
[2024-06-13 09:49:35,274] DEBUG: Disconnecting from S3...
[2024-06-13 09:49:35,275] DEBUG: Disconnecting from S3...

Any ideas? Thanks. Chris.

rzvoncek commented 2 months ago

Hi @chrisjmiller1. This looks like you're doing a restore with the prefix present. Since the prefix is what separates data on the storage backend, it's very likely Medusa simply doesn't see a backup that was done with a different (or no) prefix.
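Roughly speaking (the paths here are illustrative, not exact), the prefix becomes an extra top-level segment in the bucket, so backups taken without it live under a different tree:

```
s3://<bucket>/<fqdn>/...            # backups taken without a prefix
s3://<bucket>/<prefix>/<fqdn>/...   # backups taken with a prefix
```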

This should get resolved automatically once you do a backup with the new prefix (although that backup will have to upload everything). A restore after doing this should be fine.

We could, of course, go tinker with how files are organised in the storage, but that would be manual and error-prone.

chrisjmiller1 commented 2 months ago

Hi @rzvoncek, yes, the restore was completed with the prefix present as set in the medusa.ini, and all the backups were also taken with the same prefix, so I would have expected things to resolve automatically. Is there anything else that could cause this behaviour? Note that I also cleared everything in the bucket after changing the prefix parameter, to ensure the index was clean. Chris.

rzvoncek commented 2 months ago

Ugh, this is starting to look ugly. But let's dig deeper.

From the log snippet I see two executions happening. There are two lines:

[2024-06-13 09:48:46,571] DEBUG: Loading storage_provider: s3_compatible
[2024-06-13 09:49:33,757] DEBUG: Loading storage_provider: s3_compatible

They differ a bit in how their log lines look, but it's hard for me to reverse-engineer what precisely they were doing.

Could you please tell me what two commands you ran to cause those logs? And what exact version (or commit) of Medusa are you running?

chrisjmiller1 commented 2 months ago

Hi @rzvoncek, apologies, I might have confused things. The first command was a list-backups, the second was the restore:

850  13/06/24 09:48:30 medusa list-backups
851  13/06/24 09:49:18 nohup sudo medusa restore-node --backup-name 20240612-instance-6 --temp-dir /data/metadata/medusa_tmp &

Chris.

rzvoncek commented 2 months ago

Okay, and does the list command see the backups you'd expect?

chrisjmiller1 commented 2 months ago

Hi @rzvoncek, yes, and I used the latest backup name in the restore command.

chrisjmiller1 commented 2 months ago

BTW, I'm using 0.21.0.

rzvoncek commented 2 months ago

Hi @chrisjmiller1, I'm having a tough time reproducing this. Aside from actually deleting an object from the storage, I cannot get this (a 404 error) to happen.

Are you hitting this consistently? Could you come up with a sequence of steps to reproduce it deterministically?

Another idea to narrow this down is to add a try/except, with a log line printing the key being accessed, around "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 339, in __stat_blob. Then we can at least see what precisely is being accessed and, hopefully, understand why it is not being found.

chrisjmiller1 commented 2 months ago

Hi @rzvoncek, could you provide the complete code you would like me to update in s3_base_storage.py? Thanks.

rzvoncek commented 2 months ago

The __stat_blob() method currently looks like this (line 343):

    def __stat_blob(self, key):
        resp = self.s3_client.head_object(Bucket=self.bucket_name, Key=key)
        item_hash = resp['ETag'].replace('"', '')
        return AbstractBlob(key, int(resp['ContentLength']), item_hash, resp['LastModified'], None)

Let's make it something like:

    def __stat_blob(self, key):
        try:
            logging.info('[S3 Storage] Stat-ting s3://{}/{}'.format(self.bucket_name, key))
            resp = self.s3_client.head_object(Bucket=self.bucket_name, Key=key)
            item_hash = resp['ETag'].replace('"', '')
            return AbstractBlob(key, int(resp['ContentLength']), item_hash, resp['LastModified'], None)
        except botocore.exceptions.ClientError as e:
            logging.error(f'Caught exception: {e}')
            raise e

chrisjmiller1 commented 2 months ago

Hi @rzvoncek, apologies for the delay. Here's the output from nohup.out:

[2024-06-18 15:06:57,772] ERROR: Caught exception: An error occurred (404) when calling the HeadObject operation: Not Found
Traceback (most recent call last):
  File "/usr/local/bin/medusa", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 92, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/medusa/medusacli.py", line 275, in restore_node
    medusa.restore_node.restore_node(medusaconfig, Path(temp_dir), backup_name, in_place, keep_auth, seeds,
  File "/usr/local/lib/python3.9/site-packages/medusa/restore_node.py", line 50, in restore_node
    restore_node_locally(config, temp_dir, backup_name, in_place, keep_auth, seeds, storage,
  File "/usr/local/lib/python3.9/site-packages/medusa/restore_node.py", line 90, in restore_node_locally
    download_data(config.storage, node_backup, fqtns_to_restore, destination=download_dir)
  File "/usr/local/lib/python3.9/site-packages/medusa/download.py", line 56, in download_data
    storage.storage_driver.download_blobs(srcs, dst)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/abstract_storage.py", line 150, in download_blobs
    loop.run_until_complete(self._download_blobs(srcs, dest))
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/abstract_storage.py", line 158, in _download_blobs
    await asyncio.gather(*coros)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 294, in _download_blob
    await future
  File "/usr/lib64/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 297, in __download_blob
    blob = self.__stat_blob(src)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 349, in __stat_blob
    raise e
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 344, in __stat_blob
    resp = self.s3_client.head_object(Bucket=self.bucket_name, Key=key)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 535, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 980, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

And from medusa.log

[2024-06-18 15:06:57,727] INFO: Using credentials CensoredCredentials(access_key_id=8..5, secret_access_key=*****, region=us-ashburn-1)
[2024-06-18 15:06:57,727] INFO: Connecting to s3_compatible with args {'endpoint_url': 'https://xxxxxxxx.compat.objectstorage.us-ashburn-1.oraclecloud.com:443', 'verify': False}
[2024-06-18 15:06:57,731] DEBUG: There is nothing to download for system.peer_events-59dfeaea8db2334191ef109974d81484
[2024-06-18 15:06:57,731] DEBUG: There is nothing to download for system.transferred_ranges_v2-1ff78f1a7df13a2aa9986f4932270af5
[2024-06-18 15:06:57,731] DEBUG: There is nothing to download for system.view_builds_in_progress-6c22df66c3bd3df6b74d21179c6a9fe9
[2024-06-18 15:06:57,731] DEBUG: There is nothing to download for system.top_partitions-7e5a361c317c351fb15fffd8afd3dd4b
[2024-06-18 15:06:57,731] DEBUG: Downloading  8 files to /data/metadata/medusa_tmp/medusa-restore-41d947a7-a576-4314-8e13-2c9197a57bd1/system/local-7ad54392bcdd35a684174e047860b377
[2024-06-18 15:06:57,732] INFO: [S3 Storage] Stat-ting s3://tfdev-metbackup/mxiad-tfdevmet01/mxiad-tfdevmet01/mxiad-tfdevmet01/data/system/local-7ad54392bcdd35a684174e047860b377/nb-56-big-TOC.txt
[2024-06-18 15:06:57,772] ERROR: Caught exception: An error occurred (404) when calling the HeadObject operation: Not Found
[2024-06-18 15:06:57,773] DEBUG: Disconnecting from S3...
[2024-06-18 15:06:57,773] DEBUG: Disconnecting from S3...

rzvoncek commented 2 months ago

Okay, so now we have a URI to check:

s3://tfdev-metbackup/mxiad-tfdevmet01/mxiad-tfdevmet01/mxiad-tfdevmet01/data/system/local-7ad54392bcdd35a684174e047860b377/nb-56-big-TOC.txt

Does this actually exist?

chrisjmiller1 commented 2 months ago

Hi @rzvoncek. It appears that the prefix is being added to the URI one time too many; the second URI below is the correct one.

s3://tfdev-metbackup/mxiad-tfdevmet01/mxiad-tfdevmet01/mxiad-tfdevmet01/data/system/local-7ad54392bcdd35a684174e047860b377/nb-56-big-TOC.txt
s3://tfdev-metbackup/mxiad-tfdevmet01/mxiad-tfdevmet01/data/system/local-7ad54392bcdd35a684174e047860b377/nb-56-big-TOC.txt
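For illustration, the doubled segment is exactly what you'd get if the prefix were prepended once by the caller and once more by the key-building code. This is a hypothetical sketch (build_key is made up, not Medusa's actual function):

```python
# Hypothetical sketch, NOT Medusa's code: shows how passing an already-prefixed
# path into a key builder that also prepends the prefix doubles the segment.
def build_key(prefix, fqdn, path):
    return "/".join(part for part in (prefix, fqdn, path) if part)

prefix = "mxiad-tfdevmet01"   # same value as the node FQDN in this setup
fqdn = "mxiad-tfdevmet01"
relative = "data/system/local-7ad54392bcdd35a684174e047860b377/nb-56-big-TOC.txt"

correct = build_key(prefix, fqdn, relative)
doubled = build_key(prefix, fqdn, prefix + "/" + relative)  # the suspected bug

print(doubled)  # has an extra mxiad-tfdevmet01/ segment, matching the failing key
```

Note that with prefix and FQDN being the same string here, it's hard to tell from the key alone which layer added the extra copy.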

Chris.

rzvoncek commented 2 months ago

I wasn't able to work out how the extra prefix would get there. I've tried the following on both 0.21.0 and master:

medusa --prefix prefixCli --config-file medusa-test.ini backup --backup-name foo
medusa --config-file medusa-test.ini backup --backup-name foo

(the config file had a prefix set in it as well).

The behaviour I observed was that:

I'll try to debug this more, but I'll only get to it later.

chrisjmiller1 commented 2 months ago

Hi @rzvoncek, thanks for the update. FYI, I'm only using the prefix in the config file, not on the command line.