thelastpickle / cassandra-medusa

Apache Cassandra Backup and Restore Tool
Apache License 2.0

S3: medusa getting stuck indefinitely while fetching uploaded object #697

Closed: sandeepmallik closed this issue 9 months ago

sandeepmallik commented 10 months ago


While uploading a backup, Medusa gets stuck indefinitely while fetching the uploaded object. I am not sure what is causing this. We have a cluster of 96 nodes. We recently added 6 nodes, and the backup is failing on 4 of them with the same issue. On the other, older nodes the backup works fine. Nothing has changed at the infrastructure level.

The object exists in S3 but Medusa is unable to fetch/recognise it. Could it be something to do with the checksum after uploading the object? Is there an option to retry getting the object?

[2023-12-18 08:15:17,023] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-TOC.txt (92.000B)
[2023-12-18 08:15:17,024] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-TOC.txt as single part
[2023-12-18 08:15:17,121] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Statistics.db (11.435KiB)
[2023-12-18 08:15:17,121] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Statistics.db as single part
[2023-12-18 08:15:17,126] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-TOC.txt HTTP/1.1" 200 0
[2023-12-18 08:15:17,129] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Summary.db (264.101KiB)
[2023-12-18 08:15:17,129] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Summary.db as single part
[2023-12-18 08:15:17,131] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-CompressionInfo.db (50.417KiB)
[2023-12-18 08:15:17,131] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-CompressionInfo.db as single part
[2023-12-18 08:15:17,147] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Data.db (141.154MiB)
[2023-12-18 08:15:17,147] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db as multi part
[2023-12-18 08:15:17,147] DEBUG: aws s3 cp /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Data.db s3://ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db
[2023-12-18 08:15:17,163] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Index.db (30.050MiB)
[2023-12-18 08:15:17,163] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Index.db as single part
[2023-12-18 08:15:17,543] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Summary.db HTTP/1.1" 200 0
[2023-12-18 08:15:17,544] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-CompressionInfo.db HTTP/1.1" 200 0
[2023-12-18 08:15:17,545] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Statistics.db HTTP/1.1" 200 0
[2023-12-18 08:15:17,548] INFO: Uploading /var/lib/cassandra/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/snapshots/medusa-s3_prod_12_17_2023_23/mc-6212-big-Digest.crc32 (9.000B)
[2023-12-18 08:15:17,548] DEBUG: Uploading 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Digest.crc32 as single part
[2023-12-18 08:15:18,561] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Digest.crc32 HTTP/1.1" 200 0
[2023-12-18 08:15:18,654] DEBUG: https://s3.amazonaws.com:443 "PUT /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Index.db HTTP/1.1" 200 0
[2023-12-18 08:15:19,810] DEBUG: [Storage] Getting object 152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db
[2023-12-18 08:15:19,835] DEBUG: https://s3.amazonaws.com:443 "HEAD /ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db HTTP/1.1" 200 0

[root@ip-172-21-181-249 ~]# aws s3 ls s3://ec4-backup-prod/152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db
2023-12-18 08:15:18  148010978 mc-6212-big-Data.db

adejanovski commented 10 months ago

Hi @sandeepmallik, it seems like you're not running a recent version of Medusa. We have had loads of issues with previous versions that used a combination of libcloud and awscli to handle communications with S3. One of the things we identified but couldn't fix before was that some files were successfully uploaded using the aws s3 cp command (which is what I can see in the logs you pasted) but couldn't be fetched afterwards using libcloud. This is exactly the behavior your cluster is exhibiting, and I can only advise upgrading to the latest version, which fixes this. The storage backend was revamped in v0.16 and now uses boto3 to handle all interactions with S3, providing a rock-steady implementation.
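For reference, this is roughly the kind of boto3 call a boto3-based backend relies on to verify an uploaded object. The snippet is only an illustrative sketch, not Medusa's actual code; the bucket and key are taken from the logs above.

import boto3
from botocore.exceptions import ClientError

# Illustrative sketch: check whether an uploaded object is visible in S3
# with a HEAD request.
s3 = boto3.client("s3", region_name="us-east-1")

def object_exists(bucket, key):
    try:
        s3.head_object(Bucket=bucket, Key=key)
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "404":
            return False
        raise

print(object_exists(
    "ec4-backup-prod",
    "152.2.81.127/data/keyspace/table-21c2779222113427b87bcf21d527e6c5/mc-6212-big-Data.db",
))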

sandeepmallik commented 10 months ago

Hi @adejanovski. We will upgrade to 0.16.3 and see if it works.

sandeepmallik commented 10 months ago

@adejanovski It didn't work. We are using python3.9/pip3.9. After upgrading Medusa to 0.16.3 on the new nodes, I had to install these packages to satisfy Medusa's requirements (the old nodes are on 0.15.0):

pip3.9 install click-aliases==1.0.1
pip3.9 install wheel==0.32.0

Medusa is now having an issue fetching the IAM role credentials:

[2023-12-18 23:00:03,944] INFO: Registered backup id s3_prod_12_18_2023_23
[2023-12-18 23:00:03,945] INFO: Monitoring provider is noop
[2023-12-18 23:00:03,975] INFO: Using credentials CensoredCredentials(access_key_id=A..A, secret_access_key=*, region=us-east-1)
[2023-12-18 23:00:03,975] INFO: Using S3 URL https://ec4-backup-prod.s3.amazonaws.com
[2023-12-18 23:00:04,329] ERROR: Error getting object from s3://ec4-backup-prod/152.2.82.199/s3_prod_12_18_2023_23/meta/schema.cql
[2023-12-18 23:00:04,329] INFO: Starting backup using Stagger: None Mode: differential Name: s3_prod_12_18_2023_23
[2023-12-18 23:00:04,329] INFO: Updated from existing status: -1 to new status: 0 for backup id: s3_prod_12_18_2023_23
[2023-12-18 23:00:04,330] INFO: Saving tokenmap and schema
[2023-12-18 23:04:05,842] INFO: Resolving ip address 152.2.82.199
[2023-12-18 23:04:05,843] INFO: ip address to resolve 152.2.82.199
. . .
[2023-12-18 23:08:02,027] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:08:22,581] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:09:02,706] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:10:22,840] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:12:22,967] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:14:23,133] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:16:23,260] ERROR: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:16:23,261] ERROR: Issue occurred inside handle_backup Name: s3_prod_12_18_2023_23 Error: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:16:23,261] INFO: Updated from existing status: 0 to new status: 2 for backup id: s3_prod_12_18_2023_23
[2023-12-18 23:16:23,261] ERROR: Error occurred during backup: An error occurred (InvalidAccessKeyId) when calling the PutObject operation: The AWS Access Key Id you provided does not exist in our records.
[2023-12-18 23:16:25,452] INFO: Using credentials CensoredCredentials(access_key_id=A..A, secret_access_key=*, region=us-east-1)
[2023-12-18 23:16:25,453] INFO: Using S3 URL https://ec4-backup-prod.s3.amazonaws.com
[2023-12-18 23:24:47,640] INFO: Monitoring provider is noop
[2023-12-18 23:24:47,641] INFO: Starting purge
[2023-12-18 23:24:47,665] INFO: Using credentials CensoredCredentials(access_key_id=A..A, secret_access_key=***, region=us-east-1)
[2023-12-18 23:24:47,665] INFO: Using S3 URL https://ec4-backup-prod.s3.amazonaws.com
[2023-12-18 23:24:47,749] INFO: Listing backups for 152.2.82.199
[2023-12-18 23:33:08,373] ERROR: This error happened during the purge: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.
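A quick way to see which credentials boto3 actually resolves on an affected node, independently of Medusa, is to ask the default session directly; on a node relying on an instance profile this should report the role rather than a static key. This is only a diagnostic sketch.

import boto3

# Diagnostic sketch: inspect the credentials botocore resolves on this node.
session = boto3.Session()
creds = session.get_credentials()
print("resolution method:", creds.method if creds else None)  # e.g. 'iam-role'
print("access key id:", creds.access_key if creds else None)

# Confirm who AWS thinks we are; with an instance profile this prints the role ARN.
print(session.client("sts").get_caller_identity()["Arn"])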

$ sudo medusa -vvv list-backups
[2023-12-19 07:20:16,770] DEBUG: Loading configuration from /etc/medusa/medusa.ini
[2023-12-19 07:20:16,773] INFO: Resolving ip address
[2023-12-19 07:20:16,774] INFO: ip address to resolve 172.21.182.99
[2023-12-19 07:20:16,774] DEBUG: Not resolving 172.21.182.99 as requested
[2023-12-19 07:20:16,774] DEBUG: Logging to file options: LoggingConfig(enabled='1', file='/var/log/medusa/medusa.log', format='[%(asctime)s] %(levelname)s: %(message)s', level='INFO', maxBytes='20000000', backupCount='50')
[2023-12-19 07:20:16,776] DEBUG: Loading storage_provider: s3
[2023-12-19 07:20:16,800] INFO: Using credentials CensoredCredentials(access_key_id=A..T, secret_access_key=**, region=us-east-1)
[2023-12-19 07:20:16,800] INFO: Using S3 URL https://ec4-backup-prod.s3.amazonaws.com
[2023-12-19 07:20:16,800] DEBUG: Connecting to S3
[2023-12-19 07:20:16,901] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:20:16,901] DEBUG: Using selector: GeventSelector
[2023-12-19 07:20:36,956] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:21:17,018] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:22:37,117] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:24:37,244] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:26:37,383] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:28:37,517] DEBUG: [Storage] Listing objects in index/backup_index
[2023-12-19 07:28:37,588] DEBUG: Disconnecting from S3...
Traceback (most recent call last):
  File "/usr/local/bin/medusa", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/decorators.py", line 92, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/medusa/medusacli.py", line 185, in list_backups
    medusa.listing.list_backups(medusaconfig, show_all)
  File "/usr/local/lib/python3.9/site-packages/medusa/listing.py", line 41, in list_backups
    list_backups_w_storage(config, show_all, storage)
  File "/usr/local/lib/python3.9/site-packages/medusa/listing.py", line 45, in list_backups_w_storage
    cluster_backups = get_backups(storage, config, show_all)
  File "/usr/local/lib/python3.9/site-packages/medusa/listing.py", line 26, in get_backups
    cluster_backups = sorted(
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/__init__.py", line 358, in list_cluster_backups
    node_backups = sorted(
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/__init__.py", line 179, in list_node_backups
    backup_index_blobs = self.list_backup_index_blobs()
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/__init__.py", line 270, in list_backup_index_blobs
    return self.storage_driver.list_objects(path)
  File "/usr/local/lib/python3.9/site-packages/retrying.py", line 56, in wrapped_f
    return Retrying(*dargs, **dkw).call(f, *args, **kw)
  File "/usr/local/lib/python3.9/site-packages/retrying.py", line 266, in call
    raise attempt.get()
  File "/usr/local/lib/python3.9/site-packages/retrying.py", line 301, in get
    six.reraise(self.value[0], self.value[1], self.value[2])
  File "/usr/local/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/usr/local/lib/python3.9/site-packages/retrying.py", line 251, in call
    attempt = Attempt(fn(*args, **kwargs), attempt_number, False)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/abstract_storage.py", line 71, in list_objects
    objects = self.list_blobs(prefix=path)
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/abstract_storage.py", line 79, in list_blobs
    objects = loop.run_until_complete(self._list_blobs(prefix))
  File "/usr/lib64/python3.9/asyncio/base_events.py", line 647, in run_until_complete
    return future.result()
  File "/usr/local/lib/python3.9/site-packages/medusa/storage/s3_base_storage.py", line 202, in _list_blobs
    response = self.s3_client.get_paginator('list_objects_v2').paginate(
  File "/usr/local/lib/python3.9/site-packages/botocore/paginate.py", line 479, in build_full_result
    for response in self:
  File "/usr/local/lib/python3.9/site-packages/botocore/paginate.py", line 269, in __iter__
    response = self._make_request(current_kwargs)
  File "/usr/local/lib/python3.9/site-packages/botocore/paginate.py", line 357, in _make_request
    return self._method(**current_kwargs)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 553, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/usr/local/lib/python3.9/site-packages/botocore/client.py", line 1009, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (InvalidAccessKeyId) when calling the ListObjectsV2 operation: The AWS Access Key Id you provided does not exist in our records.

$ aws s3 ls s3://ec4-backup-prod/index/backup_index/
    PRE s3_prod_12_23_2023_11/
    PRE s3_prod_12_23_2023_23/

adejanovski commented 10 months ago

Hi @sandeepmallik, we have a contributed fix for the role issue which you can track here. Anything you can contribute to the discussion or in terms of testing would help a lot to get this merged in. Thanks!

sandeepmallik commented 10 months ago

@adejanovski I have tested the fix on my cluster and it didn't work. I can't contribute at the code level as I am not familiar with Python.

adejanovski commented 10 months ago

Thanks @sandeepmallik, we're having a conversation around using a slightly different fix for this, which would build the S3 client without passing the credentials. I've added #691 as a fix for this issue, which will update its status once we merge it.

sandeepmallik commented 10 months ago

Thanks @adejanovski. As a temporary fix, I will comment out these lines in s3_base_storage.py, since we only use an IAM role:

aws_access_key_id=credentials.access_key,
aws_secret_access_key=credentials.secret_key,
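For anyone following along, the effect of removing those arguments is that boto3 falls back to its default credential chain (environment variables, config files, then the instance/IAM role). A minimal sketch of that client construction, assuming the node's instance profile grants access to the bucket; this is not Medusa's actual code:

import boto3
from botocore.config import Config

# Sketch only: build the S3 client without explicit keys so botocore's default
# credential chain picks up the instance/IAM role automatically.
s3_client = boto3.client(
    "s3",
    region_name="us-east-1",                     # region taken from the logs above
    config=Config(retries={"max_attempts": 5}),  # illustrative retry setting
)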

adejanovski commented 10 months ago

Sure, let us know how this works.

sandeepmallik commented 10 months ago

@adejanovski It didn't work. After the snapshot was created, it got stuck indefinitely. Nothing in the logs, even with verbose logging enabled.

sandeepmallik commented 9 months ago

@adejanovski Tested the backup on ONE node with the new code (https://github.com/thelastpickle/cassandra-medusa/blob/082375b52e6a8f0e376586984444852cb7371c72/medusa/storage/s3_base_storage.py). I upgraded Medusa to 0.17.0 and replaced the above file. The backup worked.

I will test this procedure on the remaining nodes. If it works, then we are good to merge the change.

[2024-01-01 23:06:12,870] INFO: Creating snapshot
[2024-01-02 01:38:58,185] WARNING: Connection pool is full, discarding connection: backup.s3.amazonaws.com
[2024-01-02 01:38:58,267] WARNING: Connection pool is full, discarding connection: backup.s3.amazonaws.com
[2024-01-02 01:38:58,423] WARNING: Connection pool is full, discarding connection: backup.s3.amazonaws.com
[2024-01-02 03:45:54,869] INFO: Using credentials CensoredCredentials(access_key_id=None, secret_access_key=, region=us-east-1)
[2024-01-02 03:45:54,870] INFO: Connecting to s3 with args {}
[2024-01-02 03:46:09,490] INFO: Using credentials CensoredCredentials(access_key_id=None, secret_access_key=, region=us-east-1)
[2024-01-02 03:46:09,491] INFO: Connecting to s3 with args {}
[2024-01-02 04:23:31,587] INFO: Updating backup index
[2024-01-02 04:23:32,709] INFO: Backup done
[2024-01-02 04:23:32,710] INFO: - Started: 2024-01-01 23:00:04
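Side note on the "Connection pool is full" warnings above: they are generally harmless, and as a plain botocore setting (not necessarily exposed through Medusa's configuration) the pool size can be raised when many transfers run concurrently, e.g.:

import boto3
from botocore.config import Config

# Illustrative only: a larger connection pool avoids the "pool is full" warnings
# when many concurrent transfers share one client.
s3 = boto3.client("s3", config=Config(max_pool_connections=50))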

sandeepmallik commented 9 months ago

@adejanovski Fix https://github.com/thelastpickle/cassandra-medusa/pull/691 worked. Tested it on 6 nodes.