thelastpickle / cassandra-medusa

Apache Cassandra Backup and Restore Tool

Unable to backup second dc node in a multi data centre cluster #495

Open | kaushalkumar opened this issue 2 years ago

kaushalkumar commented 2 years ago


Hi - We are trying to back up data from a Cassandra database in a multi data centre cluster using Medusa (local storage mode). The backup is created for dc1 (all nodes), but none of the dc2 nodes are backed up. We have tried different configurations, but without success.

It seems Medusa uses the Cassandra Python driver to discover the nodes in both data centres, but somehow it does not discover the dc2 nodes.
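
To show what we mean by discovery, here is a minimal diagnostic sketch using the DataStax Python driver (this is our own test snippet with a dc1 contact point, not Medusa code):

from cassandra.cluster import Cluster

# Connect via a dc1 contact point and print every host the driver's
# metadata reports, together with its datacenter and rack. If the dc2
# nodes are missing from this output, discovery itself is the problem;
# if they show up, the filtering must happen later, inside Medusa.
cluster = Cluster(contact_points=['node1-dc1.abcd.com'])
cluster.connect()
for host in cluster.metadata.all_hosts():
    print(host.address, host.datacenter, host.rack)
cluster.shutdown()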

Can you please check and let us know what could be missing? If there is a readme or blog post for this use case, please point us to it; that would help us a lot.

Configuration

Cluster:
dc1                   dc2
node1-dc1.abcd.com    node1-dc2.abcd.com
node2-dc1.abcd.com    node2-dc2.abcd.com
node3-dc1.abcd.com    node3-dc2.abcd.com

Version: [cqlsh 5.0.1 | Cassandra 3.11.11 | CQL spec 3.4.4 | Native protocol v4], Medusa [0.13.3]

medusa.ini:

[cassandra]
;stop_cmd = /etc/init.d/cassandra stop
;start_cmd = /etc/init.d/cassandra start
config_file = /etc/cassandra/default.conf/cassandra.yaml
cql_username = cqladmin
cql_password = cqladmin
nodetool_username =  nodetooladmin
nodetool_password =  nodetooladmin
;nodetool_password_file_path = <path to nodetool password file>
;nodetool_host = <host name or IP to use for nodetool>
;nodetool_port = <port number to use for nodetool>
certfile= /etc/cassandra/conf/cassandra_keystore.pem
;usercert= <Client SSL: path to user certificate>
;userkey= <Client SSL: path to user key>
;sstableloader_ts = <Client SSL: full path to truststore>
;sstableloader_tspw = <Client SSL: password of the truststore>
;sstableloader_ks = <Client SSL: full path to keystore>
;sstableloader_kspw = <Client SSL: password of the keystore>
;sstableloader_bin = <Location of the sstableloader binary if not in PATH>

; Enable this to add the '--ssl' parameter to nodetool. The nodetool-ssl.properties file is expected to be in the normal location
;nodetool_ssl = true

; Command run to verify that Cassandra is running on a node. Defaults to "nodetool version"
;check_running = nodetool version

; Enable/disable IP address resolution.
; Disabling this can help when FQDN resolution gives different domain names for local and remote nodes,
; which makes the backup succeed but Medusa sees the nodes as incomplete.
; Defaults to True.
resolve_ip_addresses = True

; When true, almost all commands executed by Medusa are prefixed with `sudo`.
; Does not affect the use_sudo_for_restore setting in the 'storage' section.
; See https://github.com/thelastpickle/cassandra-medusa/issues/318
; Defaults to True
;use_sudo = True

[storage]
storage_provider = local
; storage_provider should be one of "local", "google_storage", "azure_blobs", or the s3_* values from
; https://github.com/apache/libcloud/blob/trunk/libcloud/storage/types.py

; Name of the bucket used for storing backups
bucket_name = cassandra_backups

; JSON key file for service account with access to GCS bucket or AWS credentials file (home-dir/.aws/credentials)
key_file = /etc/medusa/credentials

; Path of the local storage bucket (used only with 'local' storage provider)
base_path = /var/lib/share/cassandra-bkp

; Any prefix used for multitenancy in the same bucket
;prefix = clusterA

;fqdn = <enforce the name of the local node. Computed automatically if not provided.>

; Number of days before backups are purged. 0 means backups don't get purged by age (default)
max_backup_age = 0
; Number of backups to retain. Older backups will get purged beyond that number. 0 means backups don't get purged by count (default)
max_backup_count = 0
; Both thresholds can be defined for backup purge.

; Used to throttle S3 backups/restores:
transfer_max_bandwidth = 50MB/s

; Max number of downloads/uploads. Not used by the GCS backend.
concurrent_transfers = 1

; Size over which S3 uploads will use the awscli with multipart uploads. Defaults to 100MB.
multi_part_upload_threshold = 104857600

; GC grace period for backed up files. Prevents race conditions between purge and running backups
backup_grace_period_in_days = 10

; When not using sstableloader to restore data on a node, Medusa will copy snapshot files from a
; temporary location into the cassandra data directory. Medusa will then attempt to change the
; ownership of the snapshot files so the cassandra user can access them.
; Depending on how users/file permissions are set up on the cassandra instance, the medusa user
; may need elevated permissions to manipulate the files in the cassandra data directory.
;
; This option does NOT replace the `use_sudo` option under the 'cassandra' section!
; See: https://github.com/thelastpickle/cassandra-medusa/pull/399
;
; Defaults to True
use_sudo_for_restore = True

;api_profile = <AWS profile to use>

;host = <Optional object storage host to connect to>
;port = <Optional object storage port to connect to>

; Configures the use of SSL to connect to the object storage system.
;secure = True

;aws_cli_path = <Location of the aws cli binary if not in PATH>

[monitoring]
;monitoring_provider = <Provider used for sending metrics. Currently either of "ffwd" or "local">

[ssh]
;username = <SSH username to use for restoring clusters>
key_file = /root/.ssh/id_rsa
;port = <SSH port to use for restoring clusters. Defaults to port 22.>
;cert_file = <Path of public key signed certificate file to use for authentication. The corresponding private key must also be provided via key_file parameter>

[checks]
;health_check = <Which ports to check when verifying a node restored properly. Options are 'cql' (default), 'thrift', 'all'.>
;query = <CQL query to run after a restore to verify it went OK>
;expected_rows = <Number of rows expected to be returned when the query runs. Not checked if not specified.>
;expected_result = <Comma-separated string representation of values returned by the query. Checks only the 1st row returned, and only if specified>
;enable_md5_checks = <During backups and verify, use md5 calculations to determine file integrity (in addition to size, which is used by default)>

[logging]
; Controls file logging, disabled by default.
enabled = 1
file = medusa.log
level = DEBUG

; Control the log output format
; format = [%(asctime)s] %(levelname)s: %(message)s

; Size over which log file will rotate
; maxBytes = 20000000

; How many log files to keep
; backupCount = 50

[grpc]
; Set to true when running in grpc server mode.
; Allows propagating exceptions instead of exiting the program.
;enabled = False

[kubernetes]
; The following settings are only intended to be configured if Medusa is running in containers, preferably in Kubernetes.
;enabled = False
;cassandra_url = <URL of the management API snapshot endpoint. For example: http://127.0.0.1:8080/api/v0/ops/node/snapshots>

; Enables the use of the management API to create snapshots. Falls back to using Jolokia if not enabled.
;use_mgmt_api = True

Execution

Nodetool Status:

Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  111.222.3.24   447.29 KiB  256          100.0%            5c4c5b3f-4667-45d6-bb11-4d16a71u87ab  rack1
UN  111.222.3.104  429.54 KiB  256          100.0%            3fe14a56-b9ac-s9d3-9d7d-ee762c8b2621  rack1
UN  111.222.3.18   423.58 KiB  256          100.0%            e2b435c0-8ec0-2uc8-8106-987fe921491c  rack1
Datacenter: dc2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens       Owns (effective)  Host ID                               Rack
UN  111.222.3.152  368.61 KiB  256          100.0%            57c8kp0a-cc38-56ys-818b-200m67se9c6c  rack1
UN  111.222.3.20   333.64 KiB  256          100.0%            5e09csw6-20f3-4a35-8a06-3d0349ca5265  rack1
UN  111.222.3.12   328.95 KiB  256          100.0%            b233nk51-52cf-4f9c-bf0c-16f00ida55c0  rack1

Medusa Backup

sudo medusa --verbosity backup-cluster --backup-name=data070820221111 --mode=full

INFO: Monitoring provider is noop
DEBUG: Loading storage_provider: local
DEBUG: Blob node2-dc1.abcd.com/data070820221111/meta/schema.cql was not found in cache.
DEBUG: [Storage] Getting object node2-dc1.abcd.com/data070820221111/meta/schema.cql
INFO: Starting backup data070820221638
DEBUG: This server has systemd: True
WARNING: ssl_storage_port is deprecated as of Apache Cassandra 4.x
DEBUG: Checking placement using dc and rack...
INFO: Resolving ip address 111.222.3.24
INFO: ip address to resolve 111.222.3.24
DEBUG: Resolved 111.222.3.24 to node1-dc1.abcd.com
DEBUG: Checking host 111.222.3.24 against 111.222.3.24/node1-dc1.abcd.com
INFO: Resolving ip address 111.222.3.104
INFO: ip address to resolve 111.222.3.104
DEBUG: Resolved 111.222.3.104 to node2-dc1.abcd.com
INFO: Resolving ip address 111.222.3.24
INFO: ip address to resolve 111.222.3.24
DEBUG: Resolved 111.222.3.24 to node1-dc1.abcd.com
INFO: Resolving ip address 111.222.3.18
INFO: ip address to resolve 111.222.3.18
DEBUG: Resolved 111.222.3.18 to node3-dc1.abcd.com
INFO: Creating snapshots on all nodes
INFO: Executing "nodetool -Dcom.sun.jndi.rmiURLParsing=legacy -u nodetooladmin -pw nodetooladmin snapshot -t medusa-data070820221638" on following nodes ['node2-dc1.abcd.com', 'node1-dc1.abcd.com', 'node3-dc1.abcd.com'] with a parallelism/pool size of 500
DEBUG: Batch #1: Running "nodetool -Dcom.sun.jndi.rmiURLParsing=legacy -u nodetooladmin -pw nodetooladmin snapshot -t medusa-data070820221638" on nodes ['node2-dc1.abcd.com', 'node1-dc1.abcd.com', 'node3-dc1.abcd.com'] parallelism of 3
DEBUG: _run_command with read timeout None
DEBUG: Make client request for host node2-dc1.abcd.com, (host_i, host) in clients: False
DEBUG: Connecting to node2-dc1.abcd.com:22
DEBUG: _run_command with read timeout None
DEBUG: Make client request for host node1-dc1.abcd.com, (host_i, host) in clients: False
DEBUG: Connecting to node1-dc1.abcd.com:22
DEBUG: _run_command with read timeout None
DEBUG: Make client request for host node3-dc1.abcd.com, (host_i, host) in clients: False
DEBUG: Connecting to node3-dc1.abcd.com:22
DEBUG: Starting new session for root@node1-dc1.abcd.com:22
DEBUG: Session started, connecting with existing socket
DEBUG: Proceeding with private key file authentication
DEBUG: Authentication completed successfully - setting session to non-blocking mode
DEBUG: Opening new channel on node1-dc1.abcd.com
DEBUG: Channel open session blocked, waiting on socket..
DEBUG: Polling socket with timeout 100
DEBUG: Starting new session for root@node2-dc1.abcd.com:22
DEBUG: Session started, connecting with existing socket
DEBUG: Proceeding with private key file authentication
DEBUG: Authentication completed successfully - setting session to non-blocking mode
DEBUG: Opening new channel on node2-dc1.abcd.com
DEBUG: Channel open session blocked, waiting on socket..
DEBUG: Polling socket with timeout 100
DEBUG: Starting new session for root@node3-dc1.abcd.com:22
DEBUG: Session started, connecting with existing socket
DEBUG: Proceeding with private key file authentication
DEBUG: Authentication completed successfully - setting session to non-blocking mode
DEBUG: Opening new channel on node3-dc1.abcd.com
DEBUG: Channel open session blocked, waiting on socket..
DEBUG: Polling socket with timeout 100
DEBUG: Polling socket with timeout 100
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34cf0> for stdout
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34cf0> for stderr
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34558> for stdout
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34558> for stderr
DEBUG: Polling socket with timeout 100
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34828> for stdout
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34828> for stderr
DEBUG: Polling socket with timeout 100
DEBUG: Sending EOF on channel <ssh.channel.Channel object at 0x7f2d08a34558>
DEBUG: Waiting for readers, timeout None
DEBUG: Sending EOF on channel <ssh.channel.Channel object at 0x7f2d08a34cf0>
DEBUG: Waiting for readers, timeout None
DEBUG: Sending EOF on channel <ssh.channel.Channel object at 0x7f2d08a34828>
DEBUG: Waiting for readers, timeout None
DEBUG: No data for stdout, waiting
DEBUG: No data for stderr, waiting
DEBUG: No data for stdout, waiting
DEBUG: No data for stderr, waiting
DEBUG: No data for stdout, waiting
DEBUG: No data for stderr, waiting
DEBUG: Polling socket with timeout 100
...
DEBUG: Writing 145 bytes to stdout buffer
...
DEBUG: No data for stdout, waiting
DEBUG: Writing 145 bytes to stdout buffer
...
DEBUG: Polling socket with timeout 100
DEBUG: Writing 145 bytes to stdout buffer
...
DEBUG: No data for stderr, waiting
DEBUG: Writing 63 bytes to stdout buffer
DEBUG: Polling socket with timeout 100
DEBUG: Channel is at EOF trying to read stdout - reader exiting
DEBUG: Polling socket with timeout 100
...
DEBUG: Polling socket with timeout 100
DEBUG: Channel is at EOF trying to read stderr - reader exiting
DEBUG: Writing 63 bytes to stdout buffer
DEBUG: Polling socket with timeout 100
DEBUG: Channel is at EOF trying to read stderr - reader exiting
DEBUG: Polling socket with timeout 100
DEBUG: Polling socket with timeout 100
DEBUG: Channel is at EOF trying to read stdout - reader exiting
DEBUG: Readers finished, closing channel
DEBUG: Closing channel
DEBUG: No data for stderr, waiting
DEBUG: No data for stdout, waiting
DEBUG: Readers finished, closing channel
DEBUG: Closing channel
DEBUG: Polling socket with timeout 100
DEBUG: Polling socket with timeout 100
DEBUG: Channel is at EOF trying to read stderr - reader exiting
DEBUG: Writing 63 bytes to stdout buffer
DEBUG: Polling socket with timeout 100
DEBUG: Channel is at EOF trying to read stdout - reader exiting
DEBUG: Readers finished, closing channel
DEBUG: Closing channel
INFO: Job executing "nodetool -Dcom.sun.jndi.rmiURLParsing=legacy -u nodetooladmin -pw nodetooladmin snapshot -t medusa-data070820221638" ran and finished Successfully on all nodes.
INFO: A snapshot medusa-data070820221638 was created on all nodes.
INFO: Uploading snapshots from nodes to external storage
DEBUG: Running backup on all nodes with the following command mkdir -p /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21; cd /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21 && medusa-wrapper sudo medusa  -vvv backup-node --backup-name data070820221638   --mode full
INFO: Executing "mkdir -p /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21; cd /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21 && medusa-wrapper sudo medusa  -vvv backup-node --backup-name data070820221638   --mode full" on following nodes ['node2-dc1.abcd.com', 'node1-dc1.abcd.com', 'node3-dc1.abcd.com'] with a parallelism/pool size of 1
DEBUG: Batch #1: Running "mkdir -p /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21; cd /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21 && medusa-wrapper sudo medusa  -vvv backup-node --backup-name data070820221638   --mode full" on nodes ['node2-dc1.abcd.com'] parallelism of 1
DEBUG: _run_command with read timeout None
DEBUG: Make client request for host node2-dc1.abcd.com, (host_i, host) in clients: False
DEBUG: Connecting to node2-dc1.abcd.com:22
DEBUG: Starting new session for root@node2-dc1.abcd.com:22
DEBUG: Session started, connecting with existing socket
DEBUG: Proceeding with private key file authentication
DEBUG: Authentication completed successfully - setting session to non-blocking mode
DEBUG: Opening new channel on node2-dc1.abcd.com
DEBUG: Channel open session blocked, waiting on socket..
DEBUG: Polling socket with timeout 100
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34558> for stdout
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34558> for stderr
DEBUG: Polling socket with timeout 100
DEBUG: Sending EOF on channel <ssh.channel.Channel object at 0x7f2d08a34558>
DEBUG: Waiting for readers, timeout None
DEBUG: Polling socket with timeout 100
...
DEBUG: Polling socket with timeout 100
DEBUG: Channel is at EOF trying to read stdout - reader exiting
DEBUG: Channel is at EOF trying to read stderr - reader exiting
DEBUG: Readers finished, closing channel
DEBUG: Closing channel
DEBUG: Batch #1: Running "mkdir -p /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21; cd /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21 && medusa-wrapper sudo medusa  -vvv backup-node --backup-name data070820221638   --mode full" on nodes ['node1-dc1.abcd.com'] parallelism of 1
DEBUG: _run_command with read timeout None
DEBUG: Make client request for host node1-dc1.abcd.com, (host_i, host) in clients: False
DEBUG: Connecting to node1-dc1.abcd.com:22
DEBUG: Starting new session for root@node1-dc1.abcd.com:22
DEBUG: Session started, connecting with existing socket
DEBUG: Proceeding with private key file authentication
DEBUG: Authentication completed successfully - setting session to non-blocking mode
DEBUG: Opening new channel on node1-dc1.abcd.com
DEBUG: Channel open session blocked, waiting on socket..
DEBUG: Polling socket with timeout 100
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34828> for stdout
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a34828> for stderr
DEBUG: Polling socket with timeout 100
DEBUG: Sending EOF on channel <ssh.channel.Channel object at 0x7f2d08a34828>
DEBUG: Waiting for readers, timeout None
DEBUG: No data for stdout, waiting
...
DEBUG: Polling socket with timeout 100
DEBUG: Channel is at EOF trying to read stdout - reader exiting
DEBUG: Channel is at EOF trying to read stderr - reader exiting
DEBUG: Readers finished, closing channel
DEBUG: Closing channel
DEBUG: Batch #1: Running "mkdir -p /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21; cd /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21 && medusa-wrapper sudo medusa  -vvv backup-node --backup-name data070820221638   --mode full" on nodes ['node3-dc1.abcd.com'] parallelism of 1
DEBUG: _run_command with read timeout None
DEBUG: Make client request for host node3-dc1.abcd.com, (host_i, host) in clients: False
DEBUG: Connecting to node3-dc1.abcd.com:22
DEBUG: Starting new session for root@node3-dc1.abcd.com:22
DEBUG: Session started, connecting with existing socket
DEBUG: Proceeding with private key file authentication
DEBUG: Authentication completed successfully - setting session to non-blocking mode
DEBUG: Opening new channel on node3-dc1.abcd.com
DEBUG: Channel open session blocked, waiting on socket..
DEBUG: Polling socket with timeout 100
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a345a0> for stdout
DEBUG: Polling socket with timeout 100
DEBUG: Starting output generator on channel <ssh.channel.Channel object at 0x7f2d08a345a0> for stderr
DEBUG: Polling socket with timeout 100
DEBUG: Sending EOF on channel <ssh.channel.Channel object at 0x7f2d08a345a0>
DEBUG: Waiting for readers, timeout None
DEBUG: No data for stdout, waiting
...
DEBUG: Polling socket with timeout 100
DEBUG: Channel is at EOF trying to read stdout - reader exiting
DEBUG: Channel is at EOF trying to read stderr - reader exiting
DEBUG: Readers finished, closing channel
DEBUG: Closing channel
INFO: Job executing "mkdir -p /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21; cd /tmp/medusa-job-29409a3c-0fa4-4c00-8520-7a82c9859c21 && medusa-wrapper sudo medusa  -vvv backup-node --backup-name data070820221638   --mode full" ran and finished Successfully on all nodes.
INFO: A new backup data070820221638 was created on all nodes.
DEBUG: Emitting metrics
INFO: Backup duration: 23
DEBUG: Done emitting metrics.
INFO: Backup of the cluster done.

Please do let us know if any other information is required.


adejanovski commented 2 years ago

Hi @kaushalkumar,

You did nothing wrong; it's the backup-cluster command that is badly named. Medusa performs backups per datacenter, not per cluster. If you have multiple DCs, you'll have to back them up separately by running the operation in each of them. We will rename it to backup-datacenter very soon to make this more obvious.
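
Concretely, that means running the same command once from a node inside each datacenter, for example (hostnames taken from your configuration above; since backups are stored per node fqdn, as in the blob paths in your log, the same backup name can be shared across DCs):

# from any dc1 node
sudo medusa backup-cluster --backup-name=data070820221111 --mode=full
# from any dc2 node
sudo medusa backup-cluster --backup-name=data070820221111 --mode=full

Each invocation snapshots and uploads only the nodes of the datacenter it runs in.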

kaushalkumar commented 2 years ago

Hi @adejanovski - Thanks for responding; that was helpful.

While waiting for the response, we tried to look into the code. I am not an expert in Python or Medusa, but it seems an enhancement in cassandra_utils (https://github.com/thelastpickle/cassandra-medusa/blob/master/medusa/cassandra_utils.py#L171) might enable this functionality. Perhaps, instead of renaming, it would be worth exploring support for this feature by enhancing/overloading the backup-cluster API with an option to back up at either the DC level or the cluster level, as sketched below.
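
To make the idea concrete, here is a hypothetical sketch of the kind of toggle we have in mind (the function name, the whole_cluster flag, and the host objects are illustrative only, not the actual cassandra_utils code):

# Hypothetical illustration of a per-DC vs. whole-cluster switch.
# 'all_hosts' would come from the driver metadata (each host exposes
# a .datacenter attribute); 'whole_cluster' is an invented flag.
def hosts_to_backup(all_hosts, local_dc, whole_cluster=False):
    if whole_cluster:
        # cluster-level backup: include every node in every datacenter
        return list(all_hosts)
    # current behaviour: keep only the nodes of the local datacenter
    return [h for h in all_hosts if h.datacenter == local_dc]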

Such an enhancement would demand changes in other areas (restore, list, delete, etc.), so I think you are the best person to decide whether this change fits Medusa's design.