Some technical questions I got about the tool

ftarrega commented 5 months ago

Hey team. I got in touch with Datastax support on their webpage chat cause I couldn't find an email or forum to ask you questions. I felt opening an issue would be frowned upon for such request, but I've been told by them this is indeed the place so here I am =o)

I got a few questions about Cassandra Medusa I'd appreciate answered as we are trying to implement it in our project, which will lead to ultimately using medusa for production clusters backups and restores:

Is Medusa ready for PROD use? We were wondering about that cause medusa is still version 0 and hasn't reached v1 yet and in such cases it's usually because a tool isn't mature enough yet, right? Is that the case?
Do you have a timeline for when you expect Medusa v1.X.X will be generally available?
How does medusa handle the snapshots as the backup is getting created? Snapshots take a lot of storage space and are saved to the data folder, in practice competing for the resource with the actual data. My concern is it filling up the file system, having a compaction kick in and the server start failing. So what's the flow? Will it take all snapshots and clear them only at the end of the backup or does it do it table for table?
There's a rule of thumb you learn from researching that you should keep from 50% to 75% free space in storage for Cassandra smooth operation. Because of snapshots and the time a backup can take to finish does it mean one should need more than 50% free disk space in order to not trouble the node when medusa is running?
Currently we are clocking about 10TB backup at about 28h for a 4-nodes cluster. transfer_max_bandwidth = 1000MB/s and concurrent_transfers = 16 ( it was set to 4 before, we were getting lots of "WARNING: Connection pool is full, discarding connection:" messages, so we changed to 16. We are still getting those messages but less than before, it seems). It didn't seem to improve performance, so maybe the issue is because snapshots are in fact taking the brunt of the backup time and we need to tweak Cassandra somehow? Any advice on how we can fine-tune a first time --mode=full backup?

Thanks in advance and best regards

adejanovski commented 5 months ago

Is Medusa ready for PROD use? We were wondering about that cause medusa is still version 0 and hasn't reached v1 yet and in such cases it's usually because a tool isn't mature enough yet, right? Is that the case? Do you have a timeline for when you expect Medusa v1.X.X will be generally available?

It is. We never got to a point where it felt like moving to v1 although it's been used for years in production by many companies. The current thinking is that we're going to remove the "full" backup mode and keep the "differential" only (although we recently realized what Medusa does is "synthetic full backups"). The removal of a backup mode feels like a good moment to finally move to v1. This should happen within a few months.

How does medusa handle the snapshots as the backup is getting created? Snapshots take a lot of storage space and are saved to the data folder, in practice competing for the resource with the actual data. My concern is it filling up the file system, having a compaction kick in and the server start failing. So what's the flow? Will it take all snapshots and clear them only at the end of the backup or does it do it table for table?

Snapshots are hardlinks, they only take space if the original sstable file is deleted. As we need to take synchronized backups, if you start the backup using medusa backup-cluster, a snapshot will be taken on all nodes, and only after will the uploads start. They're cleared only after the end of the backup for all tables. We use nodetool to create and clear the snapshots, which is why it's done for all files at once. Your idea of deleting files after their upload is very interesting though and deserves to be investigated. It would limit the odds of running out of disk space on long running backups.

There's a rule of thumb you learn from researching that you should keep from 50% to 75% free space in storage for Cassandra smooth operation. Because of snapshots and the time a backup can take to finish does it mean one should need more than 50% free disk space in order to not trouble the node when medusa is running?

It depends a lot on the write/compaction flow (which will have an impact on whether or not the snapshots will be persisted) and the duration of backups. There's no general rule we can come up with here.

Currently we are clocking about 10TB backup at about 28h for a 4-nodes cluster. transfer_max_bandwidth = 1000MB/s and concurrent_transfers = 16 ( it was set to 4 before, we were getting lots of "WARNING: Connection pool is full, discarding connection:" messages, so we changed to 16. We are still getting those messages but less than before, it seems). It didn't seem to improve performance, so maybe the issue is because snapshots are in fact taking the brunt of the backup time and we need to tweak Cassandra somehow? Any advice on how we can fine-tune a first time --mode=full backup?

You don't need to run "full" on the first backup. It's also inefficient if you switch to differential afterwards because the first differential will re-upload everything. We recently released a new version which heals broken differential backups, which was really the only reason why full could be considered safer. Avoiding upload of already uploaded files is really the best optimization you can get.

Do you run backups through the command line and using the medusa backup-cluster command? Any reason why you want to use full as backup mode? Which version of Medusa are you using? What throughput do you effectively observe during backups?

rzvoncek commented 5 months ago

Hi @ftarrega ! I'd also like to know what's your data churn rate, meaning how much of new data do you foresee getting between two backups?

Aside from the first backup, Medusa will only upload things that were not present in the previous backup (just as Alex mentioned). If you don't have much data coming in, nor too much compaction activity, this might mean the backups will actually not be very long. So the lifetime of the snapshots (and the disk usage associated with them) might be smaller than you think.

adejanovski commented 5 months ago

@rzvoncek, I really think this is a good idea. We've seen clusters running out of disk space due to snapshots being persisted as backups were running and the change should be fairly light in our code.

ftarrega commented 5 months ago

Hey, guys. Hope you're all well.

First of all thanks a lot for taking the time and effort to answer me. I'm still reviewing it, but I'd like to answer to Alex's questions right away, so:

Do you run backups through the command line and using the medusa backup-cluster command? Through the command-line yes, but not with backup-cluster. As a policy in the company nodes are not set up for remote connection between themselves. As far as I could gather from the code, backup-cluster requires pssh and allowing it to work could be seem as a security risk within the company, so we are triggering medusa backup with the same backup name in each of the cluster nodes

Any reason why you want to use full as backup mode? This could be from my misunderstanding on how to properly use the tool, but I learned this from an article you wrote on medusa in 2019 (Medusa - Spotify’s Apache Cassandra backup tool is now open source (thelastpickle.com)https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html): [cid:6fadc31d-bd59-40f9-b37a-afb95537198d] I figured for a first run ever on any cluster one should run a full backup and then keep track of the changes with differential backups as a follow-up. I thought if we just went for a differential one from the get go I'd be telling medusa not to backup all the data in the cluster and I'd be at fault for the missing data

Which version of Medusa are you using? I see you guys have just released v0.20.0 and v0.20.1, but when I started this endeavour v0.19.1 was the latest available. That's the version we are currently using. Do you advise we upgrade? I suppose there's some relearning I'll have to do in such case, cause I don't see the setup.py file in the repo anymore for these latest versions. I don't know much about python, but I was able to make medusa work from the Linux cassandra account using miniconda to setup a python 3.8.18 env (couldn't figure pipenv quickly enough :o), run the pip download and pip install, then python setup.py install, finally making python see all dependencies by adding an export PYTHONPATH to the bash_profile of cassandra

What throughput do you effectively observe during backups? Here's what I could draw from AWS metric system.net.bytes_sent. I'd say that, on average, it clocked at no more than 50MiB/s:

[cid:21affe7c-6e05-46d0-a209-e1760af0cb90]

@rzvoncekhttps://github.com/rzvoncek, I understand the churn will be somewhat high, cause the data holds B2B and B2C customers. We tend to have changes coming in every day as CRUD requests in many different tables across keyspaces

Thanks again and best regards

From: Alexander Dejanovski @.> Sent: Thursday, April 4, 2024 2:40 AM To: thelastpickle/cassandra-medusa @.> Cc: ftarrega @.>; Author @.> Subject: Re: [thelastpickle/cassandra-medusa] Some technical questions I got about the tool (Issue #734)

Is Medusa ready for PROD use? We were wondering about that cause medusa is still version 0 and hasn't reached v1 yet and in such cases it's usually because a tool isn't mature enough yet, right? Is that the case? Do you have a timeline for when you expect Medusa v1.X.X will be generally available?

It is. We never got to a point where it felt like moving to v1 although it's been used for years in production by many companies. The current thinking is that we're going to remove the "full" backup mode and keep the "differential" only (although we recently realized what Medusa does is "synthetic full backups"). The removal of a backup mode feels like a good moment to finally move to v1. This should happen within a few months.

How does medusa handle the snapshots as the backup is getting created? Snapshots take a lot of storage space and are saved to the data folder, in practice competing for the resource with the actual data. My concern is it filling up the file system, having a compaction kick in and the server start failing. So what's the flow? Will it take all snapshots and clear them only at the end of the backup or does it do it table for table?

Snapshots are hardlinks, they only take space if the original sstable file is deleted. As we need to take synchronized backups, if you start the backup using medusa backup-cluster, a snapshot will be taken on all nodes, and only after will the uploads start. They're cleared only after the end of the backup for all tables. We use nodetool to create and clear the snapshots, which is why it's done for all files at once. Your idea of deleting files after their upload is very interesting though and deserves to be investigated. It would limit the odds of running out of disk space on long running backups.

There's a rule of thumb you learn from researching that you should keep from 50% to 75% free space in storage for Cassandra smooth operation. Because of snapshots and the time a backup can take to finish does it mean one should need more than 50% free disk space in order to not trouble the node when medusa is running?

It depends a lot on the write/compaction flow (which will have an impact on whether or not the snapshots will be persisted) and the duration of backups. There's no general rule we can come up with here.

Currently we are clocking about 10TB backup at about 28h for a 4-nodes cluster. transfer_max_bandwidth = 1000MB/s and concurrent_transfers = 16 ( it was set to 4 before, we were getting lots of "WARNING: Connection pool is full, discarding connection:" messages, so we changed to 16. We are still getting those messages but less than before, it seems). It didn't seem to improve performance, so maybe the issue is because snapshots are in fact taking the brunt of the backup time and we need to tweak Cassandra somehow? Any advice on how we can fine-tune a first time --mode=full backup?

You don't need to run "full" on the first backup. It's also inefficient if you switch to differential afterwards because the first differential will re-upload everything. We recently released a new version which heals broken differential backups, which was really the only reason why full could be considered safer. Avoiding upload of already uploaded files is really the best optimization you can get.

Do you run backups through the command line and using the medusa backup-cluster command? Any reason why you want to use full as backup mode? Which version of Medusa are you using? What throughput do you effectively observe during backups?

— Reply to this email directly, view it on GitHubhttps://github.com/thelastpickle/cassandra-medusa/issues/734#issuecomment-2036229653, or unsubscribehttps://github.com/notifications/unsubscribe-auth/AGQ3N4PCJ7QSQNKGOIIT6D3Y3TRTRAVCNFSM6AAAAABFVSPBJSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMZWGIZDSNRVGM. You are receiving this because you authored the thread.Message ID: @.***>

ftarrega commented 5 months ago

Hey, guys. just a remark about my last reply, I did it from Outlook and those "cid" references were in fact screenshots I pasted. I didn't know GitHub removes it when pasting the email here. Here's the pics. Hopefully they'll be visible now

adejanovski commented 5 months ago

This could be from my misunderstanding on how to properly use the tool, but I learned this from an article you wrote on medusa in 2019 (Medusa - Spotify’s Apache Cassandra backup tool is now open source (thelastpickle.com)https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html): [cid:6fadc31d-bd59-40f9-b37a-afb95537198d] I figured for a first run ever on any cluster one should run a full backup and then keep track of the changes with differential backups as a follow-up. I thought if we just went for a differential one from the get go I'd be telling medusa not to backup all the data in the cluster and I'd be at fault for the missing data

That's a common misconception, so I clearly did a poor job at explaining this 😅 When differential backups run for the first time on a cluster, they upload everything. Subsequent backups only upload new files since the previous backup. This blog post is also a little outdated and we used to copy directly from folders in the bucket with the GCS backend only. We recently stopped doing it altogether to simplify our codebase and in prep for a transition to differential (well, synthetic) becoming the only backup mode.

I see you guys have just released v0.20.0 and v0.20.1, but when I started this endeavour v0.19.1 was the latest available. That's the version we are currently using. Do you advise we upgrade?

We've had some upload concurrency issues that were fixed in v0.20.x but they were affecting the grpc server I think, so it shouldn't have an impact on your setup. One interesting feature there though is the self healing of broken differential backups. We'll re-upload any file that is missing/corrupted in the storage bucket even if it was previously uploaded as part of an older backup.

Here's what I could draw from AWS metric system.net.bytes_sent. I'd say that, on average, it clocked at no more than 50MiB/s:

Well, that would suggest your 1000MB/s throughput limit isn't being taken into account and the default throttling at 50MB/s is still applied 🤔 Could you share your medusa.ini file with us?

rzvoncek commented 5 months ago

I don't see the setup.py file in the repo anymore ... using miniconda to setup a python 3.8.18 env (couldn't figure pipenv quickly enough :o), run the pip download and pip install, then python setup.py install, finally making python see all dependencies by adding an export PYTHONPATH to the bash_profile of cassandra

Yes, we removed the setup.py together with requirements.txt because they both featured the project requirements and we had to maintain both. We replaced them the pyproject.toml file. The pip install cassandra-medusa==0.20.1 should still work though. You might still need to some of the post-installation config though.

ftarrega commented 4 months ago

Hey, lads. How are you?

Just dropping by to share the medusa.ini file requested:

; Copyright 2019 Spotify AB. All rights reserved.
;
; Licensed under the Apache License, Version 2.0 (the "License");
; you may not use this file except in compliance with the License.
; You may obtain a copy of the License at
;
; http://www.apache.org/licenses/LICENSE-2.0
;
; Unless required by applicable law or agreed to in writing, software
; distributed under the License is distributed on an "AS IS" BASIS,
; WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
; See the License for the specific language governing permissions and
; limitations under the License.

[cassandra]
stop_cmd = /opt/apps/cassandra/cassandra/bin/stop-server
start_cmd = /opt/apps/cassandra/cassandra/bin/cassandra
config_file = /opt/apps/cassandra/cassandra/conf/cassandra.yaml
cql_username = ***********
cql_password = ***********
; When using the following setting there must be files in:
; - `<cql_k8s_secrets_path>/username` containing username
; - `<cql_k8s_secrets_path>/password` containing password
;cql_k8s_secrets_path = <path to kubernetes secrets folder>
nodetool_username = ***********
nodetool_password = ***********
;nodetool_password_file_path = <path to nodetool password file>
;nodetool_k8s_secrets_path = <path to nodetool kubernetes secrets folder>
nodetool_host = XX.XX.XX.XX
nodetool_port = 7199
;certfile= <Client SSL: path to rootCa certificate>
;usercert= <Client SSL: path to user certificate>
;userkey= <Client SSL: path to user key>
;sstableloader_ts = <Client SSL: full path to truststore>
;sstableloader_tspw = <Client SSL: password of the truststore>
;sstableloader_ks = <Client SSL: full path to keystore>
;sstableloader_kspw = <Client SSL: password of the keystore>
;sstableloader_bin = <Location of the sstableloader binary if not in PATH>

; Enable this to add the '--ssl' parameter to nodetool. The nodetool-ssl.properties is expected to be in the normal location
;nodetool_ssl = true

; Command ran to verify if Cassandra is running on a node. Defaults to "nodetool version"
;check_running = nodetool version

; Disable/Enable ip address resolving.
; Disabling this can help when fqdn resolving gives different domain names for local and remote nodes
; which makes backup succeed but Medusa sees them as incomplete.
; Defaults to True.
resolve_ip_addresses = True

; When true, almost all commands executed by Medusa are prefixed with `sudo`.
; Does not affect the use_sudo_for_restore setting in the 'storage' section.
; See https://github.com/thelastpickle/cassandra-medusa/issues/318
; Defaults to True
;use_sudo = True

[storage]
storage_provider = ***********
; storage_provider should be either of "local", "google_storage", "azure_blobs" or the s3_* values from
; https://github.com/apache/libcloud/blob/trunk/libcloud/storage/types.py

; Name of the bucket used for storing backups
bucket_name = ***********

; JSON key file for service account with access to GCS bucket or AWS credentials file (home-dir/.aws/credentials)
;key_file = /etc/medusa/credentials

; Path of the local storage bucket (used only with 'local' storage provider)
;base_path = /path/to/backups

; Any prefix used for multitenancy in the same bucket
prefix = system_env

;fqdn = <enforce the name of the local node. Computed automatically if not provided.>

; Number of days before backups are purged. 0 means backups don't get purged by age (default)
max_backup_age = 7
; Number of backups to retain. Older backups will get purged beyond that number. 0 means backups don't get purged by count (default)
max_backup_count = 4
; Both thresholds can be defined for backup purge.

; Used to throttle S3 backups/restores:
transfer_max_bandwidth = 1000MB/s

; Max number of concurrent downloads/uploads.
concurrent_transfers = 16

; Size over which uploads will be using multi part uploads. Defaults to 20MB.
multi_part_upload_threshold = 104857600

; GC grace period for backed up files. Prevents race conditions between purge and running backups
backup_grace_period_in_days = 10

; When not using sstableloader to restore data on a node, Medusa will copy snapshot files from a
; temporary location into the cassandra data directroy. Medusa will then attempt to change the
; ownership of the snapshot files so the cassandra user can access them.
; Depending on how users/file permissions are set up on the cassandra instance, the medusa user
; may need elevated permissions to manipulate the files in the cassandra data directory.
;
; This option does NOT replace the `use_sudo` option under the 'cassandra' section!
; See: https://github.com/thelastpickle/cassandra-medusa/pull/399
;
; Defaults to True
use_sudo_for_restore = False

;api_profile = <AWS profile to use>

;host = <Optional object storage host to connect to>
;port = <Optional object storage port to connect to>

; Configures the use of SSL to connect to the object storage system.
;secure = True

; Enables verification of certificates used in case secure is set to True.
; Enabling this is not yet supported - we don't have a good way to configure paths to custom certificates.
; ssl_verify = False

;aws_cli_path = <Location of the aws cli binary if not in PATH>

[monitoring]
;monitoring_provider = <Provider used for sending metrics. Currently either of "ffwd" or "local">

[ssh]
;username = <SSH username to use for restoring clusters>
;key_file = <Path of SSH key for use for restoring clusters. Expected in PEM unencrypted format.>
;port = <SSH port for use for restoring clusters. Default to port 22.
;cert_file = <Path of public key signed certificate file to use for authentication. The corresponding private key must also be provided via key_file parameter>

[checks]
;health_check = <Which ports to check when verifying a node restored properly. Options are 'cql' (default), 'thrift', 'all'.>
;query = <CQL query to run after a restore to verify it went OK>
;expected_rows = <Number of rows expected to be returned when the query runs. Not checked if not specified.>
;expected_result = <Coma separated string representation of values returned by the query. Checks only 1st row returned, and only if specified>
;enable_md5_checks = <During backups and verify, use md5 calculations to determine file integrity (in addition to size, which is used by default)>

[logging]
; Controls file logging, disabled by default.
enabled = 1
file = /opt/apps/cassandra-medusa/logs/medusa.log
level = DEBUG

; Control the log output format
format = [%(asctime)s] %(levelname)s: %(message)s

; Size over which log file will rotate
maxBytes = 20000000

; How many log files to keep
backupCount = 500

[grpc]
; Set to true when running in grpc server mode.
; Allows to propagate the exceptions instead of exiting the program.
;enabled = False

[kubernetes]
; The following settings are only intended to be configured if Medusa is running in containers, preferably in Kubernetes.
;enabled = False
;cassandra_url = <URL of the management API snapshot endpoint. For example: http://127.0.0.1:8080/api/v0/ops/node/snapshots>

; Enables the use of the management API to create snapshots. Falls back to using Jolokia if not enabled.
;use_mgmt_api = True
;ca_cert = mutual_auth_ca.pem
;tls_cert = mutual_auth_client.crt
;tls_key = mutual_auth_client.key

ftarrega commented 2 months ago

Hey lads, what's up? It's been a while.

Let me take this opportunity wherein this ticket is still open to ask you some new questions rather than cluttering you ticket system with a new ticket. I've been wondering recently about encryption. Does medusa ship with backup encryption capabilities as it's saved to an S3 bucket for instance, or as you were designing it you felt it was better served if left to Cassandra or even to the applications reading from and writing to the cluster to do it?

Thanks in advance

rzvoncek commented 2 months ago

Hey @ftarrega, well, we're all busy, you probably understand :-/

Looking at the ini file, there are 4 settings that might impact the upload speed:

storage_provider = ***********
transfer_max_bandwidth = 1000MB/s
concurrent_transfers = 16
multi_part_upload_threshold = 104857600

The storage_provider is important because some advanced features like throughput throttling (and encryption, more later) work the best in the S3 one. Google and Azure do not behave the same yet. But we've seen all of them do more than 50 MB/s in our tests.
The transfer_max_bandwidth is high enough and should not stand in our way.
The concurrent_transfers seems to be quite convoluted. The first thing it does, regardless of the storage provider, is to determine the number of files to process in parallel when uploading or downloading. Each group of concurrently processed files has to have all files processed before the next group starts (so the groups are synchronous, example).
Then, the concurrent_transfers has different meaning per storage provider. For Google, it has no extra menaing. For Azure, we pass it to the SDK library we use if the file is bigger than the multipart threshold (100MB in your case). For S3, this controls the size of the executor we submit trasnfer tasks into. We do not propagete is to the boto's concurrency parameter.
Using the concurrency setting differently per storage provider is mostly because they all work differently so we kinda have to adjust.
Finally, the multi_part_upload_threshold controls when to splitting files into several parts before transfering them. It again, differs. For Azure, we use it in the decision wether to use multiple workers for upload. In other two providers, we actually ignore it. This is because we found out that it's better to have the client libraries deal with multi-parting themselves.

To sum this up, tuning the throughpput comes down to how your files look like. If you have many small files, you'll need more concurrency. And remove the transfer_max_bandwidth, as we have little experience with it and it might be buggy. Outside of Medusa, also check for any other system limits. Medusa can easily exhaust the throughput available to a point where the network interface is the bottle neck.

For encryption, there are two features we have, both only for S3:

you can set secure to True in the config, which will make boto use HTTPS instad of HTTP. If you run into issues with certificates, set ssl_verify to False.
you can provide the kms_id which should be an UUID (or ARN, not sure) of the AWS KMS key to use when creating the S3 object.

For the other providers, you might be able to set up some encryption-at-rest and it might work transparently.

ftarrega commented 1 month ago

Hi there, guys. Hope you're doing fine.

My apologies, @rzvoncek, I didn't mean for my introductory comment to come across as pushy. It was just a bit of smalltalk really, as it had been a while since I visited the ticket too. I'm sure you guys are hard at work :-)

Anyway, thanks a lot for your reply. I'm gonna close this ticket and open a fresh one as I need some help with an issue I'm getting, if that's OK.

thelastpickle / cassandra-medusa

Some technical questions I got about the tool #734