tbarbugli / cassandra_snapshotter

A tool to backup cassandra nodes using snapshots and incremental backups on S3

Improve restore #93

Closed rhardouin closed 5 years ago

rhardouin commented 8 years ago

The most important commit is "Fix restore: use absolute path" (665a97b). I was unable to restore with a relative path because the script tried to download files into the current working directory instead of --cassandra-data-dir, i.e. where the directory tree is created.
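
A minimal sketch of the idea behind the absolute-path fix, not the project's actual code: the restore directory is resolved once with `os.path.abspath`, and every download target is built from it, so the result no longer depends on the current working directory. The helper name `local_path_for` and its parameters are assumptions made for illustration.

```python
import os


def local_path_for(restore_dir, keyspace, table, filename):
    """Build the local download target for one backup file.

    Hypothetical helper for illustration only; the name and the
    parameters are assumptions, not the project's real API.
    """
    # Resolve the base directory once, so a relative argument cannot
    # send downloads to whatever the working directory happens to be.
    base = os.path.abspath(restore_dir)
    return os.path.join(base, keyspace, table, filename)
```

Calling `local_path_for("restore", "test", "sampledata", "mc-3-big-Data.db")` returns an absolute path even though the first argument is relative.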

Also add:

rhardouin commented 8 years ago

Let me know if you want me to bump the version in this PR (let's say 1.0.1?)

tbarbugli commented 8 years ago

looks great, thank you for the contribution and remember to add yourself to the list of contributors ;)

rhardouin commented 8 years ago

Ok :) I'll push some other improvements. Instead of --no-sstableloader I will add an option --download-only, mutually exclusive with --target-hosts. I'll also update the doc (README).

rhardouin commented 8 years ago

Add --local to perform a local restore

--local restores data directly on the local server where the command is run. The filenames are not prefixed by <HOST>_ because we restore from only one node in this mode, so the prefix would be useless.

Rename --cassandra-data-dir to --restore-dir because:

NOTE

I'm aware this is a breaking change, but I find --cassandra-data-dir really misleading for an sstableloader restore and dangerous for a local restore: when you ask people to set --cassandra-data-dir, they will set the real C* data directory, and currently the keyspace directory under this directory is deleted (in _delete_old_dir_and_create_new)!

rhardouin commented 8 years ago

FYI here is the help output now:

$ cassandra-snapshotter restore -h
usage: cassandra-snapshotter restore [-h] [--snapshot-name SNAPSHOT_NAME]
                                     --keyspace KEYSPACE [--table TABLE]
                                     [--hosts HOSTS]
                                     [--cassandra-bin-dir CASSANDRA_BIN_DIR]
                                     [--restore-dir RESTORE_DIR]
                                     (--target-hosts TARGET_HOSTS | --local | --no-sstableloader)

optional arguments:
  -h, --help            show this help message and exit
  --snapshot-name SNAPSHOT_NAME
                        The name (date/time) of the snapshot (and
                        incrementals) to restore
  --keyspace KEYSPACE   The keyspace to restore
  --table TABLE         The table (column family) to restore; leave blank for
                        all
  --hosts HOSTS         Comma separated list of hosts to restore from; leave
                        empty for all. Only one host allowed when using
                        --local.
  --cassandra-bin-dir CASSANDRA_BIN_DIR
                        cassandra binaries directory
  --restore-dir RESTORE_DIR
                        Directory where data will be downloaded. If --target-
                        hosts is passed, sstableloader will stream data from
                        this directory.
  --target-hosts TARGET_HOSTS
                        The comma separated list of hosts to restore into
  --local               Do not run sstableloader when restoring. If set, files
                        will just be downloaded and decompressed in --restore-
                        dir.
  --no-sstableloader    Do not run sstableloader when restoring. If set, files
                        will just be downloaded. Use it if you want to do some
                        checks and then run sstableloader manually.

We see that --target-hosts, --local and --no-sstableloader are mutually exclusive:

(--target-hosts TARGET_HOSTS | --local | --no-sstableloader)
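
The help text above has the shape produced by Python's argparse, where parentheses around alternatives in the usage line indicate a required mutually exclusive group. The following is a minimal sketch of how such a group can be declared; the function name `build_restore_parser` and the `prog` value are assumptions, and this is not taken from the PR's code.

```python
import argparse


def build_restore_parser():
    """Sketch of a parser with one required, mutually exclusive choice."""
    parser = argparse.ArgumentParser(prog="restore-sketch")
    parser.add_argument("--keyspace", required=True)
    # Exactly one of the three options below must be given: argparse
    # rejects both "none of them" and "more than one of them".
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--target-hosts")
    mode.add_argument("--local", action="store_true")
    mode.add_argument("--no-sstableloader", action="store_true")
    return parser
```

With this sketch, `--keyspace test --local` parses, while `--keyspace test --local --target-hosts 10.0.2.253` makes argparse print an error and exit.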

awheeler commented 8 years ago

Any reason not to accept this request?

tbarbugli commented 8 years ago

@awheeler I have no time to review and test this properly; hopefully this will change next week

rhardouin commented 7 years ago

@tbarbugli Travis fails because --use-mirrors doesn't exist in pip 8.1.2 (removed in pip 7.0.0). The previous builds used pip 6.

awheeler commented 7 years ago

I'm running into several problems with the incremental backups using python 2.7:

  1. It's unclear to me how to restore a given incremental backup, or if that's possible.
  2. When deciding whether to do incremental or snapshot, the snapshot host, keyspace, and table lists are in unicode (from json.loads), while the env hosts, keyspace and table are not. Additionally, they are not sorted, so mismatch is the norm (I'm getting my hostlist from a DNS query, so it's not inherently sorted).
  3. Snapshots are always deleted at the end of the run, so it's impossible to get incrementals going in the first place. I've added a --keep-snapshot flag to overcome this, but I think there's probably a cleaner solution.

I have solved problems 2 and 3 (though not updated my fork yet), and my situation doesn't require solving 1.
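
One way to approach problem 2, sketched under assumptions: normalize both lists to sorted text strings before comparing, so that neither ordering nor the string type affects the result. The function names are hypothetical, and the sketch is written for Python 3 (str/bytes); on Python 2.7 the corresponding types would be unicode/str. It is not the fix from awheeler's fork.

```python
def normalize(names):
    """Return a sorted list of text strings from str or bytes items."""
    return sorted(
        n.decode("utf-8") if isinstance(n, bytes) else str(n) for n in names
    )


def same_targets(from_manifest, from_cli):
    """Compare two name lists, ignoring order and str/bytes differences."""
    return normalize(from_manifest) == normalize(from_cli)
```

For example, a host list loaded from JSON and an unsorted host list from a DNS query compare equal as long as they contain the same names.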

rhardouin commented 7 years ago

We don't use incremental backups in production, which is why I haven't encountered the problem. It would be great if you could fix that in a PR.

lmammino commented 7 years ago

What's the status on this PR? It seems it has been going on for almost a year now...

rhardouin commented 7 years ago

@lmammino I think it would be cool if you could test it on your side (again, without incremental backups). I'm using it in production and I would be glad to see it merged into master, but there are lots of changes. So if we can validate that it works in several environments, it would be safer.

lmammino commented 7 years ago

@rhardouin Sure. I'll try to update my installed version (which I found out to be 1.0 from pip) to the one in this branch and provide some feedback. Hopefully I can provide some by the end of the day!

lmammino commented 7 years ago

@rhardouin I also submitted a PR to you to get rid of the current conflict with master. This might speed up the integration process; please take a few minutes to review and integrate it here, as that's the version I'll be testing against.

techfort commented 7 years ago

@rhardouin @lmammino this looks great and I'd be ecstatic to see this merged to master and available on pip, as I am in need of something exactly like this. I'm testing this myself and will report with feedback asap.

tbarbugli commented 7 years ago

@techfort @lmammino looking forward to hearing about your tests ;)

lmammino commented 7 years ago

So, unfortunately I couldn't get this to work.

Let me recap all the steps I followed and the errors I got because, given that I am still a noob with Cassandra, it is very likely that I am doing something wrong:

So, I have a 3-node cluster with a test table of about 60,000 records and a cassandra-backup EC2 instance, whose only goal is to have cassandra-snapshotter installed to trigger the creation of snapshots.

That's the command I use in the cassandra-backup machine to take the first snapshot (at this stage I am not even interested in incremental backups):

cassandra-snapshotter \
  -v \
  --s3-bucket-name=XXXXXX \
  --s3-bucket-region=eu-west-1 \
  --s3-base-path=backups \
  --aws-access-key-id=YYYYY \
  --aws-secret-access-key=ZZZZZZZZ \
  backup \
  --cqlsh-password=CCCCCC \
  --cqlsh-user=DDDDDD \
  --hosts=10.0.2.252,10.0.3.252,10.0.4.252 \
  --cassandra-conf-path=/etc/cassandra

I can confirm that this command works, as I can see all the files from the latest snapshot in my S3 bucket following the expected folder structure for each one of the 3 nodes.

Then, I just truncate the test table on one of the machines in the cluster using cqlsh. After a few seconds I verify that all the nodes have 0 records in the test table.

At this stage I suppose I am ready to trigger a restore from the backup machine. That's the command I use:

cassandra-snapshotter \
  -v \
  --s3-bucket-name=XXXXXX \
  --s3-bucket-region=eu-west-1 \
  --s3-base-path=backups \
  --aws-access-key-id=YYYYY \
  --aws-secret-access-key=ZZZZZZZZ \
  restore \
  --hosts=10.0.3.252,10,0.4.252 \
  --target-hosts=10.0.2.253 \
  --snapshot-name=20170622151202 \
  --keyspace=test

and its output:

Restoring keyspace: test  from backup 20170622151202
Backup files of the following host(s) will be downloaded in '/tmp/restore_cassandra/': 10.0.3.252, 10, 0.4.252.
After the downloading data will be streamed to the following host(s) via sstableloader: 10.0.2.253.
Deleting directory (/tmp/restore_cassandra/test)...
Found 78 files, with total size of 82.8MB.
Found 78 files, with total size of 82.8MB.
Starting to download...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_manifest.json.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-3-big-CompressionInfo.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-3-big-Data.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-3-big-Digest.crc32.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-3-big-Filter.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-3-big-Index.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-3-big-Statistics.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-3-big-Summary.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-3-big-TOC.txt.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-4-big-CompressionInfo.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-4-big-Data.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-4-big-Digest.crc32.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-4-big-Filter.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-4-big-Index.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-4-big-Statistics.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-4-big-Summary.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-4-big-TOC.txt.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-5-big-CompressionInfo.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-5-big-Data.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-5-big-Digest.crc32.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-5-big-Filter.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-5-big-Index.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-5-big-Statistics.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-5-big-Summary.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_mc-5-big-TOC.txt.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.2.252_schema.cql.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_manifest.json.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-3-big-CompressionInfo.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-3-big-Data.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-3-big-Digest.crc32.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-3-big-Filter.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-3-big-Index.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-3-big-Statistics.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-3-big-Summary.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-3-big-TOC.txt.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-4-big-CompressionInfo.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-4-big-Data.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-4-big-Digest.crc32.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-4-big-Filter.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-4-big-Index.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-4-big-Statistics.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-4-big-Summary.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-4-big-TOC.txt.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-5-big-CompressionInfo.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-5-big-Data.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-5-big-Digest.crc32.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-5-big-Filter.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-5-big-Index.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-5-big-Statistics.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-5-big-Summary.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_mc-5-big-TOC.txt.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.3.252_schema.cql.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_manifest.json.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-3-big-CompressionInfo.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-3-big-Data.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-3-big-Digest.crc32.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-3-big-Filter.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-3-big-Index.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-3-big-Statistics.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-3-big-Summary.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-3-big-TOC.txt.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-4-big-CompressionInfo.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-4-big-Data.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-4-big-Digest.crc32.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-4-big-Filter.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-4-big-Index.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-4-big-Statistics.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-4-big-Summary.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-4-big-TOC.txt.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-5-big-CompressionInfo.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-5-big-Data.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-5-big-Digest.crc32.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-5-big-Filter.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-5-big-Index.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-5-big-Statistics.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-5-big-Summary.db.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_mc-5-big-TOC.txt.lzo...
Decompressing /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc/10.0.4.252_schema.cql.lzo...
Download finished.00.00)
Running sstableloader...
invoking: /usr/bin/sstableloader --nodes 10.0.2.253 -v                 /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc
sh: 1: /usr/bin/sstableloader: not found
Restore completed.

Notice the last bit:

invoking: /usr/bin/sstableloader --nodes 10.0.2.253 -v /tmp/restore_cassandra/test/sampledata-40fd1760566a11e78fcbb5651c2815dc
sh: 1: /usr/bin/sstableloader: not found

If I log in to one of the nodes I can confirm sstableloader is actually available in the expected path:

ssh -A ubuntu@cassandra-0

ubuntu@cassandra-0:~$ which sstableloader
/usr/bin/sstableloader

Of course the test table is still empty after this.

Can you see something wrong with this approach?

Thanks in advance for the support

lmammino commented 7 years ago

Any chance you can have a quick look at this @tbarbugli @rhardouin ?

rhardouin commented 7 years ago

@lmammino sstableloader should be installed on the server where you run the cassandra-snapshotter restore command: data will be fetched from S3, then streamed via sstableloader to 10.0.2.253. 10.0.2.253 just receives data; it will not make use of sstableloader.

If you don't want this behavior you can have a look at --local to restore directly on each Cassandra node.
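
Since the log above prints "Restore completed." even though sstableloader was not found, a pre-flight check on the machine running the restore could surface the problem earlier. This is a hedged sketch only: the function names are hypothetical, the default directory mirrors the /usr/bin path seen in the log, and nothing here is part of cassandra-snapshotter.

```python
import os
import shutil


def find_sstableloader(cassandra_bin_dir="/usr/bin"):
    """Return the path to sstableloader on this machine, or None."""
    candidate = os.path.join(cassandra_bin_dir, "sstableloader")
    # Given a path with a directory part, shutil.which checks that
    # exact path for an executable file instead of searching PATH.
    return shutil.which(candidate)


def require_sstableloader(cassandra_bin_dir="/usr/bin"):
    """Return the sstableloader path or raise before any download starts."""
    path = find_sstableloader(cassandra_bin_dir)
    if path is None:
        raise RuntimeError(
            "sstableloader not found in %s on the machine running the restore"
            % cassandra_bin_dir
        )
    return path
```

Calling `require_sstableloader()` at the start of a restore with --target-hosts would fail fast on a backup host that has no Cassandra binaries installed.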

Does it help? Is it clear?

lmammino commented 7 years ago

Thank you @rhardouin, that makes total sense (not sure why I didn't realize this myself 😅).

I'll try again in the following days and let you know what happens.

Thanks again for the support!

tbarbugli commented 7 years ago

@lmammino let us know how the test goes :)

regonzalo commented 6 years ago

Merge please

tbarbugli commented 6 years ago

@regonzalo @lmammino did you manage to get this to work?

jeremyjpj0916 commented 6 years ago

Reading over this repo, I would love for the restore logic to be in place, as I need a neat tool to do C* backups/restores for my 3.x cluster (though not to S3; I will need to modify this code so I can point to an arbitrary Linux backup box within our network). I notice a few forks of this repo already document and expose the restore calls. Might as well be in the OG repo 👍.

opsline-ilan commented 5 years ago

@tbarbugli can this get merged? Or transferred to an active maintainer?

opsline-ilan commented 5 years ago

Just FYI, if anyone goes down a rabbit hole like I did: this tool seems to prefix db files with the base name (e.g. monthly_mc_xyz.db) but does not edit the manifest to reflect that. I scripted a bulk rename, among other things, to remove the prefix (leaving mc_xyz.db) so the files actually restore. Seems like a bug? @tbarbugli @rhardouin
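
The script mentioned above is not shown in the thread. As an illustration only, a bulk rename of that kind could look like the sketch below; the function name `strip_prefix` and the example prefix `monthly_` are assumptions derived from the filename in the comment.

```python
import os


def strip_prefix(directory, prefix):
    """Rename files in `directory` by removing a leading `prefix`.

    Returns the list of (old_name, new_name) pairs that were renamed.
    Hypothetical helper, not part of cassandra-snapshotter.
    """
    renamed = []
    for name in sorted(os.listdir(directory)):
        # Skip files without the prefix, and a file named exactly
        # like the prefix (removing it would leave an empty name).
        if not name.startswith(prefix) or name == prefix:
            continue
        new_name = name[len(prefix):]
        target = os.path.join(directory, new_name)
        # Refuse to overwrite an existing file with the same bare name.
        if os.path.exists(target):
            raise FileExistsError(target)
        os.rename(os.path.join(directory, name), target)
        renamed.append((name, new_name))
    return renamed
```

For instance, `strip_prefix("/tmp/restore_cassandra/test/sampledata", "monthly_")` would rename monthly_mc_xyz.db to mc_xyz.db and leave other files untouched; the directory path here is only an example.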