instaclustr / icarus

Sidecar for Cassandra with integrated backup / restore
https://instaclustr.com
Apache License 2.0

[FEATURE] add the esop remove-backup function to the REST interface. #8

Closed rjb1971 closed 2 years ago

rjb1971 commented 2 years ago

Is your feature request related to a problem? Please describe. We are using Icarus to back up and restore our Cassandra database. An important part of the backup procedure is cleaning up old backups. We can't use the simple retention date option to delete old backups, because we're using an S3 repository and the skiprefresh option is set to true. Therefore we can't clean up old backups easily.

Describe the solution you'd like We would like to use the remove-backup function in esop to clean up old backups through the REST interface of Icarus, with all the options available in esop remove-backup. With this option we can clean up old backups remotely.

Describe alternatives you've considered We are also looking into whether it is possible to use the esop tooling remotely. At the moment it is unclear whether the esop tooling also needs to be local when using the remove-backup function. We are also looking at the possibility of building something ourselves to delete them directly from the S3 repository.

Additional context None.

smiklosovic commented 2 years ago

Hi @rjb1971 ,

Your request makes sense to me. I think there is already a listing module in Icarus, so I wonder why the removal module is not included already. Mystery ... It should be possible to do, but there are some caveats.

For example, if you execute a removal via Icarus, you cannot use the file:// storage protocol for a node different from the one you are connecting to, because that data would not be there. That is rather obvious. However, from what you are saying, it seems to me you already know what you are doing.

I am not sure about the other implementation details / problems yet.

Give me some time here; I will quite probably get to this next week.

rjb1971 commented 2 years ago

Thank you

smiklosovic commented 2 years ago

hi @rjb1971

I exposed the remove-backup module in Icarus. It is released as version 2.0.2.

For lack of time I haven't written any docs, but it is quite easy to follow. I added a test where Icarus removes a remote backup after a request is sent to it, so I believe you can figure it all out on your own.

Do not hesitate to hit me up if you have any problems with that, and tell me how it went!

Regards

rjb1971 commented 2 years ago

Hi Smiklosovic. Great, that is fast. I don't think I can work on it today, but I hope I can start on Monday. I will keep you posted.

rjb1971 commented 2 years ago

Did my first test, but it failed. I tried to remove an old backup. I didn't get any errors, but the manifest wasn't removed from our S3 repository. I then found out that more operations didn't work: our old backup URL didn't work anymore, and when I tried the list operation I got an access denied error. So I looked at the version we used before and found out it was a very old version, 1.1.1. And according to the esop documentation, version 2.X isn't compatible with 1.X.

Did I understand correctly that when we use 2.0.2 we can't restore our current (1.1.1) backups anymore?

It looks like we have to make some deployment changes to make version 2.0.2 work in our environment before I can test the cleanup again. I quickly looked for documentation about upgrading from 1.X to 2.X but couldn't find any. Is it correct that such a document doesn't exist?

Oh, and not sure if this is a problem, but when Icarus starts we get this warning:

Feb 21, 2022 1:53:13 PM org.glassfish.jersey.internal.Errors logErrors
WARNING: The following warnings have been detected: WARNING: Parameter 1 of type java.util.Set<java.lang.Class<? extends com.instaclustr.operations.Operation>> from public java.util.Collection com.instaclustr.sidecar.operations.OperationsResource.getOperations(java.util.Set<java.lang.Class<? extends com.instaclustr.operations.Operation>>,java.util.Set<com.instaclustr.operations.Operation$State>) is not resolvable to a concrete type.

smiklosovic commented 2 years ago

hi @rjb1971

Yes, that upgrade document very likely just does not exist. This project is released as open source basically as a courtesy. We have paying customers who use this, and any requests and problems they have are solved internally and released here, more or less.

That warning is just a warning, nothing to worry about.

I think you should upgrade to 2.0.2 and try again. 1.x and 2.x are most probably not compatible, yes.

rjb1971 commented 2 years ago

FYI: backup (and list) is working again. I needed 2 changes to make 2.0.2 work: I had to add the environment variable AWS_REGION, even when it is empty, and in the backup URL I had to change cassandraDirectory => dataDirs (note that dataDirs is one level deeper than the original directory).

smiklosovic commented 2 years ago

Yes, dataDirs is in fact an array; that comes from our customers' need to back up a node which has multiple data directories for its tables. As far as I know we are the only solution on the market which is able to back up from and restore to a node with a multi data dirs setup, and with Icarus you can even do so while your node is fully up. Who does that? :)

rjb1971 commented 2 years ago

Looks like deleting a backup doesn't work. I get no errors and the result is completed, but the manifest is still present in our S3 repository. I tried adding and removing some options, but that resulted in the same responses. Worryingly, I get the same response when I try to delete a non-existent backup!

We get 2 log lines when I try to delete a backup:

[] - 11:14:30.190 [jdk-http-server-0] INFO c.i.s.o.OperationsResource - Received operation RemoveBackupRequest{backupName=autosnap-1645523349-73919840-436c-3d25-a9d8-42659f4e5722-1645523358989, dry=false, skipNodeCoordinatesResolution=true, olderThan=0 s, cacheDir=?/.esop, globalRemoval=false, dcs=[]}
[] - 11:14:31.375 [pool-4-thread-1] INFO c.i.e.i.r.RemoveBackupOperation - Looking for backups to delete for node cassandra_dev/rc3/1

Strikingly, it logs globalRemoval=false even when I set it to true.

Commands and responses:

$ curl --header "Content-Type: application/json" --data '{"type":"remove-backup", "globalRequest":true, "storageLocation" : "ceph://cassandra-icarus2-backup-dev/cassandra_dev/rc3/1", "backupName":"autosnap-1645523349-73919840-436c-3d25-a9d8-42659f4e5722-1645523358989", "skipNodeCoordinatesResolution":true}' cassandra-dev00:4567/operations
{ "id" : "59047937-8d03-439b-bce0-5650ac2ae837", "creationTime" : "2022-02-22T11:14:30.379Z", "state" : "RUNNING", "errors" : [ ], "progress" : 0.0, "startTime" : "2022-02-22T11:14:30.384Z", "type" : "remove-backup", "storageLocation" : "ceph://cassandra-icarus2-backup-dev/cassandra_dev/rc3/1", "insecure" : false, "skipBucketVerification" : false, "retry" : { "interval" : 10, "strategy" : "LINEAR", "maxAttempts" : 3, "enabled" : false }, "backupName" : "autosnap-1645523349-73919840-436c-3d25-a9d8-42659f4e5722-1645523358989", "dry" : false, "skipNodeCoordinatesResolution" : true, "olderThan" : { "value" : 0, "unit" : "SECONDS" }, "cacheDir" : "?/.esop", "removeOldest" : false }

$ curl cassandra-dev00:4567/operations/59047937-8d03-439b-bce0-5650ac2ae837
{ "id" : "59047937-8d03-439b-bce0-5650ac2ae837", "creationTime" : "2022-02-22T11:14:30.379Z", "state" : "COMPLETED", "errors" : [ ], "progress" : 1.0, "startTime" : "2022-02-22T11:14:30.384Z", "type" : "remove-backup", "storageLocation" : "ceph://cassandra-icarus2-backup-dev/cassandra_dev/rc3/1", "insecure" : false, "skipBucketVerification" : false, "retry" : { "interval" : 10, "strategy" : "LINEAR", "maxAttempts" : 3, "enabled" : false }, "backupName" : "autosnap-1645523349-73919840-436c-3d25-a9d8-42659f4e5722-1645523358989", "dry" : false, "skipNodeCoordinatesResolution" : true, "olderThan" : { "value" : 0, "unit" : "SECONDS" }, "cacheDir" : "?/.esop", "removeOldest" : false, "completionTime" : "2022-02-22T11:14:31.382Z" }
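The two curl calls above follow a submit-then-poll pattern: the POST to /operations returns an id with state RUNNING, and a later GET on /operations/<id> shows COMPLETED. A minimal Java sketch of the polling side might look like the following. It assumes Java 11+ (`java.net.http`); the class `OperationPoller` and the regex-based state extraction are illustrative and not part of Icarus, and a real client would more likely use a JSON parser.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Icarus: builds the status request and
// pulls the "state" field out of an operation response.
class OperationPoller {

    // Matches e.g.  "state" : "COMPLETED"  as printed in the responses above.
    private static final Pattern STATE =
            Pattern.compile("\"state\"\\s*:\\s*\"([A-Z_]+)\"");

    // GET <baseUrl>/operations/<id>, the same URL the second curl call uses.
    static HttpRequest statusRequest(String baseUrl, String operationId) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/operations/" + operationId))
                .GET()
                .build();
    }

    // Returns the operation state if the response body contains one.
    static Optional<String> extractState(String responseBody) {
        Matcher m = STATE.matcher(responseBody);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String sample = "{ \"id\" : \"59047937-8d03-439b-bce0-5650ac2ae837\", "
                + "\"state\" : \"COMPLETED\", \"errors\" : [ ] }";
        System.out.println(extractState(sample).orElse("unknown"));
        System.out.println(statusRequest("http://cassandra-dev00:4567",
                "59047937-8d03-439b-bce0-5650ac2ae837").uri());
    }
}
```

Note that, as this comment reports, COMPLETED with an empty errors list only says the operation finished; it is still worth checking that the manifest is actually gone from the bucket.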

smiklosovic commented 2 years ago

Why is your "cacheDir" with question mark? cacheDir: "?/.esop"

Also, would you mind taking a look at the logs of Icarus itself?

Also, try to list all backups for one node and look into ~/.esop to see which manifests it downloaded.

Look into the "--skip-download" flag for Esop (the same applies to listing via Icarus in JSON: skipDownload); that should be false. If it is false and you list your backups, it should download the manifests into ~/.esop. Do you have these files there?

smiklosovic commented 2 years ago

Also try to set 'skipNodeCoordinatesResolution' to false. Is this really the path where your backups are? .../cassandra_dev/rc3/1.

I would expect them to be under ".../cassandra-dev/rc3/_uuid_of_that_node". How are you doing backups?

Basically, you need to list it all first into ~/.esop; listing downloads the manifests of all the backups there are. Upon deletion, it will "list" the nodes in that ~/.esop dir, parse the manifests, and remove the files (which are located remotely). So you need to list first, and all your manifests need to be found locally before removing backups.

It is done via this cache because it is quite hard to "list" things via the APIs of these cloud providers, and doing anything more complex, such as filtering, is very awkward. That is the reason everything is downloaded into the cache first; you then point the removal at that cache to remove things in the cloud.
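To make the cache-driven lookup concrete, here is a minimal Java sketch that scans a local cache directory for nodes holding a manifest of a given backup. It is not Esop's actual code: the class `CachedManifests` and its methods are hypothetical, and the layout `<bucket>/<cluster>/<dc>/<node-id>/manifests/<backupName>.json` is assumed from the cache listing posted later in this thread.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch, not Esop code: finds node directories in a local
// cache that contain manifests/<backupName>.json.
class CachedManifests {

    static List<Path> nodesWithManifest(Path cacheDir, String backupName) {
        String fileName = backupName + ".json";
        try (Stream<Path> paths = Files.walk(cacheDir)) {
            return paths
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().equals(fileName))
                    .filter(p -> p.getParent().getFileName().toString().equals("manifests"))
                    .map(p -> p.getParent().getParent()) // the node directory
                    .sorted()
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Builds a throwaway cache with two nodes; only node-a has backup2.
    static Path sampleCache() {
        try {
            Path cache = Files.createTempDirectory("esop-cache");
            Path nodeA = Files.createDirectories(cache.resolve("bucket/cluster/dc/node-a/manifests"));
            Path nodeB = Files.createDirectories(cache.resolve("bucket/cluster/dc/node-b/manifests"));
            Files.createFile(nodeA.resolve("backup1.json"));
            Files.createFile(nodeA.resolve("backup2.json"));
            Files.createFile(nodeB.resolve("backup1.json"));
            return cache;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path cache = sampleCache();
        System.out.println(nodesWithManifest(cache, "backup1").size()); // both nodes
        System.out.println(nodesWithManifest(cache, "backup2").size()); // node-a only
    }
}
```

A node whose manifest is missing from the cache simply never shows up in the result, which matches the point above that a removal can only act on what has been listed first.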

Also try not to skip that node resolution.

rjb1971 commented 2 years ago

Why is your "cacheDir" with question mark? cacheDir: "?/.esop"

I was wondering the same thing... it is the default. I had to change it (in the request) to make the list function work. Because you said it needs those files to delete, I also added it to the remove-backup request and removed the node resolution option. It now deletes one snapshot, but I have 3 nodes, so 2 manifests are still present on S3 and in the esop cache. I expected that all 3 would be deleted.

What am I still missing here? Maybe it has something to do with it logging globalRemoval=false even when I set it to true?

smiklosovic commented 2 years ago

If you don't set that dir yourself, it evaluates to this:

this.cacheDir = (cacheDir == null) ? Paths.get(System.getProperty("user.home"), ".esop") : cacheDir;

So your "user.home" is "?".
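A small Java sketch of that default, using the same ternary as the line above, shows where "?/.esop" comes from. The class `CacheDirDefault` and the explicit `userHome` parameter are illustrative (the real code reads the system property directly), and the sketch assumes a Unix-like file system. As far as I know, a JVM typically reports "?" for user.home when the process user has no home directory entry, for example in a container running under an arbitrary UID; treat that as a likely cause rather than a confirmed one for this setup.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustrative only: mirrors the default shown above, with user.home
// passed in explicitly so the effect is easy to see.
class CacheDirDefault {

    static Path resolveCacheDir(Path cacheDir, String userHome) {
        return (cacheDir == null) ? Paths.get(userHome, ".esop") : cacheDir;
    }

    public static void main(String[] args) {
        // No cacheDir in the request and user.home reported as "?".
        System.out.println(resolveCacheDir(null, "?")); // ?/.esop on a Unix-like system

        // An explicit cacheDir in the request wins over the default.
        System.out.println(resolveCacheDir(Paths.get("/icarus/.esop"), "?"));

        // What this JVM would use as the base of the default.
        System.out.println(System.getProperty("user.home"));
    }
}
```

Passing "cacheDir" explicitly in each request, as done later in this thread with "/icarus/.esop", sidesteps the problem without touching the environment.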

Good, so it is deleting. That is progress. Can you send me the backup request please?

Are you able to read basic Java if I try to explain what it does?

If you set "global request", it will do this:

private List<StorageLocation> getStorageLocations(final StorageInteractor restorer) throws Exception {
    if (request.globalRemoval) {
        return restorer.listNodes(request.dcs);
    } else {
        return Collections.singletonList(request.storageLocation);
    }
}

If it is global, it will list all nodes (in ~/.esop). You can further specify which DCs you are interested in; if the dcs flag is not used, all DCs are processed.

That method is called here:

        for (final StorageLocation nodeLocation : getStorageLocations(interactor)) {
            logger.info("Looking for backups to delete for node {}", nodeLocation.nodePath());
            interactor.setStorageLocation(nodeLocation);
            request.storageLocation = nodeLocation;

            if (!reportOptional.isPresent()) {
                logger.info("No backups found for {}", nodeLocation.nodePath());
                continue;
            }

So it will list all your local stuff and, based on that, try to remove it all remotely after additional logic. So it is important to see what it logs. Do you have access to that?

Can you also please send me the tree or file / dir structure of your ~/.esop directory?

smiklosovic commented 2 years ago

I will have more time for this tomorrow. I'll do more hands-on debugging for you and get back to you.

rjb1971 commented 2 years ago

I think the problem is that the global request parameter isn't copied from the url request to the java object RemoveBackupRequest, because it isn't present in the constructor of RemoveBackupRequest.
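The kind of bug described here can be illustrated with a simplified Java sketch. It is not the real RemoveBackupRequest and says nothing about how Icarus actually binds JSON to objects; the classes below are hypothetical stand-ins that take the parsed request as a plain Map. The point is only that a field the constructor never receives silently keeps its default.

```java
import java.util.Map;

// Hypothetical stand-ins, NOT the real RemoveBackupRequest.
class ConstructorBindingDemo {

    static class BrokenRequest {
        final String backupName;
        boolean globalRequest; // stays false: the constructor never sets it

        BrokenRequest(Map<String, Object> json) {
            this.backupName = (String) json.get("backupName");
        }
    }

    static class FixedRequest {
        final String backupName;
        final boolean globalRequest;

        FixedRequest(Map<String, Object> json) {
            this.backupName = (String) json.get("backupName");
            // The flag is now read from the incoming request.
            this.globalRequest = Boolean.TRUE.equals(json.get("globalRequest"));
        }
    }

    public static void main(String[] args) {
        Map<String, Object> json = Map.of("backupName", "backup1", "globalRequest", true);
        System.out.println(new BrokenRequest(json).globalRequest); // false
        System.out.println(new FixedRequest(json).globalRequest);  // true
    }
}
```

No error is raised in the broken variant, which would be consistent with the request completing cleanly while only one node is processed.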

But in case you still need the requested info: the output of the list command caused the log to hang/freeze. Not sure why. If you still need it I can retry tomorrow.

Command used to backup:
curl --header "Content-Type: application/json" --data '{"type":"backup", "globalRequest":"true", "storageLocation" : "ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/1", "metadataDirective":"REPLACE", "dataDirs":["/icarus/cassandra/data/data"], "skipRefreshing":"true"}' cassandra-dev00-ird:4567/operations

Command used to list:
curl --header "Content-Type: application/json" --data '{"type":"list", "storageLocation":"ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4", "skipNodeCoordinatesResolution":true, "humanUnits":true, "cacheDir":"/icarus/.esop"}' cassandra-dev00-ird:4567/operations

Command used to delete:
curl --header "Content-Type: application/json" --data '{"type":"remove-backup", "globalRequest":true, "storageLocation" : "ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/1", "backupName":"autosnap-1645523349-73919840-436c-3d25-a9d8-42659f4e5722-1645523358989", "cacheDir":"/icarus/.esop" }' cassandra-dev00-ird:4567/operations

Files in my .esop cache:
cassandra-icarus2-ird-backup-dev/
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/22a5a1e9-220c-4703-a8ed-a8a9ae35f088
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/22a5a1e9-220c-4703-a8ed-a8a9ae35f088/manifests
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/22a5a1e9-220c-4703-a8ed-a8a9ae35f088/manifests/autosnap-1645454485-73919840-436c-3d25-a9d8-42659f4e5722-1645454494968.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/22a5a1e9-220c-4703-a8ed-a8a9ae35f088/manifests/autosnap-1645455332-73919840-436c-3d25-a9d8-42659f4e5722-1645455341302.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/22a5a1e9-220c-4703-a8ed-a8a9ae35f088/manifests/autosnap-1645488002-73919840-436c-3d25-a9d8-42659f4e5722-1645488011370.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/26ce2e8b-3304-410e-aa21-46cc59670e7a
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/26ce2e8b-3304-410e-aa21-46cc59670e7a/manifests
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/26ce2e8b-3304-410e-aa21-46cc59670e7a/manifests/autosnap-1645454485-73919840-436c-3d25-a9d8-42659f4e5722-1645454494968.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/26ce2e8b-3304-410e-aa21-46cc59670e7a/manifests/autosnap-1645455332-73919840-436c-3d25-a9d8-42659f4e5722-1645455341302.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/26ce2e8b-3304-410e-aa21-46cc59670e7a/manifests/autosnap-1645488002-73919840-436c-3d25-a9d8-42659f4e5722-1645488011370.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/26ce2e8b-3304-410e-aa21-46cc59670e7a/manifests/autosnap-1645523349-73919840-436c-3d25-a9d8-42659f4e5722-1645523358989.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4/manifests
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4/manifests/autosnap-1645454485-73919840-436c-3d25-a9d8-42659f4e5722-1645454494968.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4/manifests/autosnap-1645455332-73919840-436c-3d25-a9d8-42659f4e5722-1645455341302.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4/manifests/autosnap-1645488002-73919840-436c-3d25-a9d8-42659f4e5722-1645488011370.json
cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4/manifests/autosnap-1645523349-73919840-436c-3d25-a9d8-42659f4e5722-1645523358989.json

smiklosovic commented 2 years ago

Hah, good catch! Yes, it seems to be not propagated. I'll give it a shot tomorrow. Thanks.

smiklosovic commented 2 years ago

@rjb1971 try this jar

https://oss.sonatype.org/content/repositories/snapshots/com/instaclustr/icarus/2.0.3-SNAPSHOT/icarus-2.0.3-20220223.115418-1.jar

I will release that once it is ok.

rjb1971 commented 2 years ago

Same result: only one node was deleted, the other two were left unchanged. I do see that "globalRequest" = true, so that is an improvement.

~$ curl --header "Content-Type: application/json" --data '{"type":"list", "storageLocation":"ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4", "skipNodeCoordinatesResolution":true, "humanUnits":true, "json":true}' cassandra-dev00-ird:4567/operations
{ "id" : "9feea44a-0fb3-4f20-b106-eef27a2e4238", "creationTime" : "2022-02-23T12:32:11.128Z", "state" : "RUNNING", "errors" : [ ], "progress" : 0.0, "startTime" : "2022-02-23T12:32:11.133Z", "type" : "list", "storageLocation" : "ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4", "insecure" : false, "skipBucketVerification" : false, "retry" : { "interval" : 10, "strategy" : "LINEAR", "maxAttempts" : 3, "enabled" : false }, "json" : true, "skipNodeCoordinatesResolution" : true, "humanUnits" : true, "simpleFormat" : false, "fromTimestamp" : 9223372036854775807, "lastN" : 0, "skipDownload" : false, "cacheDir" : "/icarus/.esop", "toRequest" : false, "concurrentConnections" : 1 }

~$ curl cassandra-dev00-ird:4567/operations/9feea44a-0fb3-4f20-b106-eef27a2e4238
{ "id" : "9feea44a-0fb3-4f20-b106-eef27a2e4238", "creationTime" : "2022-02-23T12:32:11.128Z", "state" : "COMPLETED", "errors" : [ ], "progress" : 1.0, "startTime" : "2022-02-23T12:32:11.133Z", "type" : "list", "storageLocation" : "ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/982ab74c-cd42-4885-a986-b61c3eb186b4", "insecure" : false, "skipBucketVerification" : false, "retry" : { "interval" : 10, "strategy" : "LINEAR", "maxAttempts" : 3, "enabled" : false }, "json" : true, "skipNodeCoordinatesResolution" : true, "humanUnits" : true, "simpleFormat" : false, "fromTimestamp" : 9223372036854775807, "lastN" : 0, "skipDownload" : false, "cacheDir" : "/icarus/.esop", "toRequest" : false, "concurrentConnections" : 1, "completionTime" : "2022-02-23T12:32:16.002Z" }

~$ curl --header "Content-Type: application/json" --data '{"type":"remove-backup", "globalRequest":true, "storageLocation" : "ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/1", "backupName":"autosnap-1645540428-73919840-436c-3d25-a9d8-42659f4e5722-1645540437983", "cacheDir":"/icarus/.esop" }' cassandra-dev00-ird:4567/operations
{ "id" : "eedc9cb5-a53a-4822-a4e3-0a7092b1e4c7", "creationTime" : "2022-02-23T12:35:17.105Z", "state" : "RUNNING", "errors" : [ ], "progress" : 0.0, "startTime" : "2022-02-23T12:35:17.108Z", "type" : "remove-backup", "storageLocation" : "ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/1", "insecure" : false, "skipBucketVerification" : false, "retry" : { "interval" : 10, "strategy" : "LINEAR", "maxAttempts" : 3, "enabled" : false }, "backupName" : "autosnap-1645540428-73919840-436c-3d25-a9d8-42659f4e5722-1645540437983", "dry" : false, "skipNodeCoordinatesResolution" : false, "olderThan" : { "value" : 0, "unit" : "SECONDS" }, "cacheDir" : "/icarus/.esop", "removeOldest" : false, "globalRequest" : true }

~$ curl cassandra-dev00-ird:4567/operations/eedc9cb5-a53a-4822-a4e3-0a7092b1e4c7
{ "id" : "eedc9cb5-a53a-4822-a4e3-0a7092b1e4c7", "creationTime" : "2022-02-23T12:35:17.105Z", "state" : "COMPLETED", "errors" : [ ], "progress" : 1.0, "startTime" : "2022-02-23T12:35:17.108Z", "type" : "remove-backup", "storageLocation" : "file:///icarus/.esop/cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/26ce2e8b-3304-410e-aa21-46cc59670e7a", "insecure" : false, "skipBucketVerification" : false, "retry" : { "interval" : 10, "strategy" : "LINEAR", "maxAttempts" : 3, "enabled" : false }, "backupName" : "autosnap-1645540428-73919840-436c-3d25-a9d8-42659f4e5722-1645540437983", "dry" : false, "skipNodeCoordinatesResolution" : false, "olderThan" : { "value" : 0, "unit" : "SECONDS" }, "cacheDir" : "/icarus/.esop", "removeOldest" : false, "globalRequest" : true, "completionTime" : "2022-02-23T12:35:22.582Z" }

rjb1971 commented 2 years ago

Is there some progress on this feature implementation? ( It raises expectations, when you reply so fast normally :-D)

smiklosovic commented 2 years ago

I will get back to it soon, I am just coping with the disappointment that it doesn't work already.

smiklosovic commented 2 years ago

@rjb1971 I am on it

smiklosovic commented 2 years ago

@rjb1971 i fixed all protocols but s3. I will finish it very soon.

smiklosovic commented 2 years ago

Hi @rjb1971

grab this JAR: https://oss.sonatype.org/content/repositories/snapshots/com/instaclustr/icarus/2.0.3-SNAPSHOT/icarus-2.0.3-20220310.075129-2.jar

Bodies I used:

I had 3 nodes in one DC, I sent this to one Icarus:

{
  "type": "backup",
  "storageLocation": "s3://myspecialbucket",
  "snapshotTag": "backup1",
  "globalRequest": true,
  "createMissingBucket": true,
  "skipBucketVerification": false,
  "entities": "abc",
  "dataDirs": [ "/var/lib/cassandra/data" ]
}

I listed it like:

esop list --sl s3://myspecialbucket/Test-Cluster/dc1/6dc64377-0092-4074-8cec-fabdd1e0c499

After this, you should have all manifests in ~/.esop.

I removed it like:

{
  "type":"remove-backup",
  "globalRequest":true,
  "storageLocation" : "s3://myspecialbucket/Test-Cluster/dc1/6dc64377-0092-4074-8cec-fabdd1e0c499",
  "backupName":"backup1-9b565f13-5d0b-3842-aefe-64462e9d83c5-1646866378959",
  "dry": false
}

There is no "skipNodeCoordinatesResolution" anymore, as you may notice. Instead, there is "resolveNodes" (--resolve-nodes), which defaults to false when not specified. In the above examples it is false (for listing and removing) because it is not used; that works because storageLocation contains the full path.

Try it with "dry": true first to see what it is going to delete beforehand.
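A dry run followed by the real removal could be scripted roughly as below. This is a sketch under assumptions: it needs Java 11+ (`java.net.http`), the class `RemoveBackupDryRun` is hypothetical, `http://localhost:4567` is a placeholder for wherever Icarus listens (4567 is the port used earlier in this thread), and the body contains only the fields from the remove-backup example above. The requests are built but not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of Icarus: builds the same remove-backup
// body twice, once as a dry run and once for real.
class RemoveBackupDryRun {

    // Values are assumed to contain no characters that need JSON escaping.
    static String body(String storageLocation, String backupName, boolean dry) {
        return String.format(
                "{\"type\":\"remove-backup\",\"globalRequest\":true,"
                        + "\"storageLocation\":\"%s\",\"backupName\":\"%s\",\"dry\":%b}",
                storageLocation, backupName, dry);
    }

    // POST <baseUrl>/operations with a JSON body, like the curl calls earlier.
    static HttpRequest post(String baseUrl, String jsonBody) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/operations"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
                .build();
    }

    public static void main(String[] args) {
        String location = "s3://myspecialbucket/Test-Cluster/dc1/6dc64377-0092-4074-8cec-fabdd1e0c499";
        String backup = "backup1-9b565f13-5d0b-3842-aefe-64462e9d83c5-1646866378959";

        // Step 1: dry run, then inspect the Icarus logs for what would be deleted.
        System.out.println(body(location, backup, true));
        // Step 2: the same request with "dry": false performs the removal.
        System.out.println(body(location, backup, false));

        System.out.println(post("http://localhost:4567", body(location, backup, true)).uri());
    }
}
```

Sending the request would take an `HttpClient`, e.g. `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, left out here so the sketch runs without a live Icarus.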

I'll cut new releases (esop too) if you are OK with that.

rjb1971 commented 2 years ago

Thank you. Only, I don't have the time to test it this week. I hope I can start testing it on Monday. I will keep you posted.

rjb1971 commented 2 years ago

Good news. I tested the remove-backup functionality and it works! After the remove-backup call, the manifest file was removed from all 3 node directories in the S3 repository and from the cache directory. I didn't check the data files, but I'm not even sure there was anything to delete.

I also tried the olderThan option and that works too.

Great job and Thanks again.

smiklosovic commented 2 years ago

I love to hear that. I am closing this one.