instaclustr / icarus

Sidecar for Cassandra with integrated backup / restore
https://instaclustr.com
Apache License 2.0

[BUG] icarus restore fails when dropped tables still have cassandra snapshots or backups #9

Closed rjb1971 closed 2 years ago

rjb1971 commented 2 years ago

Describe the bug
We tried to restore a backup, using the Icarus REST interface for both the backup and the restore. During the restore we get this error:
ERROR [MutationStage-1] 2022-02-23 08:49:09,695 TruncateVerbHandler.java:44 - Error in truncation
java.lang.IllegalArgumentException: Unknown keyspace/cf pair (gms.incidents_by_label_mv)

This table existed before but was dropped a long time ago. It isn't described in the manifest of the backup. But on the Cassandra filesystem we still see several directories like:
incidents_by_label_mv-2edad310a01d11eaa83cb14d6a129718
incidents_by_label_mv-2edad310a01d11eaa83cb14d6a129718/backups
incidents_by_label_mv-2edad310a01d11eaa83cb14d6a129718/snapshots
incidents_by_label_mv-2edad310a01d11eaa83cb14d6a129718/snapshots/1590585781812-incidents_by_label_mv
incidents_by_label_mv-2edad310a01d11eaa83cb14d6a129718/snapshots/1590585781812-incidents_by_label_mv/manifest.json
incidents_by_label_mv-2edad310a01d11eaa83cb14d6a129718/snapshots/1590585781812-incidents_by_label_mv/schema.cql

It looks like the restore tool tries to refresh tables that no longer exist, because it found a directory with snapshots of the old table on the Cassandra filesystem.

To Reproduce
1. Create a table in Cassandra.
2. Create a snapshot within Cassandra itself.
3. Drop the table.
4. Create a backup with Icarus.
5. Restore the backup with Icarus.

Expected behavior
It should be able to restore the database without problems. The restore should probably detect that the table doesn't exist and skip the refresh step (which truncates the table).

Versions (please complete the following information):

Additional context
Complete error:
ERROR [MutationStage-1] 2022-02-23 08:49:09,695 TruncateVerbHandler.java:44 - Error in truncation
java.lang.IllegalArgumentException: Unknown keyspace/cf pair (gms.incidents_by_label_mv)
    at org.apache.cassandra.db.Keyspace.getColumnFamilyStore(Keyspace.java:198) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.db.TruncateVerbHandler.doVerb(TruncateVerbHandler.java:39) ~[apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) [apache-cassandra-3.11.6.jar:3.11.6]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_252]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137) [apache-cassandra-3.11.6.jar:3.11.6]
    at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:113) [apache-cassandra-3.11.6.jar:3.11.6]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_252]

Possible solution: when parsing the Cassandra data in com.instaclustr.esop.impl.CassandraData.parse (line 296), check whether the directory contains table files (*.db).
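To make the proposed check concrete, here is a minimal sketch in Python. It is not the esop implementation (which is Java); the function name `has_sstable_files`, the SSTable file name and the table id used for `labels_by_incident` are made up for illustration. It only looks at the top level of the table directory, so files under snapshots/ or backups/ do not count.

```python
from pathlib import Path
import tempfile


def has_sstable_files(table_dir: Path) -> bool:
    """Return True if the table directory itself holds at least one *.db file.

    Path.glob("*.db") is not recursive, so anything under snapshots/ or
    backups/ is ignored.
    """
    return any(p.is_file() for p in table_dir.glob("*.db"))


# Rebuild the layout from this report in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)

    # Dropped view: only snapshots/ and backups/ are left on disk.
    dropped = root / "incidents_by_label_mv-2edad310a01d11eaa83cb14d6a129718"
    snapshot = dropped / "snapshots" / "1590585781812-incidents_by_label_mv"
    snapshot.mkdir(parents=True)
    (dropped / "backups").mkdir()
    (snapshot / "manifest.json").write_text("{}")

    # Live table: hypothetical table id and SSTable file name.
    live = root / "labels_by_incident-00000000000000000000000000000000"
    live.mkdir()
    (live / "md-1-big-Data.db").write_text("")

    results = {
        d.name.rsplit("-", 1)[0]: has_sstable_files(d)
        for d in sorted(root.iterdir())
    }
    print(results)
```

One caveat with this heuristic: a table that still exists but holds no flushed data would also have no *.db files, so it would be treated the same as a dropped one.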

smiklosovic commented 2 years ago

Hi @rjb1971

You hit an interesting corner case. If you drop a table, it will (normally) create a snapshot with the prefix "dropped-", and I am already filtering these out. Check the SnapshotsLister class in the CassandraData class.

So the only way I see how it might slip through is that you have Cassandra configured in such a way that it will not create a snapshot (with dropped- prefix) upon dropping a table. Is this true? You can check that by looking into the configuration of your Cassandra node:

# Whether or not a snapshot is taken of the data before keyspace truncation
# or dropping of column families. The STRONGLY advised default of true 
# should be used to provide data safety. If you set this flag to false, you will
# lose data on truncation or drop.
auto_snapshot: true

If the table was dropped in a way that did not create a "dropped-" snapshot and some other snapshot is still there, I cannot recognise that the table no longer exists, because there is nothing on disk to tell the difference. I would have to talk to Cassandra via CQL to get the schema instead of deriving it from the table directories, but that is not done yet, and it would only work when your nodes are up, which would rule out "offline" restoration paths.

For restores there is a flag called "entities". How did you set it? When you restore, you need to explicitly enumerate the tables or keyspaces you want to restore. I do not think it checks that you have backups for the tables you are truncating.

rjb1971 commented 2 years ago

This table was dropped a long time ago (about 2 years), before I joined this team. But this is what I could find. The drop was done with the following command:
DROP MATERIALIZED VIEW IF EXISTS gms.incidents_by_label_mv;

I can't find any references to setting or changing the auto_snapshot property in our projects or on the Cassandra nodes. It looks like it is currently set to auto_snapshot=true:
./cassandra/conf.default/cassandra.yaml:auto_snapshot: true
/cassandra/logs/cassandra.log:INFO [main] 2022-02-23 12:19:27,113 Config.java:516 - Node configuration:[allocate_tokens_for_keyspace=null; authenticator=PasswordAuthenticator; authorizer=CassandraAuthorizer; auto_bootstrap=true; auto_snapshot=true; ..............

I didn't set entities. I used the following commands to back up and restore:
curl --header "Content-Type: application/json" --data '{"type":"backup", "globalRequest":"true", "storageLocation" : "ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/1", "metadataDirective":"REPLACE", "dataDirs":["/icarus/cassandra/data/data"], "skipRefreshing":"true"}' cassandra-dev00-ird:4567/operations

curl --header "Content-Type: application/json" --data '{"type":"restore", "globalRequest":"true", "dataDirs":["/icarus/cassandra/data/data"], "snapshotTag":"autosnap-1645605747", "restorationPhase":"INIT", "restorationStrategyType":"HARDLINKS", "storageLocation":"ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/2", "import":{"type":"import", "sourceDir":"/icarus/tmp/"}, "resolveHostIdFromTopology":"true", "cassandraDirectory":"/icarus/cassandra/data/"}' cassandra-dev00-ird:4567/operations
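The restore request above does not set the "entities" flag mentioned earlier in this thread. As a rough sketch, this is how the same request body could be built with an explicit list of keyspaces or tables. The field name "entities" comes from the thread; the comma-separated value format and the helper `with_entities` are assumptions for illustration and should be checked against the Icarus documentation. Nothing is sent over the network here.

```python
import json
from typing import Dict, List

# Fields copied from the restore request in this thread.
restore_request = {
    "type": "restore",
    "globalRequest": "true",
    "dataDirs": ["/icarus/cassandra/data/data"],
    "snapshotTag": "autosnap-1645605747",
    "restorationPhase": "INIT",
    "restorationStrategyType": "HARDLINKS",
    "storageLocation": "ceph://cassandra-icarus2-ird-backup-dev/cassandra_gms_dev/rc3/2",
    "import": {"type": "import", "sourceDir": "/icarus/tmp/"},
    "resolveHostIdFromTopology": "true",
    "cassandraDirectory": "/icarus/cassandra/data/",
}


def with_entities(request: Dict, entities: List[str]) -> Dict:
    """Return a copy of the request with an explicit "entities" field.

    ASSUMPTION: the value is a comma-separated string of keyspaces and/or
    keyspace.table names. The original request is left unmodified.
    """
    if not entities:
        raise ValueError("at least one keyspace or keyspace.table is required")
    updated = dict(request)
    updated["entities"] = ",".join(entities)
    return updated


# Restore only the gms keyspace (illustrative choice).
payload = json.dumps(with_entities(restore_request, ["gms"]), indent=2)
print(payload)
```

The printed JSON could then be passed to curl via --data in place of the inline body.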

smiklosovic commented 2 years ago

The only way you might have a dropped table still present on disk without a "dropped-" snapshot is that it was either already removed or it was not created in the first place.

rjb1971 commented 2 years ago

Not sure what you mean by "it was either already removed". Removed before what?

And are you saying that Cassandra is in an inconsistent state right now? I also checked, and production is in the same situation: no table, but the snapshot directory is still present without any "dropped-" snapshots.

Not sure how to proceed from here.

smiklosovic commented 2 years ago

If that table does not exist in Cassandra (check via cqlsh) and it is supposed to be deleted, I do not see any reason why you would keep these files on disk. You may just delete them.

smiklosovic commented 2 years ago

Can you please do a simple experiment for me? After you drop that table, can you check whether it has a "dropped-" snapshot? If it does, why doesn't that other table, which no longer exists, have one?

rjb1971 commented 2 years ago

The table isn't in Cassandra anymore; we checked that already. I can delete the files, but isn't there any reference in Cassandra to those snapshot files?

About the experiment: I will see if I can make some time to try it. Not sure if I can do it this week.

smiklosovic commented 2 years ago

If it does not exist from Cassandra's point of view (in the CQL schema), you can remove it; it is not referenced anywhere.
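Before removing anything, it may help to list which table directories have no matching entry in the schema. The following is only a sketch: the function name `dirs_without_schema` is made up, the set of live tables is assumed to be collected by hand from cqlsh, and the directory pattern `<table>-<32 hex characters>` is inferred from the listings in this thread. It reports directories and deletes nothing.

```python
import re
import tempfile
from pathlib import Path
from typing import List, Set

# Table directories in this thread look like "<table>-<32 hex characters>".
TABLE_DIR = re.compile(r"^(?P<table>.+)-(?P<id>[0-9a-f]{32})$")


def dirs_without_schema(keyspace_dir: Path, live_tables: Set[str]) -> List[str]:
    """List table directories whose table name is not in live_tables."""
    leftovers = []
    for entry in sorted(keyspace_dir.iterdir()):
        match = TABLE_DIR.match(entry.name)
        if entry.is_dir() and match and match.group("table") not in live_tables:
            leftovers.append(entry.name)
    return leftovers


# Demo layout built in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    keyspace_dir = Path(tmp) / "gms"
    for name in (
        "incidents_by_label_mv-2edad310a01d11eaa83cb14d6a129718",
        "rb_test-91e26860957111ec8761898044132165",
        "labels_by_incident-00000000000000000000000000000000",  # hypothetical id
    ):
        (keyspace_dir / name / "snapshots").mkdir(parents=True)

    # ASSUMPTION: this set was read from the CQL schema, e.g. via cqlsh.
    leftovers = dirs_without_schema(keyspace_dir, {"labels_by_incident"})
    for name in leftovers:
        print(name)
```

Matching on the table name alone would miss a table that was dropped and later recreated under the same name, since the old directory carries a different id. Comparing ids would need the id from the schema as well.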

rjb1971 commented 2 years ago

OK, I made time to do the test just now. During the test I found out that "Materialized views are experimental". Can this be part of the problem? Or is it because it isn't a table but a view?

This is what I did:
Create the view:
CREATE MATERIALIZED VIEW gms.rb_test AS
... SELECT label_name, label_id, incident_id FROM gms.labels_by_incident WHERE label_name IS NOT NULL AND label_id IS NOT NULL AND incident_id IS NOT NULL
... PRIMARY KEY (label_name, label_id, incident_id);

Warnings : Materialized views are experimental and are not recommended for production use.

Then I created a snapshot RBTEST

Then I dropped the view:
useradmin@cqlsh> DROP MATERIALIZED VIEW IF EXISTS gms.rb_test

And the list of files left on disk:
rb_test-91e26860957111ec8761898044132165
rb_test-91e26860957111ec8761898044132165/backups
rb_test-91e26860957111ec8761898044132165/snapshots
rb_test-91e26860957111ec8761898044132165/snapshots/RBTEST-2022-02-24-13-03-18
rb_test-91e26860957111ec8761898044132165/snapshots/RBTEST-2022-02-24-13-03-18/manifest.json
rb_test-91e26860957111ec8761898044132165/snapshots/RBTEST-2022-02-24-13-03-18/schema.cql
rb_test-91e26860957111ec8761898044132165/snapshots/1645707876347-rb_test
rb_test-91e26860957111ec8761898044132165/snapshots/1645707876347-rb_test/manifest.json
rb_test-91e26860957111ec8761898044132165/snapshots/1645707876347-rb_test/schema.cql

smiklosovic commented 2 years ago

Please post the schema of gms.labels_by_incident here.

rjb1971 commented 2 years ago

CREATE TABLE IF NOT EXISTS gms.labels_by_incident (
    incident_id text,
    label_id uuid,
    label_name text,
    label_color text,
    label_created timestamp,
    label_created_by text,
    PRIMARY KEY ((incident_id), label_id)
) WITH CLUSTERING ORDER BY (label_id ASC);

smiklosovic commented 2 years ago

I think we found a bug in Cassandra. If you drop a materialized view, it creates a snapshot, same as for a table. But dropping a table adds the "dropped-" prefix to the snapshot name, while dropping a materialized view does not. I'll ask around whether this is known or not.
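To illustrate why the view's snapshot slips through a prefix-based filter, here is a small sketch using the snapshot names from this thread. It is not the actual SnapshotsLister code; the function `is_filtered_out` and the "dropped-..." example name are hypothetical.

```python
DROPPED_PREFIX = "dropped-"


def is_filtered_out(snapshot_name: str) -> bool:
    """A snapshot is skipped only when its name starts with "dropped-"."""
    return snapshot_name.startswith(DROPPED_PREFIX)


snapshots = [
    "dropped-1590585781812-some_table",     # hypothetical name for a dropped table
    "1590585781812-incidents_by_label_mv",  # left behind by the dropped view
    "1645707876347-rb_test",                # left behind by the test view
    "RBTEST-2022-02-24-13-03-18",           # manually created snapshot
]

# Everything without the prefix is kept, so the directory still looks like a table.
kept = [name for name in snapshots if not is_filtered_out(name)]
for name in kept:
    print(name)
```

With only the name to go on, the snapshot left by the dropped view cannot be told apart from any other snapshot that lacks the prefix.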

smiklosovic commented 2 years ago

@rjb1971 it will be fixed here https://issues.apache.org/jira/browse/CASSANDRA-17415

smiklosovic commented 2 years ago

@rjb1971 it is merged, will be in 3.11.13