waggle-sensor / beehive-server

Waggle cloud software for aggregation, storage and analysis of sensor data from Waggle nodes.

Schedule automatic Cassandra backup #29

Open seanshahkarami opened 6 years ago

seanshahkarami commented 6 years ago

Since we're about to start a lot of work on beehive, we should make sure we have a Cassandra backup process in place.

I went ahead and built a tool to pull datasets; we just need to schedule it and have a place to keep the backups: https://github.com/waggle-sensor/beehive-server/tree/master/data-exporter

The missing half is a complementary script to do a restore, but at least the raw data is available now.
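As a starting point, scheduling could be as small as a wrapper like the sketch below dropped into cron. The exporter entry point, its flags, and the paths here are assumptions for illustration, not the actual data-exporter interface:

```python
#!/usr/bin/env python3
# Hypothetical nightly backup wrapper -- the exporter invocation and paths
# below are assumptions, not the real data-exporter CLI.
import datetime
import pathlib
import subprocess

BACKUP_ROOT = pathlib.Path("/mnt/backups/cassandra")  # assumed backup location

def run_export():
    # Timestamped directory per run so older snapshots are kept around.
    dest = BACKUP_ROOT / datetime.datetime.utcnow().strftime("%Y-%m-%dT%H%M%SZ")
    dest.mkdir(parents=True, exist_ok=True)
    # Assumed invocation; adjust to the exporter's real entry point and options.
    subprocess.run(
        ["python3", "export.py", "--output-dir", str(dest)],
        cwd="/home/waggle/beehive-server/data-exporter",  # assumed checkout path
        check=True,
    )

if __name__ == "__main__":
    run_export()
```

Scheduling it would then just be a crontab entry along the lines of `0 2 * * * /usr/local/bin/beehive-backup.py` (time and path are placeholders).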

seanshahkarami commented 6 years ago

The main thing blocking this is choosing a reliable storage location. One interesting point: if we're willing to consider S3 as a possible redundancy location, we automatically get an access-controlled, directory-like interface to the data stored there. In other words, we could go ahead and start pointing some people to it for pulling datasets.
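If we do go the S3 route, the sync step could be as small as this boto3 sketch; the bucket name and prefix scheme are placeholders for whatever we settle on:

```python
import pathlib
import boto3

# Placeholder bucket/prefix -- not an existing beehive bucket.
BUCKET = "beehive-backups"
PREFIX = "cassandra"

def upload_backup(local_dir: str) -> None:
    """Mirror one exported dataset directory into S3 under its own prefix."""
    s3 = boto3.client("s3")
    root = pathlib.Path(local_dir)
    for path in root.rglob("*"):
        if path.is_file():
            key = f"{PREFIX}/{root.name}/{path.relative_to(root)}"
            s3.upload_file(str(path), BUCKET, key)
```

Anyone granted read access to the bucket could then browse and pull datasets through the same key hierarchy, which is the "directory-like interface" mentioned above.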

seanshahkarami commented 6 years ago

After thinking this over last night, I don't think this is the right thing to have for the real deployment. It was more of an emergency reaction to how beehive1 is currently deployed. Cassandra is set up as a single node, so you don't get any of Cassandra's resilience guarantees during a failure...

Cassandra is designed specifically to use replication and eventual consistency between cluster nodes. In production, you'd run a number of nodes in a cluster so you can drop a certain number of them at any time and still continue running without data loss. Building on top of that reliability feature is the right way to go if we're using Cassandra.
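For reference, this is roughly what the replication side looks like from the Python driver; the contact points and keyspace name are just examples, not beehive's actual config:

```python
from cassandra.cluster import Cluster

# Example contact points -- replace with the cluster's actual addresses.
cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
session = cluster.connect()

# Every row written to this keyspace is stored on 2 of the 3 nodes, so any
# single node can drop out without losing data.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS sensor_data
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}
""")
```

(For a real multi-datacenter deployment we'd probably want NetworkTopologyStrategy rather than SimpleStrategy, but the idea is the same.)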

seanshahkarami commented 6 years ago

Turns out clustering works beautifully in the example I tried... It only took 10 minutes to set up a 3-node cluster on my own machine and load it with some test data. Using a keyspace with a replication factor of 2, things worked as expected: any single node could be taken completely offline and I still had access to the entire dataset. Just something to think about for production deployment...
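A quick way to repeat that check against any cluster is a sketch like the one below; the addresses, keyspace, and table name are placeholders standing in for whatever the test data used:

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

# Point only at the nodes that are still up -- placeholder addresses.
cluster = Cluster(["10.0.0.2", "10.0.0.3"])
session = cluster.connect("sensor_data")

# With replication_factor=2, a consistency-ONE read should still see every
# row even though one of the three nodes is offline.
query = SimpleStatement(
    "SELECT COUNT(*) FROM readings",
    consistency_level=ConsistencyLevel.ONE,
)
print(session.execute(query).one())
```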