waggle-sensor / beehive-server

Waggle cloud software for aggregation, storage and analysis of sensor data from Waggle nodes.

Investigate options for bulk data export #25

Closed seanshahkarami closed 6 years ago

seanshahkarami commented 6 years ago

I think the simplest way to do this without having to build or significantly change any other layers on beehive is to expose Cassandra locally within beehive and add an "exporter" role that can only do a SELECT on specified data tables. (I think the last part is important even just to prevent us from making a mistake. You don't want an exporter to accidentally destroy a table!)
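That read-only role could look something like the following CQL, assuming Cassandra's built-in role-based authorization is enabled (the keyspace, table, and password here are illustrative, not the real beehive names):

```sql
-- Illustrative only: role, keyspace, and table names are assumptions.
CREATE ROLE exporter WITH PASSWORD = 'changeme' AND LOGIN = true;

-- Grant read-only access to the specific data table, nothing else.
GRANT SELECT ON TABLE waggle.sensor_data TO exporter;
```

Since the role is never granted MODIFY or DROP, even a buggy export tool can't alter or destroy the table.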

This would allow us to write a couple special purpose tools with good performance to do things like bulk backups and exports.

This could even be scheduled to periodically batch, compress and store the data on a mass data store like S3 daily.
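The daily batch-and-compress step might be sketched like this; the S3 upload and the scheduling mechanism (e.g. cron) are assumptions, and the bucket name is a placeholder:

```python
import gzip
import shutil
from pathlib import Path

def compress_csv(path):
    """Gzip a CSV export alongside the original, returning the .gz path."""
    src = Path(path)
    dst = src.with_suffix(src.suffix + ".gz")
    with open(src, "rb") as f_in, gzip.open(dst, "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)
    return dst

# Uploading could then be a one-liner with boto3, run daily from cron:
#   boto3.client("s3").upload_file(str(gz_path), "beehive-exports", gz_path.name)
# (bucket name and credentials setup are assumptions).
```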

seanshahkarami commented 6 years ago

Since I was a little concerned about making sure we're able to do this for backups ASAP, I went ahead and added two tools to:

https://github.com/waggle-sensor/beehive-server/tree/master/data-exporter

- `export` exports all the datasets from a specific node
- `exportall` exports all the datasets

These export datasets to CSV files in `data/<node_id>/<date>.csv`.
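The output layout can be sketched as below; the row shape (a dict with `node_id`, `date`, and the data fields) is an assumption for illustration, since the real tools read rows from Cassandra:

```python
import csv
from pathlib import Path

def write_datasets(rows, out_dir="data"):
    """Write rows out as data/<node_id>/<date>.csv, one file per node per day.

    Row shape is assumed: dicts with node_id, date, timestamp, sensor, value.
    """
    writers = {}
    files = []
    try:
        for row in rows:
            key = (row["node_id"], row["date"])
            if key not in writers:
                path = Path(out_dir) / row["node_id"] / (row["date"] + ".csv")
                path.parent.mkdir(parents=True, exist_ok=True)
                f = open(path, "w", newline="")
                files.append(f)
                w = csv.writer(f)
                w.writerow(["timestamp", "sensor", "value"])  # assumed columns
                writers[key] = w
            writers[key].writerow([row["timestamp"], row["sensor"], row["value"]])
    finally:
        for f in files:
            f.close()
```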

(These will eventually be merged into one tool; I just wrote `export` as a quick prototype.)

seanshahkarami commented 6 years ago

That was faster than I expected! Doing a full export this way took about 15 minutes. We just need a good place to keep the data. It's just under 5 GB uncompressed, so space isn't really an issue.

seanshahkarami commented 6 years ago

As another data point, exporting all of the new Panasonic node's data took about 7 seconds.