waggle-sensor / beehive-server

Waggle cloud software for aggregation, storage and analysis of sensor data from Waggle nodes.

Lay out expected queries and reports we'd like to generate #36

Open seanshahkarami opened 6 years ago

seanshahkarami commented 6 years ago

Having a clear idea of what kind of queries and reports we'd like to extract from our databases is crucial to knowing how to organize them. This impacts a number of things, which I'll add in the comments.

seanshahkarami commented 6 years ago

Which data stores do we need? Cassandra? MySQL? Elasticsearch? In particular, do we even need MySQL if the other two can cover all our use cases? You could imagine using Cassandra as our data and configuration warehouse, with Elasticsearch providing all the searchability and analytics.

seanshahkarami commented 6 years ago

How do we organize Cassandra tables? Cassandra is very sensitive to how you choose your partition / primary keys, particularly since there's not really a good concept of joins or building additional indices. This often means you need to design a table for a particular query, even if it means duplicating data.

Here's a concrete example: Suppose we want to support both bulk (daily) data pulls and efficient viewing into the last 72 hours of data from a particular node.

We could keep a table partitioned by node-id+date, as we do now. In addition, we can create a per-node "rolling window" table of recent data, partitioned by node-id in a "time sliceable" way, where entries have a TTL of 72 hours. Then our loader just inserts a copy of the data into both.
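
To make the two-table idea concrete, here's a rough sketch of what the schema and the loader's double write might look like. Everything named here is an assumption for illustration: the table names (`sensor_data_by_day`, `sensor_data_recent`), the columns, the `%s` placeholder style, and the `build_inserts` helper are all hypothetical and not taken from the current beehive-server code. It only builds the statements and parameters; actually executing them against a cluster is left out.

```python
from datetime import datetime, timezone

# 72-hour retention for the rolling-window table, in seconds.
RECENT_TTL_SECONDS = 72 * 60 * 60

# Hypothetical schema: table and column names are illustrative only.
# Bulk table: one partition per (node_id, day), suited to daily pulls.
CREATE_BY_DAY = """
CREATE TABLE IF NOT EXISTS sensor_data_by_day (
    node_id text,
    day text,
    ts timestamp,
    sensor text,
    reading text,
    PRIMARY KEY ((node_id, day), ts, sensor)
)
"""

# Rolling-window table: one partition per node, clustered by time so a
# recent range can be sliced out; rows expire through the table-level TTL.
CREATE_RECENT = f"""
CREATE TABLE IF NOT EXISTS sensor_data_recent (
    node_id text,
    ts timestamp,
    sensor text,
    reading text,
    PRIMARY KEY ((node_id), ts, sensor)
) WITH CLUSTERING ORDER BY (ts DESC, sensor ASC)
  AND default_time_to_live = {RECENT_TTL_SECONDS}
"""

# Placeholder style (%s) is an assumption; adjust to the driver in use.
INSERT_BY_DAY = (
    "INSERT INTO sensor_data_by_day (node_id, day, ts, sensor, reading) "
    "VALUES (%s, %s, %s, %s, %s)"
)
INSERT_RECENT = (
    "INSERT INTO sensor_data_recent (node_id, ts, sensor, reading) "
    "VALUES (%s, %s, %s, %s)"
)


def build_inserts(node_id, ts, sensor, reading):
    """Return the (statement, params) pairs the loader would run for one sample.

    `ts` is expected to be a timezone-aware datetime; the day bucket is
    computed in UTC so that partitions do not depend on the loader's locale.
    """
    day = ts.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return [
        (INSERT_BY_DAY, (node_id, day, ts, sensor, reading)),
        (INSERT_RECENT, (node_id, ts, sensor, reading)),
    ]
```

With something like this, a daily pull would read a single partition (`WHERE node_id = ? AND day = ?`), while the recent view would slice one node's partition by time (`WHERE node_id = ? AND ts >= ?`). The cost is that every sample is written twice, which is the usual Cassandra trade of storage for query-shaped tables.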