Distributed sybil

This is an umbrella task for working towards distributed queries with sybil as an aggregator.
Steps to get there
Deployment
setting up docker / k8s for deploying sybil + a small network wrapper for ingesting and querying
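A minimal docker-compose sketch of such a deployment might look like the following. The image names, volume layout, and aggregator service are all assumptions for illustration, not existing sybil tooling:

```yaml
# hypothetical compose file -- image names and ports are assumptions
version: "3"
services:
  sybil-leaf-1:
    image: sybil-node:latest      # assumed image: sybil binary + small network wrapper
    volumes:
      - leaf1-data:/var/sybil/db
  sybil-leaf-2:
    image: sybil-node:latest
    volumes:
      - leaf2-data:/var/sybil/db
  aggregator:
    image: sybil-aggregator:latest  # assumed query frontend / master aggregator
    ports:
      - "8080:8080"
    depends_on: [sybil-leaf-1, sybil-leaf-2]
volumes:
  leaf1-data:
  leaf2-data:
```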
Ingestion
scribe (or kafka) pipes + ingestors
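As a sketch of the ingestor side, a shim could batch records arriving from the pipe and feed them to sybil. The `sybil ingest -table <name>` invocation reading newline-delimited JSON from stdin is an assumption about the CLI; adjust to the actual binary's flags:

```python
import json
import subprocess
from itertools import islice

def batches(records, size):
    """Group an iterable of records into lists of at most `size`."""
    it = iter(records)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def ingest(records, table, size=500):
    """Pipe newline-delimited JSON batches into `sybil ingest`.

    Batching amortizes process startup; one sybil invocation per record
    would be far too slow for a scribe/kafka firehose.
    """
    for batch in batches(records, size):
        payload = "\n".join(json.dumps(r) for r in batch).encode()
        subprocess.run(["sybil", "ingest", "-table", table],
                       input=payload, check=True)
```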
Aggregation
aggregation of results from leaf nodes
sharding + recovery of downed nodes
client for querying from master aggregator
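One aggregation detail worth sketching: aggregates like averages cannot be merged by averaging leaf results, so each leaf must ship partial state (count, sum) and the aggregator finishes the division. The group keys and field names below are illustrative, not sybil's actual result format:

```python
from collections import defaultdict

def merge_leaf_results(leaf_results):
    """Merge per-group partial aggregates (count, sum) from each leaf,
    then finalize the average at the aggregator."""
    merged = defaultdict(lambda: {"count": 0, "sum": 0.0})
    for results in leaf_results:
        for key, agg in results.items():
            merged[key]["count"] += agg["count"]
            merged[key]["sum"] += agg["sum"]
    return {k: {"count": v["count"], "avg": v["sum"] / v["count"]}
            for k, v in merged.items()}
```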
Pruning
distributed pruning of data: run a histogram query on the time + size fields, then issue a trim command ("delete all data before time X"), where X is deduced from the histogram results.
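The cutoff X can be derived by walking the time/size histogram from newest to oldest until a retention budget is spent. A sketch, assuming the histogram query returns (bucket_start_time, bytes_in_bucket) pairs:

```python
def prune_cutoff(histogram, max_bytes):
    """Pick the oldest time X to keep so retained size stays <= max_bytes.

    Walk buckets newest-first, keeping them until the budget is exhausted;
    the trim command then deletes everything before the returned time.
    Returns None if even the newest bucket exceeds the budget.
    """
    kept = 0
    cutoff = None
    for start, size in sorted(histogram, reverse=True):
        if kept + size > max_bytes:
            break
        kept += size
        cutoff = start
    return cutoff
```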
Current Problem Areas
TBD
Fixed Areas
histograms assume the same min/max values for int columns, so bins are evenly sized on one leaf but may differ across leaves. need to switch to a more mergeable histogram representation (like t-digest) when sending results from leaf -> aggregator.
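To illustrate why a mergeable representation helps: if each leaf ships its histogram as weighted (value, count) centroids, the aggregator can merge distributions with different ranges and then compress. The greedy nearest-pair compression below is a crude stand-in for t-digest's size policy, not an implementation of it:

```python
def merge_centroids(leaves, max_centroids=32):
    """Merge per-leaf histograms shipped as (value, count) centroids.

    Centroid lists merge correctly even when leaves saw different
    min/max ranges, unlike fixed-width bins. Compression repeatedly
    combines the closest adjacent pair (weighted mean) until the
    size bound holds.
    """
    cents = sorted(c for leaf in leaves for c in leaf)
    while len(cents) > max_centroids:
        i = min(range(len(cents) - 1),
                key=lambda j: cents[j + 1][0] - cents[j][0])
        (v1, c1), (v2, c2) = cents[i], cents[i + 1]
        merged = ((v1 * c1 + v2 * c2) / (c1 + c2), c1 + c2)
        cents[i:i + 2] = [merged]
    return cents
```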
large cardinality + count distinct: currently implemented as len(Results), which obviously will not work for high cardinality.
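A sketch of the usual fix: each leaf keeps a HyperLogLog register array instead of materialized results, and register arrays merge across leaves by element-wise max. This is a textbook minimal HLL (raw estimator plus the small-range correction only), not sybil code:

```python
import hashlib
import math

P = 12               # 2**12 = 4096 registers, ~1.6% standard error
M = 1 << P

def _hash(x):
    """64-bit hash of a value (sha1-based for the sketch)."""
    return int.from_bytes(hashlib.sha1(str(x).encode()).digest()[:8], "big")

def hll_add(regs, value):
    h = _hash(value)
    idx = h & (M - 1)            # low P bits pick the register
    w = h >> P                   # remaining 64-P bits
    rank = (64 - P) - w.bit_length() + 1   # leading zeros + 1
    regs[idx] = max(regs[idx], rank)

def hll_merge(a, b):
    """Union of two sketches: element-wise max of registers."""
    return [max(x, y) for x, y in zip(a, b)]

def hll_count(regs):
    alpha = 0.7213 / (1 + 1.079 / M)
    est = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if est <= 2.5 * M and zeros:           # small-range correction
        est = M * math.log(M / zeros)
    return est
```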
large cardinality time series: if we compute the top 10 series on a leaf, we still need to send a buffer of extra series to the aggregator, since a globally top-10 series may rank just below the cutoff on any individual leaf.
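A sketch of that leaf/aggregator split, with hypothetical shapes for the per-leaf series totals. The buffer size is a heuristic, not a guarantee of exactness:

```python
from collections import Counter

def leaf_topk(series_totals, k, buffer=2):
    """Each leaf ships its top k series plus `buffer` extras, since a
    globally top-k series may rank k+1th or k+2th on this leaf."""
    return Counter(series_totals).most_common(k + buffer)

def global_topk(leaf_lists, k):
    """Aggregator sums the shipped partial totals and re-ranks."""
    merged = Counter()
    for lst in leaf_lists:
        for key, total in lst:
            merged[key] += total
    return merged.most_common(k)
```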