mattbostock opened 7 years ago
Going to use schema A (`<salt>:<bucket_end_time_as_YYYYMMDD>`) to start with. Load distribution could be greatly improved, but on small clusters (e.g. 5 nodes) this should not be a significant issue.

Keeping the partition key simple should help to keep the design simple. I can iterate on this later to improve the load distribution, at which point this issue can be re-opened.
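For concreteness, here is a minimal Go sketch of how a schema A key could be assembled. The issue doesn't specify how the salt is derived; this sketch assumes an FNV-1a hash of the metric name and sorted label pairs, and all function names are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"time"
)

// partitionKey assembles a schema A key, <salt>:<bucket_end_time_as_YYYYMMDD>.
// The salt here is an FNV-1a hash of the metric name plus sorted label pairs;
// the issue does not specify the salt's derivation, so treat this as a sketch.
func partitionKey(metric string, labels map[string]string, t time.Time) string {
	h := fnv.New32a()
	h.Write([]byte(metric))
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order so the same series always hashes alike
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte(labels[name]))
	}
	// Bucket end time: the end of the UTC day containing t.
	bucketEnd := t.UTC().Truncate(24 * time.Hour).Add(24 * time.Hour)
	return fmt.Sprintf("%08x:%s", h.Sum32(), bucketEnd.Format("20060102"))
}

func main() {
	fmt.Println(partitionKey("node_cpu", map[string]string{"instance": "web1:9100"}, time.Now()))
}
```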
Determine a partition key, which will be used to decide which nodes are responsible for which time series.
Trade-offs to consider:
- Load distribution
- Cluster size
- Data locality
Candidate keys:
- Metric names: metric names will vary widely, though it would not be surprising to see large groups of individual timeseries all sharing the same metric name, suggesting that metric names alone (not using labels) would not ensure a good distribution of timeseries across shards.
- Label names: since many timeseries could use the same metric name with different labels, the label names of a timeseries could be included in the partition key to ensure good distribution. We may want to exclude label names with special meaning in Prometheus (such as `le`, used for histograms) from the partition key when multiple series using that label may benefit from data locality in aggregate queries (see the sketch after this list).
- Label values: label values, in addition to label names and the metric name, also define unique timeseries, so they should be included in the partition key to ensure good distribution between nodes. However, users are likely to want to aggregate across multiple label values for the same label, so putting distinct label values on different shards will reduce data locality and increase the number of shards that must be involved in queries. Conversely, it may help to parallelise queries by retrieving the data from multiple nodes. Also, some label values may be queried much more frequently than others, so including label values in the partition key may be detrimental to an even distribution of query load.
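To make the `le` exclusion above concrete, here is a hedged sketch of selecting which label pairs participate in the hashed part of a key. The exclusion set and helper names are assumptions; the issue only names `le` as an example.

```go
package main

import (
	"fmt"
	"sort"
)

// excludedLabels lists label names left out of the partition key so that
// related series (e.g. all buckets of one histogram) share a shard. Only
// "le" is mentioned in the issue; the set is otherwise an assumption.
var excludedLabels = map[string]bool{"le": true}

// keyLabels returns the sorted name=value pairs that participate in the
// partition key, skipping excluded names.
func keyLabels(labels map[string]string) []string {
	pairs := make([]string, 0, len(labels))
	for name, value := range labels {
		if excludedLabels[name] {
			continue
		}
		pairs = append(pairs, name+"="+value)
	}
	sort.Strings(pairs) // stable order so the resulting hash is deterministic
	return pairs
}

func main() {
	fmt.Println(keyLabels(map[string]string{"le": "0.5", "job": "api", "instance": "a:9100"}))
	// => [instance=a:9100 job=api]
}
```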
Should also consider:
- Partition key as an index: without a centralised index of metric names and labels, the partition key becomes very important in determining how easily data can be queried.
- Multi-tenancy: if many tenants were writing to the same cluster, and those tenants all write to the same metric names (e.g. Node Exporter metrics) at consistent times of day (24 hours/day), it's important to avoid storing and querying that data on the same set of nodes for all tenants (a sketch of one mitigation follows this list). Conversely, if one tenant has significantly more timeseries than the others, it could result in poor distribution of load between nodes in the cluster.
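One possible mitigation for the multi-tenancy concern (my assumption, not something decided in this issue) is to mix a tenant identifier into the salt, so two tenants writing identical metric names still hash to different nodes:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// tenantSalt mixes the tenant ID into the per-series hash so tenants writing
// identical metric names and labels are still spread across different nodes.
// Purely illustrative; the issue does not settle on this approach.
func tenantSalt(tenantID, metric string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(metric))
	return h.Sum32()
}

func main() {
	fmt.Printf("%08x\n", tenantSalt("tenant-1", "node_cpu"))
	fmt.Printf("%08x\n", tenantSalt("tenant-2", "node_cpu")) // different salt, so different nodes
}
```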
Ideas for partition keys:
Schema A: Timestamp
Pros:
Cons:
Schema B: Timestamp, metric name and label pairs
Pros:
Cons:
Schema C: Timestamp and metric name
Pros:
Cons:
Schema D: Timestamp with greater precision
Pros:
Cons:
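As a rough illustration of how schemas B, C and D differ (schema A is sketched above), here is one way each key could be laid out. The separators, encodings and timestamp precision are my assumptions; the issue only names each schema's components.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Illustrative key layouts for schemas B, C and D. Separators and encodings
// are assumptions; the issue only names each schema's components.
func schemaBKey(day, metric string, labelPairs []string) string {
	return fmt.Sprintf("%s:%s:%s", day, metric, strings.Join(labelPairs, ","))
}

func schemaCKey(day, metric string) string {
	return fmt.Sprintf("%s:%s", day, metric)
}

// schemaDKey buckets by a finer-grained timestamp, here hourly as one
// possible reading of "greater precision".
func schemaDKey(t time.Time) string {
	return t.UTC().Format("2006010215")
}

func main() {
	fmt.Println(schemaBKey("20170901", "node_cpu", []string{"instance=a:9100", "job=node"}))
	fmt.Println(schemaCKey("20170901", "node_cpu"))
	fmt.Println(schemaDKey(time.Now()))
}
```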
Further ideas:
- Include a count of label key-value pairs in the partition key; queries could then filter to timeseries having at least as many pairs as specified in the query (see the sketch below).
- Use locality-sensitive hashing.
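A hedged sketch of the label-count idea: embed the number of label pairs in the key, then prune partitions whose count is below the number of matchers in a query. The encoding and names are illustrative.

```go
package main

import "fmt"

// keyWithCount appends the number of label pairs to a partition key so the
// query path can prune partitions cheaply. The encoding is an assumption.
func keyWithCount(base string, labelCount int) string {
	return fmt.Sprintf("%s:%d", base, labelCount)
}

// canMatch reports whether series carrying storedCount label pairs could
// satisfy a query with matcherCount matchers: a series with fewer pairs than
// the query has matchers can never match, so such partitions can be skipped.
func canMatch(storedCount, matcherCount int) bool {
	return storedCount >= matcherCount
}

func main() {
	fmt.Println(keyWithCount("a1b2c3d4:20170901", 2)) // a1b2c3d4:20170901:2
	fmt.Println(canMatch(2, 3))                       // false: prune this partition
}
```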
References:
- https://github.com/weaveworks/cortex/blob/b14eccfa302e5a3c3b8e17f9eb1330534fc67fd7/pkg/chunk/schema.go#L68-L133
- https://github.com/weaveworks/cortex/issues/298
- http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html