mattbostock opened 7 years ago
Going to use schema A (`<salt>:<bucket_end_time_as_YYYYMMDD>`) to start with. Load distribution could be greatly improved, but on small clusters (e.g. 5 nodes) this should not be a significant issue.

Keeping the partition key simple should help to keep the design simple. I can iterate on this later to improve the load distribution, at which point this issue can be re-opened.
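For concreteness, here is a minimal Go sketch of how a schema A key could be assembled. The issue doesn't specify how the salt is derived; this sketch assumes an FNV-1a hash of the metric name and sorted label pairs, and all function names are illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"time"
)

// partitionKey assembles a schema A key, <salt>:<bucket_end_time_as_YYYYMMDD>.
// The salt here is an FNV-1a hash of the metric name plus sorted label pairs;
// the issue does not specify the salt's derivation, so treat this as a sketch.
func partitionKey(metric string, labels map[string]string, t time.Time) string {
	h := fnv.New32a()
	h.Write([]byte(metric))
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names) // deterministic order so the same series always hashes alike
	for _, name := range names {
		h.Write([]byte(name))
		h.Write([]byte(labels[name]))
	}
	// Bucket end time: the end of the UTC day containing t.
	bucketEnd := t.UTC().Truncate(24 * time.Hour).Add(24 * time.Hour)
	return fmt.Sprintf("%08x:%s", h.Sum32(), bucketEnd.Format("20060102"))
}

func main() {
	fmt.Println(partitionKey("node_cpu", map[string]string{"instance": "web1:9100"}, time.Now()))
}
```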
Determine a partition key, which will be used to decide which nodes are responsible for which time series.
Trade-offs to consider:
- Load distribution
- Cluster size
- Data locality
Candidate keys:
- Metric names: metric names will vary widely, though it would not be surprising to see large groups of individual timeseries all sharing the same metric name, suggesting that metric names alone (not using labels) would not ensure a good distribution of timeseries across shards.
- Label names: since many timeseries could use the same metric name with different labels, the label names of a timeseries could be included in the partition key to ensure good distribution. We may want to exclude label names with special meaning in Prometheus (such as `le`, used for histograms) from the partition key when multiple series using that label may benefit from data locality in aggregate queries (see the sketch after this list).
- Label values: label values, in addition to label names and the metric name, also define unique timeseries, so they should be included in the partition key to ensure good distribution between nodes. However, users are likely to want to aggregate across multiple label values for the same label, so putting distinct label values on different shards will reduce data locality and increase the number of shards that must be involved in queries. Conversely, it may help to parallelise queries by retrieving the data from multiple nodes. Also, some label values may be queried much more frequently than others, so including label values in the partition key may be detrimental to an even distribution of query load.
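To make the `le` exclusion above concrete, here is a hedged sketch of selecting which label pairs participate in the hashed part of a key. The exclusion set and helper names are assumptions; the issue only names `le` as an example.

```go
package main

import (
	"fmt"
	"sort"
)

// excludedLabels lists label names left out of the partition key so that
// related series (e.g. all buckets of one histogram) share a shard. Only
// "le" is mentioned in the issue; the set is otherwise an assumption.
var excludedLabels = map[string]bool{"le": true}

// keyLabels returns the sorted name=value pairs that participate in the
// partition key, skipping excluded names.
func keyLabels(labels map[string]string) []string {
	pairs := make([]string, 0, len(labels))
	for name, value := range labels {
		if excludedLabels[name] {
			continue
		}
		pairs = append(pairs, name+"="+value)
	}
	sort.Strings(pairs) // stable order so the resulting hash is deterministic
	return pairs
}

func main() {
	fmt.Println(keyLabels(map[string]string{"le": "0.5", "job": "api", "instance": "a:9100"}))
	// => [instance=a:9100 job=api]
}
```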
Should also consider:
- Partition key as an index: without a centralised index of metric names and labels, the partition key becomes very important in determining how easily data can be queried.
- Multi-tenancy: if many tenants were writing to the same cluster, and those tenants all write to the same metric names (e.g. Node Exporter metrics) at consistent times of day (24 hours/day), it's important to avoid storing and querying that data on the same set of nodes for all tenants (a sketch of one mitigation follows this list). Conversely, if one tenant has significantly more timeseries than the others, it could result in poor distribution of load between nodes in the cluster.
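One possible mitigation for the multi-tenancy concern (my assumption, not something decided in this issue) is to mix a tenant identifier into the salt, so two tenants writing identical metric names still hash to different nodes:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// tenantSalt mixes the tenant ID into the per-series hash so tenants writing
// identical metric names and labels are still spread across different nodes.
// Purely illustrative; the issue does not settle on this approach.
func tenantSalt(tenantID, metric string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	h.Write([]byte{0}) // separator so ("ab","c") and ("a","bc") hash differently
	h.Write([]byte(metric))
	return h.Sum32()
}

func main() {
	fmt.Printf("%08x\n", tenantSalt("tenant-1", "node_cpu"))
	fmt.Printf("%08x\n", tenantSalt("tenant-2", "node_cpu")) // different salt, so different nodes
}
```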
Ideas for partition keys:
Schema A: Timestamp
Pros:
Cons:
Schema B: Timestamp, metric name and label pairs
Pros:
Cons:
Schema C: Timestamp and metric name
Pros:
Cons:
Schema D: Timestamp with greater precision
Pros:
Cons:
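As a rough illustration of how schemas B, C and D differ (schema A is sketched above), here is one way each key could be laid out. The separators, encodings and timestamp precision are my assumptions; the issue only names each schema's components.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// Illustrative key layouts for schemas B, C and D. Separators and encodings
// are assumptions; the issue only names each schema's components.
func schemaBKey(day, metric string, labelPairs []string) string {
	return fmt.Sprintf("%s:%s:%s", day, metric, strings.Join(labelPairs, ","))
}

func schemaCKey(day, metric string) string {
	return fmt.Sprintf("%s:%s", day, metric)
}

// schemaDKey buckets by a finer-grained timestamp, here hourly as one
// possible reading of "greater precision".
func schemaDKey(t time.Time) string {
	return t.UTC().Format("2006010215")
}

func main() {
	fmt.Println(schemaBKey("20170901", "node_cpu", []string{"instance=a:9100", "job=node"}))
	fmt.Println(schemaCKey("20170901", "node_cpu"))
	fmt.Println(schemaDKey(time.Now()))
}
```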
Further ideas:
- Include a count of label key-value pairs in the partition key; queries could then filter to timeseries having at least as many pairs as specified in the query (see the sketch below).
- Use locality-sensitive hashing.
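A hedged sketch of the label-count idea: embed the number of label pairs in the key, then prune partitions whose count is below the number of matchers in a query. The encoding and names are illustrative.

```go
package main

import "fmt"

// keyWithCount appends the number of label pairs to a partition key so the
// query path can prune partitions cheaply. The encoding is an assumption.
func keyWithCount(base string, labelCount int) string {
	return fmt.Sprintf("%s:%d", base, labelCount)
}

// canMatch reports whether series carrying storedCount label pairs could
// satisfy a query with matcherCount matchers: a series with fewer pairs than
// the query has matchers can never match, so such partitions can be skipped.
func canMatch(storedCount, matcherCount int) bool {
	return storedCount >= matcherCount
}

func main() {
	fmt.Println(keyWithCount("a1b2c3d4:20170901", 2)) // a1b2c3d4:20170901:2
	fmt.Println(canMatch(2, 3))                       // false: prune this partition
}
```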
References:
- https://github.com/weaveworks/cortex/blob/b14eccfa302e5a3c3b8e17f9eb1330534fc67fd7/pkg/chunk/schema.go#L68-L133
- https://github.com/weaveworks/cortex/issues/298
- http://opentsdb.net/docs/build/html/user_guide/backends/hbase.html