jamessanford / remote-tsdb-clickhouse

A remote writer/reader for Prometheus that stores TSDB data in ClickHouse
Apache License 2.0

Update the recommended ORDER BY for metrics.samples table #9

Closed by jamessanford 3 months ago

jamessanford commented 3 months ago

For large datasets with high label cardinality, the previous ORDER BY was slow at resolving queries like node_cpu_seconds_total{node="node99", cpu="0"} over specific time ranges.

When label cardinality is in the millions, testing shows that retrieving values for specific labels is faster with ORDER BY (metric_name, updated_at, labels) than with the previous ORDER BY (metric_name, labels, updated_at).
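For reference, here is a minimal sketch of a table shape consistent with the queries below. The column names come from the queries in this issue; the exact types, codecs, and partitioning in the repo's actual schema may differ.

CREATE TABLE metrics.samples
(
    updated_at  DateTime,                -- assumed type
    metric_name LowCardinality(String),  -- assumed type
    labels      Array(String),
    value       Float64
)
ENGINE = MergeTree
ORDER BY (metric_name, updated_at, labels);  -- recommended here; previously (metric_name, labels, updated_at)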

Example query speeds:

SELECT
    metric_name,
    arraySort(labels) AS slb,
    toStartOfInterval(updated_at, INTERVAL 7 second) AS t,
    max(value) AS max_0
FROM metrics.samples
WHERE t >= 1711860427 AND t <= 1711864927
  AND has(labels, 'cpu=0')
  AND has(labels, 'node=node99')
  AND metric_name = 'node_cpu_seconds_total'
GROUP BY metric_name, slb, t
ORDER BY metric_name, slb, t;
[table created with] ORDER BY metric_name, labels, updated_at

600 rows in set. Elapsed: 4.131 sec. Processed 67.11 million rows, 1.46 GB (16.24 million rows/s., 354.00 MB/s.)
[table created with] ORDER BY metric_name, updated_at, labels

600 rows in set. Elapsed: 0.383 sec. Processed 66.58 million rows, 1.20 GB (173.89 million rows/s., 3.14 GB/s.)

The total number of rows scanned is similar, but with the old ORDER BY the server spends several seconds deciding which rows to scan before any scanning actually begins.
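One way to see where that time goes is ClickHouse's EXPLAIN with index analysis, which reports how many parts and granules the primary key analysis selects before reading starts. A sketch of how to check it (output shape varies by server version):

EXPLAIN indexes = 1
SELECT max(value)
FROM metrics.samples
WHERE updated_at >= 1711860427 AND updated_at <= 1711864927
  AND has(labels, 'node=node99')
  AND metric_name = 'node_cpu_seconds_total';

With the old ordering, the key analysis has to work through the high-cardinality labels column before it can narrow by updated_at, which appears to be where the seconds of pre-scan processing go.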

This example uses generated sample data with the following cardinality:

SELECT count(distinct metric_name, labels) FROM metrics.samples WHERE updated_at >= 1711860427 AND updated_at <= 1711864927;

┌─countDistinct(metric_name, labels)─┐
│                            3548769 │ -- 3.55 million
└────────────────────────────────────┘

The sample data was generated with a script similar to this:

#!/usr/bin/env python3
# Emit fake node_cpu_seconds_total samples as TSV:
#   timestamp <TAB> metric_name <TAB> labels <TAB> value
# Usage: ./<script> <startnode> <endnode>

import random
import socket
import struct
import sys

lookup = {}  # running counter value for each distinct label set
starttime = 1711843200
startnode = int(sys.argv[1])
endnode = int(sys.argv[2])

for node in range(startnode, endnode):
    nodename = "node{}".format(node)
    # Derive a fake instance address from the node name. Note that hash()
    # is salted per interpreter run, so addresses differ between runs.
    instance = "{}:9100".format(
        socket.inet_ntoa(struct.pack(">L", hash(nodename) & 0xFFFFFFFF)))
    for timestamp in range(0, 86400, 60):  # one sample per minute for a day
        timeval = timestamp + starttime
        for cpuid in range(32):
            for mode in 'idle,iowait,irq,nice,softirq,steal,system,user'.split(','):
                labels = "['cpu={}','instance={}','job=node-exporter','mode={}','node={}']".format(
                    cpuid, instance, mode, nodename)
                # Counters only increase, so accumulate a random increment.
                val = lookup[labels] = lookup.get(labels, 0) + (random.randint(0, 99) / 10)
                print("{}\tnode_cpu_seconds_total\t{}\t{}".format(timeval, labels, val))
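The script writes TSV rows to stdout, so its output can be piped straight into clickhouse-client. The script filename here is hypothetical, and the column list assumes the table sketch above:

python3 gen_samples.py 0 14000 | clickhouse-client --query "INSERT INTO metrics.samples (updated_at, metric_name, labels, value) FORMAT TabSeparated"

Each node contributes 32 CPUs × 8 modes = 256 distinct label sets, so the ~3.55 million series counted above corresponds to roughly 13,900 nodes.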
jamessanford commented 3 months ago

This suggestion got reverted in https://github.com/jamessanford/remote-tsdb-clickhouse/pull/10