timescale / pg_prometheus

PostgreSQL extension for Prometheus data
Apache License 2.0

Modelling Prometheus with two tables (`values` and `labels`) is an order of magnitude slower than Prometheus on a basic query #55

Open leventov opened 3 years ago

leventov commented 3 years ago

The design with two tables, `values` and `labels`, is suggested here, in the "Pg_Prometheus" section.

Schema

Here is the part of our implementation of this design that corresponds to the `values` table in the documentation:

CREATE TABLE IF NOT EXISTS series_values (
    time TIMESTAMPTZ NOT NULL, 
    value DOUBLE PRECISION NOT NULL, 
    series_id INTEGER NOT NULL,
    seq BIGSERIAL
);

SELECT create_hypertable('series_values', 'time');

ALTER TABLE series_values SET (
  timescaledb.compress,
  timescaledb.compress_segmentby = 'series_id',
  timescaledb.compress_orderby = 'time DESC, seq DESC'
);

CREATE INDEX IF NOT EXISTS series_values_series_id_idx ON series_values USING BTREE (series_id, time DESC);
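
For context, the series table referenced in the queries below (the analogue of the `labels` table in the documentation) is roughly of this shape; only the id and metric columns matter for this issue, the rest is just a sketch:

CREATE TABLE IF NOT EXISTS series (
    id SERIAL PRIMARY KEY,   -- referenced by series_values.series_id
    metric TEXT NOT NULL,    -- matched against in the queries below
    labels JSONB             -- sketch: how the remaining labels are stored is irrelevant here
);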

Our data has about 250 unique series, each with 1 data point per second, i.e. about 250 data points per second in total.

Problem no. 1 - a Hash Join when selecting just 192 indexed series_id values is suboptimal

EXPLAIN ANALYZE SELECT series_id, avg("value") from series_values where series_id IN (select id from series where metric = 'some_metric_name') and time between '2020-09-19 09:00:00' and '2020-09-19 11:00:00' group by series_id;

                                                                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=65310.35..65386.85 rows=200 width=12) (actual time=24017.786..24023.399 rows=192 loops=1)
   Group Key: _hyper_1_40_chunk.series_id
   ->  Gather Merge  (cost=65310.35..65381.35 rows=600 width=36) (actual time=24017.712..24106.440 rows=768 loops=1)
         Workers Planned: 3
         Workers Launched: 3
         ->  Sort  (cost=64310.31..64310.81 rows=200 width=36) (actual time=23901.632..23901.857 rows=192 loops=4)
               Sort Key: _hyper_1_40_chunk.series_id
               Sort Method: quicksort  Memory: 42kB
               Worker 0:  Sort Method: quicksort  Memory: 42kB
               Worker 1:  Sort Method: quicksort  Memory: 42kB
               Worker 2:  Sort Method: quicksort  Memory: 42kB
               ->  Partial HashAggregate  (cost=64300.67..64302.67 rows=200 width=36) (actual time=23892.381..23893.048 rows=192 loops=4)
                     Group Key: _hyper_1_40_chunk.series_id
                     ->  Hash Join  (cost=15.29..62614.73 rows=337187 width=12) (actual time=5.554..22098.783 rows=325488 loops=4)
                           Hash Cond: (_hyper_1_40_chunk.series_id = series.id)
                           ->  Parallel Append  (cost=0.43..61307.99 rows=484707 width=12) (actual time=4.074..20498.857 rows=349109 loops=4)
                                 ->  Parallel Index Scan using _hyper_1_40_chunk_series_values_time_idx on _hyper_1_40_chunk  (cost=0.43..58884.45 rows=484707 width=12) (actual time=4.070..19998.534 rows=349109 loops=4)
                                       Index Cond: (("time" >= '2020-09-19 09:00:00+00'::timestamp with time zone) AND ("time" <= '2020-09-19 11:00:00+00'::timestamp with time zone))
                           ->  Hash  (cost=12.45..12.45 rows=192 width=4) (actual time=1.217..1.218 rows=192 loops=4)
                                 Buckets: 1024  Batches: 1  Memory Usage: 10kB
                                 ->  Seq Scan on series  (cost=0.00..12.45 rows=192 width=4) (actual time=0.143..0.865 rows=192 loops=4)
                                       Filter: (metric = 'some_metric_name'::text)
                                       Rows Removed by Filter: 84
 Planning Time: 3.240 ms
 Execution Time: 24109.908 ms

The query planner decided to do a Hash Join with just 192 rows from the series table, which doesn't make sense.

When I replace the nested SELECT query with series_id IN (21,22,...212), i.e. literally all the selected values, Timescale does the same (I'm not posting the query and the result because the query string is very long).
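
As a sanity check (not something I'd keep in production), the planner can be steered away from the Hash Join for a single session with the standard PostgreSQL enable_hashjoin setting, to see which plan it falls back to:

-- Diagnostic only: disable hash joins for this session and re-run the same query.
SET enable_hashjoin = off;
EXPLAIN ANALYZE SELECT series_id, avg("value") from series_values where series_id IN (select id from series where metric = 'some_metric_name') and time between '2020-09-19 09:00:00' and '2020-09-19 11:00:00' group by series_id;
RESET enable_hashjoin;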

But I can make Timescale do an Index Scan instead of a Hash Join by using a series_id BETWEEN 21 AND 212 clause. This is 40% faster:

EXPLAIN ANALYZE SELECT series_id, avg("value") from series_values where series_id BETWEEN 21 and 212 and time between '2020-09-19 09:00:00' and '2020-09-19 11:00:00' group by series_id;
                                                                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=66793.95..66870.45 rows=200 width=12) (actual time=14220.642..14227.494 rows=192 loops=1)
   Group Key: _hyper_1_40_chunk.series_id
   ->  Gather Merge  (cost=66793.95..66864.95 rows=600 width=36) (actual time=14220.567..14299.768 rows=768 loops=1)
         Workers Planned: 3
         Workers Launched: 3
         ->  Sort  (cost=65793.91..65794.41 rows=200 width=36) (actual time=14098.952..14099.158 rows=192 loops=4)
               Sort Key: _hyper_1_40_chunk.series_id
               Sort Method: quicksort  Memory: 42kB
               Worker 0:  Sort Method: quicksort  Memory: 42kB
               Worker 1:  Sort Method: quicksort  Memory: 42kB
               Worker 2:  Sort Method: quicksort  Memory: 42kB
               ->  Partial HashAggregate  (cost=65784.27..65786.27 rows=200 width=36) (actual time=14096.069..14096.652 rows=192 loops=4)
                     Group Key: _hyper_1_40_chunk.series_id
                     ->  Parallel Append  (cost=0.43..63525.22 rows=451809 width=12) (actual time=4.246..12750.656 rows=325488 loops=4)
                           ->  Parallel Index Scan using _hyper_1_40_chunk_series_values_time_idx on _hyper_1_40_chunk  (cost=0.43..61266.18 rows=451809 width=12) (actual time=4.241..12429.488 rows=325488 loops=4)
                                 Index Cond: (("time" >= '2020-09-19 09:00:00+00'::timestamp with time zone) AND ("time" <= '2020-09-19 11:00:00+00'::timestamp with time zone))
                                 Filter: ((series_id >= 21) AND (series_id <= 212))
                                 Rows Removed by Filter: 23621
 Planning Time: 11.483 ms
 Execution Time: 14304.284 ms

Problem no. 2 (deeper) - data layout is inefficient

If you look carefully at the plan of the query above (taking 14 seconds), you realise that it's still utterly inefficient for the data problem at hand. Data points belonging to the same series_id are about 250 rows apart from each other, all interleaved in the main table (because there are about 250 series, and they all have the same frequency of 1 data point per second). So the Index Scan is much closer to random access than to sequential access to the columns of the series_values table.

Prometheus does the same query on the same data in just 1.5 seconds, i.e. an order of magnitude faster. I think this is because Prometheus always groups the data points by metric. Timescale can do the same, but only on compressed chunks, via the compress_segmentby option. (Note that the sole chunk appearing in the queries above is uncompressed.)
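
With the compression settings from the schema above, the segmented-by-series layout only materializes once chunks are actually compressed, for example manually (the exact calls depend on the TimescaleDB version; newer versions also offer an automatic compression policy):

-- Compress all chunks older than one day, so that their rows get physically
-- grouped by series_id per the compress_segmentby setting above.
SELECT compress_chunk(c)
FROM show_chunks('series_values', older_than => INTERVAL '1 day') AS c;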

Another way to group the data points belonging to the same series in Timescale is partitioning via add_dimension(). However, this solution is actively discouraged in the docs:

TimescaleDB does not benefit from a very large number of space partitions (such as the number of unique items you expect in partition field). A very large number of such partitions leads both to poorer per-partition load balancing (the mapping of items to partitions using hashing), as well as much increased planning latency for some types of queries.

I haven't tried this yet, but I doubt that PostgreSQL/Timescale will always do projection pushdowns properly, and that query planning won't become super slow with thousands of chunks per table. In fact, we already had problems with query planning latency with just hundreds of chunks (partitioned by time only), so we had to increase the chunk interval.
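
For the record, the space-partitioning variant would look roughly like this (the partition count is an arbitrary number here, and add_dimension has to be run while the hypertable is still empty):

-- Sketch: add a hash-partitioned space dimension on series_id, so that data
-- points of the same series land in the same chunk.
SELECT add_dimension('series_values', 'series_id', number_partitions => 16);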

leventov commented 3 years ago

On the point of the inefficient data layout, partitioning/segmenting the data points by series would also make it possible to drop the series_id index on the main table, which accounts for almost half of this table's total disk usage.
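
Concretely, that would be something like the following once the rows are physically grouped by series (just a sketch of the idea, not something I've measured):

-- Sketch: once rows are grouped by series (via compress_segmentby or a space
-- dimension), the dedicated per-series index could be dropped.
DROP INDEX IF EXISTS series_values_series_id_idx;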

mfreed commented 3 years ago

Hi @leventov, pg_prometheus has been sunset in favor of the new system released here, which should be much more performant, has native PromQL support, native compression, and more. It has a very different internal data model than the older version (which was released some 3 years ago).

https://github.com/timescale/timescale-prometheus