[Bug]: Queries are faster on decompressed data than on compressed data

What type of bug is this?

Performance issue

What subsystems and features are affected?

Compression

What happened?

Running query from https://docs.timescale.com/tutorials/latest/blockchain-query/blockchain-compress/#taking-advantage-of-query-speedups gives better performance when data is decompressed and not when it's compressed.

TimescaleDB version affected

2.15.3

PostgreSQL version used

16.3

What operating system did you use?

Linux 6.5.0-15-generic

What installation method did you use?

Docker

What platform did you run on?

Not applicable

Relevant log output and stack trace

No response

How can we reproduce the bug?

Follow https://docs.timescale.com/tutorials/latest/blockchain-query/ part1

postgres=# SELECT decompress_chunk(c) from show_chunks('transactions') c;
NOTICE:  chunk "_hyper_1_1_chunk" is not compressed

postgres=# WITH recent_blocks AS (
 SELECT block_id FROM transactions
 WHERE is_coinbase IS TRUE
 ORDER BY time DESC
 LIMIT 5
)
SELECT
 t.block_id, count(*) AS transaction_count,
 SUM(weight) AS block_weight,
 SUM(output_total_usd) AS block_value_usd
FROM transactions t
INNER JOIN recent_blocks b ON b.block_id = t.block_id
WHERE is_coinbase IS NOT TRUE
GROUP BY t.block_id;
 block_id | transaction_count | block_weight |  block_value_usd
----------+-------------------+--------------+--------------------
   817875 |              3598 |      3991624 | 155280878.58270007
   817874 |              2075 |      3991937 | 12550755.632199755
   817876 |              2031 |      3991907 | 22534138.943499986
   817873 |              3841 |      3995610 | 145121447.89950028
   817877 |              2267 |      3991942 |  68192037.55160005
(5 rows)

Time: 33.118 ms

Then compress it again

postgres=# SELECT compress_chunk(c) from show_chunks('transactions') c;
             compress_chunk
----------------------------------------
 _timescaledb_internal._hyper_1_1_chunk
(1 row)

Time: 27348.346 ms (00:27.348)

And try again a few times and you'll see higher latency no matter what

postgres=# WITH recent_blocks AS (
 SELECT block_id FROM transactions
 WHERE is_coinbase IS TRUE
 ORDER BY time DESC
 LIMIT 5
)
SELECT
 t.block_id, count(*) AS transaction_count,
 SUM(weight) AS block_weight,
 SUM(output_total_usd) AS block_value_usd
FROM transactions t
INNER JOIN recent_blocks b ON b.block_id = t.block_id
WHERE is_coinbase IS NOT TRUE
GROUP BY t.block_id;
 block_id | transaction_count | block_weight |  block_value_usd
----------+-------------------+--------------+--------------------
   817875 |              3598 |      3991624 | 155280878.58270007
   817874 |              2075 |      3991937 | 12550755.632199755
   817876 |              2031 |      3991907 | 22534138.943499986
   817873 |              3841 |      3995610 | 145121447.89950028
   817877 |              2267 |      3991942 |  68192037.55160005
(5 rows)

Time: 88.454 ms

Can confirm:

--- compressed:
 HashAggregate  (cost=82315.04..82315.05 rows=1 width=28) (actual time=55.340..55.343 rows=5 loops=1)
   Group Key: t.block_id
   Batches: 1  Memory Usage: 24kB
   ->  Nested Loop  (cost=51296.47..51515.04 rows=3080000 width=16) (actual time=52.828..54.190 rows=13812 loops=1)
         ->  Limit  (cost=51295.49..51295.51 rows=5 width=12) (actual time=52.777..52.779 rows=5 loops=1)
               ->  Sort  (cost=51295.49..58995.49 rows=3080000 width=12) (actual time=52.776..52.777 rows=5 loops=1)
                     Sort Key: _hyper_39_22282_chunk."time" DESC
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Custom Scan (DecompressChunk) on _hyper_39_22282_chunk  (cost=0.04..137.80 rows=3080000 width=12) (actual time=0.022..52.712 rows=700 loops=1)
                           Filter: (is_coinbase IS TRUE)
                           Rows Removed by Filter: 2718385
                           ->  Seq Scan on compress_hyper_40_22283_chunk compress_hyper_40_22283_chunk_1  (cost=0.00..137.80 rows=3080 width=88) (actual time=0.003..0.158 rows=3080 loops=1)
         ->  Custom Scan (DecompressChunk) on _hyper_39_22282_chunk t  (cost=0.97..3.90 rows=4000 width=16) (actual time=0.016..0.159 rows=2762 loops=5)
               Filter: (is_coinbase IS NOT TRUE)
               Rows Removed by Filter: 1
               ->  Index Scan using compress_hyper_40_22283_chunk_block_id__ts_meta_sequence_nu_idx on compress_hyper_40_22283_chunk  (cost=0.28..3.90 rows=4 width=104) (actual time=0.003..0.003 rows=3 loops=5)
                     Index Cond: (block_id = _hyper_39_22282_chunk.block_id)
 Planning Time: 6.816 ms
 Execution Time: 55.388 ms

--- uncompressed:
 HashAggregate  (cost=138525.20..138527.20 rows=200 width=28) (actual time=3.812..3.813 rows=5 loops=1)
   Group Key: t.block_id
   Batches: 1  Memory Usage: 40kB
   ->  Nested Loop  (cost=338.34..137465.62 rows=105958 width=16) (actual time=0.196..2.854 rows=13812 loops=1)
         ->  Limit  (cost=0.43..0.78 rows=5 width=12) (actual time=0.158..0.853 rows=5 loops=1)
               ->  Index Scan using _hyper_39_22282_chunk_transactions_time_idx on _hyper_39_22282_chunk  (cost=0.43..296392.26 rows=4238304 width=12) (actual time=0.157..0.852 rows=5 loops=1)
                     Filter: (is_coinbase IS TRUE)
                     Rows Removed by Filter: 11812
         ->  Bitmap Heap Scan on _hyper_39_22282_chunk t  (cost=337.91..27281.04 rows=21192 width=16) (actual time=0.037..0.299 rows=2762 loops=5)
               Recheck Cond: (block_id = _hyper_39_22282_chunk.block_id)
               Filter: (is_coinbase IS NOT TRUE)
               Rows Removed by Filter: 1
               Heap Blocks: exact=777
               ->  Bitmap Index Scan on _hyper_39_22282_chunk_block_idx  (cost=0.00..332.61 rows=42383 width=0) (actual time=0.027..0.027 rows=2763 loops=5)
                     Index Cond: (block_id = _hyper_39_22282_chunk.block_id)
 Planning Time: 4.370 ms
 Execution Time: 3.839 ms
(17 rows)

Time: 8.570 ms

One reason is that we don't have a specialized compression algorithm and hence no vectorized filters for bool columns. There is a workaround for this, making is_coinbase a smallint instead, it makes the difference less drastic, but it's still worse:

explain (costs off, analyze) WITH recent_blocks AS (
 SELECT block_id FROM transactions
 WHERE is_coinbase != 0
 ORDER BY time DESC
 LIMIT 5
)
SELECT
 t.block_id, count(*) AS transaction_count,
 SUM(weight) AS block_weight,
 SUM(output_total_usd) AS block_value_usd
FROM transactions t
INNER JOIN recent_blocks b ON b.block_id = t.block_id
WHERE is_coinbase = 0
GROUP BY t.block_id;
                                                                                  QUERY PLAN                                                                                   
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
 HashAggregate (actual time=9.693..9.694 rows=5 loops=1)
   Group Key: t.block_id
   Batches: 1  Memory Usage: 24kB
   ->  Nested Loop (actual time=7.869..8.831 rows=13812 loops=1)
         ->  Limit (actual time=7.854..7.855 rows=5 loops=1)
               ->  Sort (actual time=7.853..7.854 rows=5 loops=1)
                     Sort Key: _hyper_39_22282_chunk."time" DESC
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Custom Scan (DecompressChunk) on _hyper_39_22282_chunk (actual time=0.025..7.802 rows=700 loops=1)
                           Vectorized Filter: (is_coinbase <> 0)
                           Rows Removed by Filter: 2718385
                           ->  Seq Scan on compress_hyper_41_22284_chunk compress_hyper_41_22284_chunk_1 (actual time=0.003..0.157 rows=3080 loops=1)
         ->  Custom Scan (DecompressChunk) on _hyper_39_22282_chunk t (actual time=0.008..0.097 rows=2762 loops=5)
               Vectorized Filter: (is_coinbase = 0)
               Rows Removed by Filter: 1
               ->  Index Scan using compress_hyper_41_22284_chunk_block_id__ts_meta_sequence_nu_idx on compress_hyper_41_22284_chunk (actual time=0.001..0.001 rows=3 loops=5)
                     Index Cond: (block_id = _hyper_39_22282_chunk.block_id)
 Planning Time: 6.280 ms
 Execution Time: 9.721 ms

Another problem is that in the compressed case the sort on time is not pushed down to DecompressChunk, because the work_mem setting is low. You can improve things by e.g. set work_mem to '64MB' (this does not require the smallint workaround, it's an independent improvement):

 HashAggregate (actual time=2.664..2.667 rows=5 loops=1)
   Group Key: t.block_id
   Batches: 1  Memory Usage: 40kB
   ->  Nested Loop (actual time=0.554..1.778 rows=13812 loops=1)
         ->  Limit (actual time=0.535..0.780 rows=5 loops=1)
               ->  Custom Scan (DecompressChunk) on _hyper_39_22282_chunk (actual time=0.534..0.778 rows=5 loops=1)
                     Filter: (is_coinbase <> 0)
                     Rows Removed by Filter: 14654
                     ->  Sort (actual time=0.440..0.441 rows=18 loops=1)
                           Sort Key: compress_hyper_41_22285_chunk_1._ts_meta_max_1 DESC
                           Sort Method: quicksort  Memory: 968kB
                           ->  Seq Scan on compress_hyper_41_22285_chunk compress_hyper_41_22285_chunk_1 (actual time=0.004..0.166 rows=3080 loops=1)
         ->  Custom Scan (DecompressChunk) on _hyper_39_22282_chunk t (actual time=0.010..0.101 rows=2762 loops=5)
               Vectorized Filter: (is_coinbase = 0)
               Rows Removed by Filter: 1
               ->  Index Scan using compress_hyper_41_22285_chunk_block_id__ts_meta_sequence_nu_idx on compress_hyper_41_22285_chunk (actual time=0.001..0.001 rows=3 loops=5)
                     Index Cond: (block_id = _hyper_39_22282_chunk.block_id)
 Planning Time: 5.686 ms
 Execution Time: 2.697 ms

timescale / timescaledb