hydradatabase / hydra

Hydra: Column-oriented Postgres. Add scalable analytics to your project in minutes.
https://www.hydra.so
Apache License 2.0

indexes are not preferred by the planner #155

Open millken opened 11 months ago

millken commented 11 months ago

What's wrong?

I used two tables in this test, public.block_receipts and reader.block_receipts. public.block_receipts is a heap table, and reader.block_receipts is generated as a columnar table using pg_ivm. Both have the same hash index on action_hash.

testnet=# \d+ public.block_receipts;
                                                       Unlogged table "public.block_receipts"
        Column        |          Type          | Collation | Nullable |        Default        | Storage  | Compression | Stats target | Description 
----------------------+------------------------+-----------+----------+-----------------------+----------+-------------+--------------+-------------
 id                   | bigint                 |           | not null |                       | plain    |             |              | 
 block_height         | bigint                 |           |          |                       | plain    |             |              | 
 action_hash          | character varying(64)  |           | not null |                       | extended |             |              | 
 gas_consumed         | integer                |           | not null | 0                     | plain    |             |              | 
 contract_address     | character varying(42)  |           | not null | ''::character varying | extended |             |              | 
 status               | smallint               |           | not null | 0                     | plain    |             |              | 
 execution_revert_msg | character varying(255) |           | not null | ''::character varying | extended |             |              | 
Indexes:
    "block_receipts_pkey" PRIMARY KEY, btree (id)
    "idx_block_receipts_action_hash" hash (action_hash)
Access method: heap
testnet=# \d+ reader.block_receipts;
                                                    Table "reader.block_receipts"
        Column        |          Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 
----------------------+------------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 id                   | bigint                 |           |          |         | plain    |             |              | 
 block_height         | bigint                 |           |          |         | plain    |             |              | 
 action_hash          | character varying(64)  |           |          |         | extended |             |              | 
 gas_consumed         | integer                |           |          |         | plain    |             |              | 
 contract_address     | character varying(42)  |           |          |         | extended |             |              | 
 status               | smallint               |           |          |         | plain    |             |              | 
 execution_revert_msg | character varying(255) |           |          |         | extended |             |              | 
Indexes:
    "block_receipts_index" UNIQUE, btree (id)
    "idx_block_receipts_action_hash" hash (action_hash)
Access method: columnar

Then I ran the following query:

select id,block_height,action_hash,gas_consumed,contract_address,status,execution_revert_msg 
from reader.block_receipts 
where action_hash='3866a5be503847400594d1911e5a411a83fc8b10f1ecd1aef4cb89bee7beeb94';

I have noticed that queries on the columnar table are relatively slow; performance is comparable to using no index at all.

EXPLAIN results:

#On Heap
 Index Scan using idx_block_receipts_action_hash on block_receipts  (cost=0.00..113907.92 rows=125081 width=786)
   Index Cond: ((action_hash)::text = '3866a5be503847400594d1911e5a411a83fc8b10f1ecd1aef4cb89bee7beeb94'::text)
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
(5 rows)
#On Columnar
 Gather  (cost=1000.00..30599.12 rows=125081 width=786)
   Workers Planned: 7
   ->  Parallel Custom Scan (ColumnarScan) on block_receipts  (cost=0.00..17091.02 rows=3573749 width=786)
         Filter: ((action_hash)::text = '3866a5be503847400594d1911e5a411a83fc8b10f1ecd1aef4cb89bee7beeb94'::text)
         Columnar Projected Columns: id, block_height, action_hash, gas_consumed, contract_address, status, execution_revert_msg
(5 rows)

Time: 0.941 ms
JerrySievert commented 11 months ago

hi, thanks for the report!

before moving to indexes, there are a couple of things I've noticed which might make columnar storage a less-than-ideal candidate for you:

  1. you're retrieving every column, which essentially makes the storage engine retrieve, decompress, and reconstruct every column back into a row. this will likely be a little slower than heap storage because of the extra overhead. columnar storage works best when you select only a subset of the available columns.
  2. it looks like you're querying on a generated hash, which I'm guessing is randomly interspersed throughout the data. the columnar engine keeps a minimum and maximum value per chunk (a group of values stored and compressed together) so it can eliminate whole chunks from ever being searched. this works really well on ordered data, but if a column's values are essentially random, the engine ends up checking every chunk for the value, which is the worst case.
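both points can be sketched with hypothetical variants of the query from the report (the sorted-copy table name below is made up for illustration):

```sql
-- (1) project only the columns you need: the engine then decompresses just
-- those column chunks instead of reconstructing whole rows.
SELECT id, block_height
FROM reader.block_receipts
WHERE action_hash = '3866a5be503847400594d1911e5a411a83fc8b10f1ecd1aef4cb89bee7beeb94';

-- (2) chunk elimination relies on per-chunk min/max metadata, so it helps
-- most when the filtered column is physically ordered. one way to get that
-- is to rewrite the table sorted by the filter column:
CREATE TABLE reader.block_receipts_sorted USING columnar
AS SELECT * FROM reader.block_receipts ORDER BY action_hash;
```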

ok, now to indexes!

postgres attempts to choose the best plan available for querying, and is usually right, but we can check those assumptions. you can try disabling the columnar custom scan:

SET columnar.enable_custom_scan = 'f';

this will disable the custom scan, and drop back to either a sequential scan or an index scan based on whichever postgres thinks will be faster. if it gives you a sequential scan, you can disable that as well:

SET enable_seqscan = 'f';

that should be enough to drop you to an index scan, and you will likely see that it is no faster than not using the index (partly due to point 1 above). you can also use pg_hint_plan to hint the planner toward an index. you can learn more on our indexes and indexing strategies page.
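putting the two settings together, a quick experiment in a single session might look like this (both settings are session-local and revert on RESET or disconnect):

```sql
SET columnar.enable_custom_scan = 'f';  -- disable the columnar custom scan
SET enable_seqscan = 'f';               -- discourage sequential scans too

-- compare the resulting plan and timing against the earlier EXPLAIN output
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, block_height
FROM reader.block_receipts
WHERE action_hash = '3866a5be503847400594d1911e5a411a83fc8b10f1ecd1aef4cb89bee7beeb94';

RESET enable_seqscan;
RESET columnar.enable_custom_scan;
```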

wuputah commented 11 months ago

Hi there! This is somewhere between a known issue and a limitation of columnar systems.

Right now we have made some effort to improve the performance of index-backed scans when indexes are used by the planner, but we have not changed the planner or the columnar custom scan to make better use of indexes. As Jerry said, you can hint the planner to force it to use an index for a given query; this is mostly a workaround for the fact that we have tuned the planner to prefer our parallel scan for most queries. As a result, indexes will be most useful for power users who want to experiment and tune their queries with pg_hint_plan. In that regard, it's doing what we would expect.
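For reference, a minimal pg_hint_plan invocation for this query might look like the following (a sketch, assuming the extension is installed; the hint must appear in a comment at the start of the query, and the index name is taken from the \d+ output above):

```sql
LOAD 'pg_hint_plan';  -- per-session; alternatively add it to shared_preload_libraries

/*+ IndexScan(block_receipts idx_block_receipts_action_hash) */
SELECT id, block_height, action_hash
FROM reader.block_receipts
WHERE action_hash = '3866a5be503847400594d1911e5a411a83fc8b10f1ecd1aef4cb89bee7beeb94';
```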

Generally, columnar is not ideal for "needle in a haystack" queries like the one you ran (at least, I'm assuming that action_hash is unique or very nearly so, i.e. has very high cardinality). That said, this is exactly the kind of query an index would help with if it were picked up, since in this case the index would let the database narrow the haystack down to a stripe/chunk. But no matter what we do, you will always get better performance for this specific query with a btree index on a heap table.

millken commented 11 months ago

@JerrySievert Thank you very much for your reply. I have solved the problem of the index not being used by adding pg_hint_plan.