blazesal opened this issue 2 years ago
Thanks for the in-depth analysis! We do have some code for setting up indexes. I think we can add these there.
There are two remarks:

1. Should there be a `--disable-indexes` option? `gin_trgm_ops` requires the admin to install an additional Postgres module; only then may the index get created by the application. Should you create a README entry about it and create a second migration command `-m2`?
2. How does the `gin_trgm_ops` index impact the results that are matched? The current behavior of exactly matching arbitrary substrings is pretty useful for code search. Is this index strictly a performance improvement with no change to search results, or will it impact the results returned?
Adding an index or changing its type should not change the semantics of a SQL query. Standard btree indexes support fast searching of left-anchored patterns such as `LIKE 'abcde%'`. In the transaction search endpoint, however, the `code` column is searched with an unanchored pattern such as `LIKE '%abcde%'`, hence another type of index is required to support it efficiently.
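A minimal sketch of the distinction, assuming the transactions table and its code column as described in the issue:

```sql
-- Left-anchored pattern: a plain btree index on code (declared with the
-- text_pattern_ops operator class, or under the C collation) can be used.
SELECT * FROM transactions WHERE code LIKE 'abcde%';

-- Unanchored pattern: a btree index cannot help here; a trigram index
-- (pg_trgm with gin_trgm_ops or gist_trgm_ops) is needed to avoid a
-- sequential scan of the whole table.
SELECT * FROM transactions WHERE code LIKE '%abcde%';
```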
I executed the endpoint with and without the index to compare execution times and the results:
http://{{indexer}}:8080/txs/search?limit=200&search=k:5adb16663073280acf63bc2a4bf477ad1391736dcd6217b094926862c72d15c9
Without the index the endpoint takes around 60 seconds to complete. With the index it takes about 3-6 seconds. The query sorts the transactions by block height, but within a single height the transactions may be (and are) returned in a different order.
Transactions for block height 2210429 - query results with no index:
Zj0meNG6rZKUqvGxZxMjW-bUEbQBIX-duRPDpuEOvKs
o9GzLWNqji5gndkLOP44wy3V_pTnq6HZZbaDmR-BXOQ
l0STasLY3eh_7n4xzinSfqSdEAi7RItsm_xa_2eUOm4
tTVAzxG6Z_OR8x7WqsM1lPnNhNc2odMCAX3uTbEGoLc
7oGhKqC7lVfpM0-8vfGNRXp3L1F5F7x0qI93NFT4zRo
Wox2ao4bCtFT-W2E17sA0Nh6JmHCN0y-KWQPYMMuiqw
2c00Kr23LYNUV7ldLovYJzrDqAFTpL8SeSxTXd-OjNE
Transactions for block height 2210429 - query results with the `gin_trgm_ops` index:
2c00Kr23LYNUV7ldLovYJzrDqAFTpL8SeSxTXd-OjNE
7oGhKqC7lVfpM0-8vfGNRXp3L1F5F7x0qI93NFT4zRo
o9GzLWNqji5gndkLOP44wy3V_pTnq6HZZbaDmR-BXOQ
l0STasLY3eh_7n4xzinSfqSdEAi7RItsm_xa_2eUOm4
Zj0meNG6rZKUqvGxZxMjW-bUEbQBIX-duRPDpuEOvKs
tTVAzxG6Z_OR8x7WqsM1lPnNhNc2odMCAX3uTbEGoLc
Wox2ao4bCtFT-W2E17sA0Nh6JmHCN0y-KWQPYMMuiqw
That difference between the results is acceptable: a query executed with an index may deliver rows in a different order than one executed without it, and there is no additional constraint on the result order within a single height. I recommend you execute the queries in those two variants on your end and confirm my findings.
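One way to run both variants against the same database is to drop the index inside a transaction and roll it back afterwards; a sketch, with the query reduced to the LIKE filter and ORDER BY discussed above (the real query generated by the endpoint is more involved, and the index name follows the "transactions-code" name mentioned later in this thread):

```sql
-- With the trigram index in place:
EXPLAIN ANALYZE
SELECT * FROM transactions
WHERE code LIKE '%k:5adb16663073280acf63bc2a4bf477ad1391736dcd6217b094926862c72d15c9%'
ORDER BY height DESC
LIMIT 200;

-- Without the index: DDL is transactional in Postgres, so the index can be
-- dropped for the comparison and rolled back afterwards (note that this
-- briefly takes an exclusive lock on the table).
BEGIN;
DROP INDEX "transactions-code";
EXPLAIN ANALYZE
SELECT * FROM transactions
WHERE code LIKE '%k:5adb16663073280acf63bc2a4bf477ad1391736dcd6217b094926862c72d15c9%'
ORDER BY height DESC
LIMIT 200;
ROLLBACK;
```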
Hi @blazesal, thank you very much for opening this ticket with valuable analysis, insights and recommendations. I wanted to let you know that this is something we've been discussing internally. As you've suggested, the GIN index is a very big aid to the query planner for running a search query like the one you're focusing on here. But unfortunately, that index can hurt as much as it helps for search queries with different characteristics. Let me expand:
Consider the long-running query you've identified in your issue description, in particular the `ORDER BY "t1"."height" DESC` clause. For your query, that `ORDER BY` clause doesn't hurt, because the term you're searching for results in a tiny number of rows being selected before ordering, and Postgres can easily sort those rows by height.
However, imagine the opposite case of a search query that will pick millions of rows, like "give me any transaction that has "transfer" in it". Currently, such a query runs fast, because the query planner will just walk backwards over the height index of the transactions table and it will quickly find, say, 100 transactions (for a request with `limit = 100`) that contain the string `transfer` and return them. Now, if we add the GIN index you're suggesting here, the query planner will instead walk over that GIN index and find all the transactions that contain "transfer" (millions), and then it will sort them to find the first 100 by height, which will probably have an even worse run time, since scanning a GIN index is slower than scanning the table.
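A sketch of the two cases, using simplified stand-ins for the queries the endpoint generates:

```sql
-- Selective search: few rows match, so finding them via the trigram GIN
-- index and then sorting those rows by height is cheap.
SELECT * FROM transactions
WHERE code LIKE '%k:5adb16663073280acf63bc2a4bf477ad1391736dcd6217b094926862c72d15c9%'
ORDER BY height DESC
LIMIT 200;

-- Broad search: millions of rows match. Without the GIN index the planner
-- can walk the height index backwards and stop after the first 100 hits;
-- with the GIN index it may instead collect every match and sort it all
-- before applying the LIMIT.
SELECT * FROM transactions
WHERE code LIKE '%transfer%'
ORDER BY height DESC
LIMIT 100;
```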
Just wanted to let you know that this problem with the opposite case is what held us back from applying your insightful suggestion right away.
@blazesal Note that the `requestkey` indexes you've suggested in your comment above have become a part of the indexes `chainweb-data` creates by default with #98. Thanks again for pointing them out.
Indexer performance problems
Investigation
Naming
In the description below the following placeholder names are used: `postgres`, `chainweb-data`, `db_user`.
Increase debug query size
Increase the query size limit, so it is possible to display large queries in full:
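The setting in question is presumably Postgres's track_activity_query_size, which controls how much of each running statement pg_stat_activity can show; a sketch (the 32768-byte value is an arbitrary choice):

```sql
-- Raise the amount of SQL text kept per backend for pg_stat_activity.
-- This parameter only takes effect after a server restart.
ALTER SYSTEM SET track_activity_query_size = 32768;
-- Then restart Postgres, e.g. with: pg_ctl restart -D <data_dir>
```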
Intercept long-running query
Invoke the indexer's endpoint, which fails to return the data quickly:
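For instance, the slow search request quoted earlier in this thread can be issued with curl ({{indexer}} stands for the indexer host):

```bash
# Unanchored substring search over transaction code; without a suitable
# index this takes on the order of a minute to return.
curl "http://{{indexer}}:8080/txs/search?limit=200&search=k:5adb16663073280acf63bc2a4bf477ad1391736dcd6217b094926862c72d15c9"
```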
Connect to the database and display the running SQL queries:
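A sketch of this step using pg_stat_activity (assuming the placeholder names from the Naming section denote the host, user, and database):

```sql
-- Connect, e.g.: psql -h postgres -U db_user chainweb-data
-- List currently running statements, longest-running first.
SELECT pid,
       now() - query_start AS runtime,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```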
Then analyze the query:
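The captured statement can then be analyzed with EXPLAIN ANALYZE; a simplified stand-in for the generated search query (the real query produced by chainweb-data is more involved) might look like:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM transactions AS t0
WHERE t0.code LIKE '%k:5adb16663073280acf63bc2a4bf477ad1391736dcd6217b094926862c72d15c9%'
ORDER BY t0.height DESC
LIMIT 200;
```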
Analysis gives a query plan in which the line

-> Parallel Seq Scan on transactions t0 (cost=0.00..554104.98 rows=213 width=1056) (actual time=521.416..2078.174 rows=27 loops=3)

indicates that the whole table needs to be sequentially scanned (Seq Scan) because of the lack of an appropriate index on the `code` column. After adding the necessary index the query plan changes: the Seq Scan is gone in favour of scanning the index (Bitmap Index Scan on "transactions-code").
Solution
The discussion about the right index supporting generic `LIKE` queries is here: https://stackoverflow.com/questions/1566717/postgresql-like-query-performance-variations In summary, the use of a GIN or GiST trigram index with the special operator classes provided by the additional module `pg_trgm` is the solution.
Install additional Postgres module
To install the needed module, log into the database as a super-user:
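A sketch of this step; pg_trgm ships with the standard Postgres contrib packages, so once connected as a superuser a single statement suffices:

```sql
-- e.g. psql -h postgres -U postgres chainweb-data
-- Makes the gin_trgm_ops / gist_trgm_ops operator classes available.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
```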
Identify missing indexes
The analysis of the queries resulted in the following list of indexes (by type) that need to be created:
btree
btree
gin_trgm_ops
gin_trgm_ops
gin_trgm_ops
Create appropriate indexes
The commands to create the above indexes:
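A sketch of those commands; only the code and requestkey columns are named explicitly in this thread, so the remaining gin_trgm_ops indexes from the list above are omitted rather than guessed, and the btree index name is illustrative:

```sql
-- btree index for request-key lookups.
CREATE INDEX IF NOT EXISTS transactions_requestkey_idx
    ON transactions (requestkey);

-- Trigram GIN index enabling fast unanchored LIKE '%...%' searches on code;
-- this is the "transactions-code" index referenced in the query plan above.
-- On a live database, consider CREATE INDEX CONCURRENTLY to avoid blocking writes.
CREATE INDEX IF NOT EXISTS "transactions-code"
    ON transactions USING gin (code gin_trgm_ops);
```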
Should this section get included in the README?