Hi @roshkins!
The query is tough for several reasons:

- `LIKE` in a `WHERE` clause usually slows everything down.
- `LIMIT` also slows everything down, sometimes dramatically (see https://dba.stackexchange.com/questions/208065/why-is-limit-killing-performance-of-this-postgres-query/208072).
- Sometimes it helps to take only a limited slice of time (`WHERE timestamp > ... AND timestamp < ...`); see the sketch after this list. The most important part here is to get rid of the last 10-20 minutes of the blockchain, which is usually the reason for conflicts while performing the queries.
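A minimal sketch of that timestamp filter, assuming a hypothetical `transactions` table with a nanosecond `block_timestamp` column (the actual indexer schema may differ):

```sql
-- Restrict the scan to a bounded window and stay clear of the chain head.
-- `transactions` and `block_timestamp` are hypothetical names for illustration;
-- the timestamps here are assumed to be nanoseconds since the Unix epoch.
SELECT count(*)
FROM transactions
WHERE block_timestamp >= (extract(epoch FROM timestamp '2022-01-01') * 1e9)::bigint
  AND block_timestamp <  (extract(epoch FROM timestamp '2022-01-02') * 1e9)::bigint;
```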
Could you please re-check the idea of the query? You use `COUNT` + `LIMIT 10`, and you do not use the `GROUP BY` column for printing. I ran the query and got just a column of random numbers.
@roshkins Well, slow queries are going to be slow, and we cannot afford letting them run forever (a PostgreSQL replica limitation also comes into play here: long-running queries block new data from being synced from the master database instance).
You will need to slim your query down and compute it in chunks (e.g. compute the stats per day or hour, then sum those values together). See https://github.com/telezhnaya/near-analytics/blob/main/aggregations/db_tables/daily_gas_used.py for inspiration, and the sketch below.
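A minimal sketch of the chunking idea in SQL, assuming the same hypothetical `transactions` table and nanosecond `block_timestamp` as above. Each one-day window stays cheap enough to finish within the timeout, and the per-day results can then be summed on the client side:

```sql
-- Hypothetical per-day aggregation: run one small query per bounded window,
-- then combine the daily counts outside the database instead of issuing
-- a single huge query over the whole history.
SELECT to_timestamp(block_timestamp / 1e9)::date AS day,
       count(*)                                  AS txs
FROM transactions
WHERE block_timestamp >= (extract(epoch FROM timestamp '2022-01-01') * 1e9)::bigint
  AND block_timestamp <  (extract(epoch FROM timestamp '2022-01-02') * 1e9)::bigint
GROUP BY day;
```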
I ran this query:
and got this after trying a few times and getting a timeout after 30 seconds:
I am thinking that this might be an indexing problem. I tried it with `GROUP BY` and the JSON constraint commented out, and I still got the error.
I am willing to try poking around to see if adding indices fixes the issue. I am no DB expert, and would love help and advice.
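For reference, `LIKE` filters can sometimes be accelerated with a trigram index via PostgreSQL's `pg_trgm` extension. A minimal sketch, assuming a hypothetical `transactions` table with a text column `args_text` being matched with `LIKE` (if the real column is `jsonb`, an expression index over the extracted text would be needed instead):

```sql
-- pg_trgm provides trigram operator classes that let a GIN index
-- serve LIKE/ILIKE pattern matching. All names here are hypothetical.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

CREATE INDEX idx_transactions_args_trgm
    ON transactions
    USING gin (args_text gin_trgm_ops);
```

Note that indexes cannot be created on a read-only replica, so this would have to be done on the primary database by whoever operates it.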