near / near-indexer-for-explorer

Watch NEAR network and store all the data from NEAR blockchain to PostgreSQL database
https://near-indexers.io/docs/projects/near-indexer-for-explorer
GNU General Public License v3.0

Optimize Explorer stats #137

Closed frol closed 2 years ago

frol commented 3 years ago

Currently, the https://explorer.near.org/stats page relies on heavy queries that are executed and cached every few hours (we could update it daily, but for redundancy reasons we run them more often than that).

It often gets stuck for one reason or another, and we would benefit from avoiding the heavy queries altogether and instead having a table with the data appended daily.

According to our current plan, we are going to keep extending Indexer for Explorer with more regular tasks that fill up extra tables.

@telezhnaya This might be on our table down the road, so just keep that in mind while we work on the circulating supply and balance changes.

telezhnaya commented 3 years ago

I analyzed the queries we are doing for the stats page.

  1. I am completely sure we can calculate and store all AggregatedByDate stats in the DB; that would be easy to calculate and cheap to store. It's one table, aggregated__daily_statistics, with one row per day (suggest a better name if you have one). queryActiveAccountsCountAggregatedByWeek could also go into aggregated__daily_statistics; we can group by week in the select later if we need to. A rough sketch of these tables follows after this list.

  2. I'd prefer to extract the partners' queries into a separate table, where each row corresponds to one day and one partner, which means 16 rows per day. The list of partners is here. It easily covers queryPartnerTotalTransactions, queryPartnerFirstThreeMonthTransactions, and queryPartnerUniqueUserAmount.

  3. getLockupAccountIds and getLastYesterdayBlock are needed only for the circulating supply, which will be dropped from near-explorer within the next month.

  4. Finally, we have queryActiveAccountsList and queryActiveContractsList. We also have the long-standing wish to speed up getting the transaction count for each user, so it's convenient to discuss that here. Given that we have a maximum of 20k active users per day, I think we can store stats for all active accounts in a third table, and stats for contracts in a fourth table. If we have the total number of each user's transactions as of yesterday, it's fast to compute the running total per user, because the query that counts only the last day is fast enough. These two tables sound to me like a heavy compression of account_changes, keeping only the main details from it; that sounds useful for any kind of analysis. I'd prefer to think more about the design of these tables: we can aggregate more here than we have now and add new features down the road.
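A minimal sketch of what these tables could look like, assuming the row-per-day and row-per-(day, partner) layouts described above; all table and column names are hypothetical, not an agreed schema:

```sql
-- Hypothetical sketches only; names and columns are illustrative, not final.

-- Item 1: one row per day.
CREATE TABLE aggregated__daily_statistics (
    collected_for_day     DATE PRIMARY KEY,
    new_accounts_count    BIGINT NOT NULL,
    active_accounts_count BIGINT NOT NULL,
    transactions_count    BIGINT NOT NULL
);

-- Item 2: one row per (day, partner), i.e. 16 rows per day.
CREATE TABLE aggregated__daily_partner_statistics (
    collected_for_day  DATE   NOT NULL,
    partner_account_id TEXT   NOT NULL,
    transactions_count BIGINT NOT NULL,
    unique_users_count BIGINT NOT NULL,
    PRIMARY KEY (collected_for_day, partner_account_id)
);

-- Item 4: one row per (day, active account), at most ~20k rows per day.
CREATE TABLE aggregated__daily_account_statistics (
    collected_for_day  DATE   NOT NULL,
    account_id         TEXT   NOT NULL,
    transactions_count BIGINT NOT NULL,
    PRIMARY KEY (collected_for_day, account_id)
);
```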

One more thing I want to mention here is partitioning. I am not sure how partitioning will affect our performance for these tasks, but I'd still prefer to start by configuring partitioning and then look at the performance. It sounds like we want these performance improvements anyway, but maybe this task will then become less pressing.

Note to self: while implementing, we could include not only "daily" stats for everything but also cumulative stats (the overall sum and/or the sum for the last 2 weeks). That would also simplify queries for us.

frol commented 3 years ago

Thanks for digging into it and summarizing it!

> while implementing, we could include not only "daily" stats for everything but also cumulative stats (the overall sum and/or the sum for the last 2 weeks). That would also simplify queries for us.

Cumulative stats are super straightforward to implement on the client side, so I don't think we need to store those in the database.
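If we ever do want cumulative numbers on the database side, they can be derived on the fly with a window function. A minimal sketch against the hypothetical aggregated__daily_statistics table from the earlier comment, illustrative only:

```sql
-- Cumulative totals computed on read, so nothing extra needs to be stored.
SELECT
    collected_for_day,
    new_accounts_count,
    SUM(new_accounts_count) OVER (ORDER BY collected_for_day) AS new_accounts_cumulative
FROM aggregated__daily_statistics
ORDER BY collected_for_day;
```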

telezhnaya commented 3 years ago

I would surely prefer to implement the DB solution mentioned above (because I like analytical data), but I still need to mention: maybe we simply want to cache the results on the backend side? That would surely cover scenarios 1 and 2 without any side effects. Scenario 4, though, needs the data prepared in the DB; I suppose it's too memory-consuming to store in a cache.

frol commented 3 years ago

Caching might be the way to go, indeed.

frol commented 3 years ago

Counting the number of new accounts is impossible as of today:

```sql
SELECT
  DATE_TRUNC('day', TO_TIMESTAMP(receipts.included_in_block_timestamp / 1000000000)) AS date,
  COUNT(created_by_receipt_id) AS new_accounts_count_by_date
FROM accounts
JOIN receipts ON receipts.receipt_id = accounts.created_by_receipt_id
GROUP BY date
ORDER BY date;
```

It just times out all the time.
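For comparison, the incremental approach discussed above only ever scans a single day. A hedged sketch (the timestamp arithmetic is illustrative, and it assumes included_in_block_timestamp can be filtered efficiently):

```sql
-- Count only yesterday's new accounts; a single-day scan stays fast,
-- while the full-history GROUP BY above times out.
SELECT COUNT(accounts.created_by_receipt_id) AS new_accounts_count
FROM accounts
JOIN receipts ON receipts.receipt_id = accounts.created_by_receipt_id
WHERE receipts.included_in_block_timestamp >=
      EXTRACT(EPOCH FROM DATE_TRUNC('day', NOW() - INTERVAL '1 day'))::BIGINT * 1000000000
  AND receipts.included_in_block_timestamp <
      EXTRACT(EPOCH FROM DATE_TRUNC('day', NOW()))::BIGINT * 1000000000;
```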

telezhnaya commented 3 years ago

Just a note: while reading the code more carefully, I realized that our current approach already means caching by default. So the only way I see to speed up the stats is to create a separate stats table in the Indexer database.

telezhnaya commented 3 years ago

> queryActiveAccountsCountAggregatedByWeek could also go into aggregated__daily_statistics; we can group by week in the select later if we need to.

I lied, it's not the same: we can't calculate the weekly active-users number from the daily numbers.

Example: user A is active every day except Monday, user B is active only on Monday, and user C is active only on Wednesday. The week of daily active-user counts is then the vector 1 1 2 1 1 1 1. From that vector alone, it's impossible to recover the right answer (3 active users: A, B, C).
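Put differently, the exact weekly number needs COUNT(DISTINCT ...) over per-account rows; summing the daily counters cannot recover it. A sketch against the hypothetical aggregated__daily_account_statistics table from above:

```sql
-- Exact weekly active accounts: distinct accounts across the whole week.
SELECT
    DATE_TRUNC('week', collected_for_day) AS week,
    COUNT(DISTINCT account_id)            AS weekly_active_accounts
FROM aggregated__daily_account_statistics
GROUP BY week
ORDER BY week;

-- Summing the daily counters instead gives 1+1+2+1+1+1+1 = 8 for the
-- example above, while the true answer is 3 distinct accounts (A, B, C).
```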

No one cares about exact numbers here, so we could pick an aggregation function (average, median, max, sum), but the result could be far from the truth if the sets of users differ from day to day. It's better to clarify how accurate we should be here.

We have 2 other solutions:

telezhnaya commented 2 years ago

I've created a draft Python implementation that collects the statistics: https://github.com/telezhnaya/near-analytics

frol commented 2 years ago

We should not mislead anyone relying on the charts, so I suggest we either present the exact truth or not expose a statistic that we cannot reliably collect. For some of the weekly charts, we will need separate tables.
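One shape such a separate table could take; the name and columns are hypothetical:

```sql
-- One row per week, filled from raw per-account data so the distinct
-- count is exact rather than derived from the daily numbers.
CREATE TABLE aggregated__weekly_statistics (
    collected_for_week    DATE PRIMARY KEY,  -- first day of the week
    active_accounts_count BIGINT NOT NULL
);
```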

telezhnaya commented 2 years ago

We solved the problem with https://github.com/near/near-analytics