near / near-indexer-for-explorer

Watches the NEAR network and stores all data from the NEAR blockchain in a PostgreSQL database
https://near-indexers.io/docs/projects/near-indexer-for-explorer
GNU General Public License v3.0

Indexing from scratch seems abnormally long #345

Open klefevre opened 1 year ago

klefevre commented 1 year ago

Hi all,

After some quirks we finally started indexing from block 1 (actually 9,820,210, but you get my point) last week, and it started quite gracefully. As I mentioned in a previous issue, we see occasional ForeignKeyViolation errors that the internal retry handles correctly:

ERROR explorer_database:Error occurred during DatabaseError(ForeignKeyViolation, "insert or update on table \"receipts\" violates foreign key constraint \"tx_receipt_fk\""):

However, these errors got worse day after day. Currently we're approaching 40M blocks, and we see these errors every ~5 blocks, which dramatically slows down the entire process. We estimate that, at this rate, indexing will take months to finish. I don't know the average time to fully index NEAR, but I believe something must be wrong on our side.

Is it the intended behavior or did we miss something?

Note that, as suggested, we run the indexer with --concurrency 1 in strict mode to avoid any issue with the data.

Please forgive me, but I haven't dug into the code yet to understand how it works, so I don't have the full picture of why it behaves this way. While it makes perfect sense in sync mode, could you explain why we need to use strict mode to index "old" data that, I presume, should be fetched from S3? If anybody has hints on how to speed up indexing while keeping data integrity, that would be awesome!

Best,

telezhnaya commented 1 year ago

Currently we're approaching 40M blocks, and we see these errors every ~5 blocks, which dramatically slows down the entire process.

Both statements are true, but there's no causal relationship. The process slows down because the chain becomes more active: there are many more transactions, receipts, and other entities that we need to track. Be prepared for the processing speed to get slower and slower.

The logic that causes these errors actually makes the situation better: it speeds up the processing. But if you insist, just make these two futures run sequentially by awaiting each of them: https://github.com/near/near-indexer-for-explorer/blob/master/indexer/src/main.rs#L42-L67

You can speed up the process by dropping some logic. E.g., if you don't need fungible token data or the account changes table, just comment out the collection of that data. Be careful: not all the tables can be commented out. E.g., most of the tables have an FK on blocks, so blocks is required (or you need to drop the FKs as well).

khorolets commented 1 year ago

Hey @klefevre! Happy to hear you've managed to start indexing from the genesis!

We do expect database errors to occur from time to time when we concurrently store data about receipts and transactions, and they auto-resolve. However, I'm not sure the one you've shared is the expected one. I'll ask @telezhnaya if she remembers.

indexing will take months to finish

Sorry it's not what you expected, but it's normal.

could you explain why we need to use strict mode to index "old" data that, I presume, should be fetched from S3?

That's actually a feature of this indexer (Indexer for Explorer). It stores the data in a relational database and we require it to be consistent. The easiest approach for us is to use a bunch of constraints in the database schema.

Strict mode is a normal/regular way of running this indexer. And non-strict (I bet the name is not the best here) was introduced to deal with problems that happen from time to time. For instance, you're running a couple of nodes, and one of them is dropped out of sync for some reason you need to proceed with showing the latest (not complete) data to the users while the other node is trying to catch up and fill the gaps. In this case, you run one of the nodes in non-strict mode and it starts to omit some of the data (StateChanges in our case) and does a limited number of retries when the data is not consistent.

Note that, as suggested, we run the indexer with --concurrency 1 in strict mode to avoid any issue with the data.

Since you've started your indexer from genesis, I assume you care about consistency, and that is the right way to run it. Actually, we created the concurrency flag before we realized it's useless to run this indexer with any concurrency other than 1 :(

As a summary:

Indexer for Explorer is made to serve the Explorer's needs. It requires all the data to be present and consistent. In most cases, if you're not running your own Explorer, this indexer is overkill, and perhaps you should consider creating your own custom indexer that handles the data you need, in the way you need it.

Feel free to ask questions or close the issue if you got the answers :)

klefevre commented 1 year ago

First of all, thank you both for these quick answers!

The process slows down because the chain becomes more active: there are many more transactions, receipts, and other entities that we need to track. Be prepared for the processing speed to get slower and slower.

We indeed expected that the indexing would slow down, but since it started off so fast, we were surprised that it slowed down this much, even after just a few days! :)

It stores the data in a relational database and we require it to be consistent. The easiest approach for us is to use a bunch of constraints in the database schema.

I fully understand this requirement. However, my point in asking why strict mode is required for old data is that, if we assume the data stored on S3 is valid, inserting into the DB with these constraints might be unnecessary?

Again, I need to dig into the code, but does the indexer actually use these FK errors to fetch missing data? Or are they somehow just a way to wait before the next block is processed? If the latter, could we imagine dropping all FKs and indexes during the initial indexing (i.e., from the genesis block to the latest block) and restoring them afterward?

The logic that causes these errors actually makes the situation better: it speeds up the processing. But if you insist, just make these two futures run sequentially by awaiting each of them: https://github.com/near/near-indexer-for-explorer/blob/master/indexer/src/main.rs#L42-L67

As I said, I need to dig into the code, but that's actually pretty promising! I'll check what I can do. Also, I indeed don't need every table, but the ones I do need are the biggest, so removing the others wouldn't gain much from a DB-size point of view. Still, if it could significantly accelerate indexing, I'll check that too.

telezhnaya commented 1 year ago

the ones I do need are the biggest

Which ones do you need, btw?

klefevre commented 1 year ago

We need:

telezhnaya commented 1 year ago

@klefevre take a look at https://github.com/near/near-microindexers

They have an improved structure and additional columns. Also, we use sqlx there instead of diesel, so it may be a little faster (but I haven't measured).

klefevre commented 1 year ago

@telezhnaya That's awesome! How did I not see this repo before!?

If I follow you correctly, if I want all the tables I indicated, I'll need to merge the logic from both the indexer-balances (your assumption is right, I need to follow balance updates) and indexer-events repositories to get my own indexer. Right?

Also, you didn't answer my question about potentially dropping foreign keys and indexes before starting to index. Once finished, I would add them back. As I don't need to perform queries during the initial indexing, this would speed up the process like crazy. However, if the indexers somehow use these constraints to dedup stuff, then I'll keep them. WDYT?

telezhnaya commented 1 year ago

@klefevre

I'll need to merge the logic

Not even merge, just run them in parallel :) Note that if you want transactions/receipts, there is indexer-base for that, but I haven't tested it extensively, and I think I'd change its structure further. So maybe it's still better to use indexer-for-explorer for the rest of the tables.

Also, you didn't answer my question about potentially dropping foreign keys and indexes before starting to index. Once finished, I would add them back.

You need to keep primary keys and unique constraints/indexes. All the others (including all non-unique indexes, btree or otherwise) you can drop and recreate at the end.
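A minimal sketch of that workflow in SQL. The index name below is hypothetical; list the real ones first (e.g. via `pg_indexes`) and keep anything backing a primary key or unique constraint:

```sql
-- List the non-unique indexes on a table before touching anything:
--   SELECT indexname FROM pg_indexes WHERE tablename = 'receipts';

-- Drop a non-unique index before the backfill (hypothetical name):
DROP INDEX IF EXISTS receipts_timestamp_idx;

-- ... run the indexer from genesis ...

-- Recreate it afterwards; CONCURRENTLY avoids taking a long write lock
-- if the database is already serving other readers.
CREATE INDEX CONCURRENTLY receipts_timestamp_idx
    ON receipts (included_in_block_timestamp);
```

Rebuilding the indexes once at the end is a single sequential pass per index, which is generally much cheaper than maintaining them across tens of millions of inserts.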

telezhnaya commented 1 year ago

@klefevre upd: be careful with dropping indexes. It may dramatically slow down, e.g., this code: https://github.com/near/near-indexer-for-explorer/blob/master/database/src/adapters/receipts.rs#L247 We need to find the parent transaction hash, so we run several SELECT statements, and without indexes they may run practically forever.
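For intuition, the lookup in question is roughly of this shape (a sketch, not the exact query; see the linked adapter code for the real statements):

```sql
-- Find the transaction that was converted into a given receipt. Without an
-- index on the lookup column this degrades to a full sequential scan of the
-- transactions table on every receipt.
SELECT transaction_hash
FROM transactions
WHERE converted_into_receipt_id = $1;
```

So any index that one of these hot lookups depends on should stay in place even during the backfill.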

Diesel makes the SELECT statements hard to read. You can find the raw ones here: https://github.com/near/near-microindexers/blob/main/indexer-base/src/db_adapters/receipts.rs They are the same as in indexer-for-explorer.