Block data synchronization and full node operation on test3

dongwon8247 commented 1 year ago

Description

I want to share Onbloc's experience of syncing block data and running a full node on testnet3 for our infra tools such as Adena and Gnoscan. I hope our early experience contributes to this issue and improves the block sync and full node operation experience for other Gnoland infra teams in the future.

cc @moul @zivkovicmilos @albttx @r3v4s

How we sync the block data and run a full node on test3

1. Run a script that requests "https://rpc.test3.gno.land/block" every 15 seconds to determine whether there is a recent block.
   - 1-1. If no block has been generated, wait another 15 seconds and repeat step 1.
   - 1-2. If a block has been generated, go to step 2.

We chose 15 seconds because it seems like a reasonable interval for dealing with irregular block times: currently on test3, a block is generated every minute if there are no transactions, but if a block contains one or more transactions, it executes and the next block is generated right away (usually a second later). We could poll every 5 seconds or even more often to get closer to real-time data, but this is not a priority because we are on a testnet and it could overload the RPC server, which we thought was unnecessary.
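The polling step could be sketched in Go roughly as follows. This is only an illustration, not Onbloc's actual script: the JSON shape of the `/block` response (a string `height` under `result.block.header`) is assumed from Tendermint-style RPCs, and the names `parseHeight`, `missingHeights`, `fetchLatestHeight`, and `poll` are made up for this sketch. The catch-up loop over every missed height is also an addition, since several blocks may appear between two polls when transactions arrive.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"strconv"
	"time"
)

// blockResponse models only the field this sketch reads.
// ASSUMPTION: Tendermint-style JSON with the height encoded as a string.
type blockResponse struct {
	Result struct {
		Block struct {
			Header struct {
				Height string `json:"height"`
			} `json:"header"`
		} `json:"block"`
	} `json:"result"`
}

// parseHeight extracts the block height from a /block response body.
func parseHeight(body []byte) (int64, error) {
	var r blockResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return 0, err
	}
	return strconv.ParseInt(r.Result.Block.Header.Height, 10, 64)
}

// missingHeights lists every height after last, up to and including latest.
func missingHeights(last, latest int64) []int64 {
	var hs []int64
	for h := last + 1; h <= latest; h++ {
		hs = append(hs, h)
	}
	return hs
}

// fetchLatestHeight requests the latest block and returns its height.
func fetchLatestHeight(ctx context.Context, url string) (int64, error) {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	return parseHeight(body)
}

// poll checks for new blocks once per interval until ctx is cancelled and
// calls handle for each height that has not been processed yet.
func poll(ctx context.Context, url string, interval time.Duration, last int64, handle func(int64)) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if latest, err := fetchLatestHeight(ctx, url); err == nil {
			for _, h := range missingHeights(last, latest) {
				handle(h)
				last = h
			}
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

func main() {
	// Offline demonstration with a sample payload; no network access.
	sample := []byte(`{"result":{"block":{"header":{"height":"102"}}}}`)
	latest, err := parseHeight(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(missingHeights(100, latest)) // prints [101 102]
}
```

In a real sync program, something like `poll(ctx, "https://rpc.test3.gno.land/block", 15*time.Second, lastStoredHeight, handle)` would replace the offline demonstration in `main`, with `handle` doing the parsing and storage described in the following steps.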

2. Use a slightly modified tm2txport binary to parse information about the latest block.
   - 2-1. Parse block information.
   - 2-2. Parse transaction information.
3. Bundle the block information and transaction information and store it in ES (Elasticsearch).
   - 3-1. If there is a transaction, add a temporary hash value (block height_transaction index) and save it.
   - 3-2. If there is no transaction, no hash value is stored.

Transaction hashes have been temporarily assigned with values in [block height_transaction index] format, as the tx hashing function is under development (#546).

4. If there is a transaction in step 3, it is processed and stored in MySQL.
   - 4-1. Each type and function requires slightly different data for Gnoscan, so we save them individually.
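The temporary transaction identifier in [block height_transaction index] format could be built and parsed along these lines. This is a minimal sketch; the function names `tempTxID` and `parseTempTxID` are hypothetical, and the only detail taken from the description above is the `height_index` layout.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// tempTxID builds a temporary identifier in "height_index" format.
// Hypothetical helper; a stand-in until real tx hashes are available.
func tempTxID(height int64, index int) string {
	return fmt.Sprintf("%d_%d", height, index)
}

// parseTempTxID splits a temporary identifier back into height and index.
func parseTempTxID(id string) (int64, int, error) {
	h, i, ok := strings.Cut(id, "_")
	if !ok {
		return 0, 0, fmt.Errorf("malformed temporary tx id %q", id)
	}
	height, err := strconv.ParseInt(h, 10, 64)
	if err != nil {
		return 0, 0, err
	}
	index, err := strconv.Atoi(i)
	if err != nil {
		return 0, 0, err
	}
	return height, index, nil
}

func main() {
	id := tempTxID(1234, 0)
	fmt.Println(id) // prints 1234_0
	height, index, err := parseTempTxID(id)
	if err != nil {
		panic(err)
	}
	fmt.Println(height, index) // prints 1234 0
}
```

Because the identifier is derived only from the block height and the position within the block, it stays unique per chain and can be replaced later once real hashes exist.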

These are the API Swagger samples produced by the steps above.

Why did we have to do it this way?

Fetching data by polling RPCs seems to be unavoidable at the moment (there’s no way of using a push method such as websocket subscribe or multi-node)
Using the binary is unavoidable => You can get transaction information via HTTP RPC, but the marshal/unmarshal process results in data outside of the ASCII range, making it nearly impossible to parse
We haven't considered WebSocket RPC yet because push vs polling is more important than the speed of a protocol

Main problems

Since the block generation time is unpredictable, you need to periodically (15 seconds) request the "https://rpc.test3.gno.land/block" RPC to check whether the block is generated (Polling)
- If the timing is unlucky, a transaction created now only reaches the DB up to 14 seconds later, which causes a major UX issue in Adena/Gnoscan
HTTP is not that fast, so the sync can fall behind if blocks are generated every second when there are many transactions

Suggestions for solving problems

Add a push type of data communication method (e.g. websocket)
- Push block/transaction events via websocket and store them in your own sync program
- This will solve most of the problems, and be sufficient for now
Multi-nodes
- Modify the gnoland binary to add logic to intervene in the block generation process and save it to the DB
- This will give infra teams more flexibility to process/modify block data on their own

moul commented 1 year ago

https://github.com/gnolang/gno/blob/408fc68d4b3c189dbc6a608c590a86c661ae232a/gno.land/cmd/gnoland/main.go#LL138C1-L138C1 -> CreateEmptyBlocks = true could help make block creation predictable.

jaekwon commented 1 year ago

We can set empty blocks to true; then blocks will come at regular intervals. If empty blocks is set to false, then blocks will come at intervals between the blocktime and the empty-block-timeout, so between 6 seconds and 60 seconds or in between depending on when the next tx comes through.

The solution to poll vs push is to use websockets and to implement what is already in TM1 but not TM2, would be TM2/rpc/core/events.go, where Subscribe is implemented. Subscribe would not be available as an HTTP rest API, only as a websocket request. See also TM1/rpc/core/routes.go which tells the TM1 RPC system that Subscribe isn't available as an HTTP rest API (rpc.NewWSRPCFunc vs rpc.NewRPCFunc).

Basically we should port TM2/rpc/core/events.go but without using the query system. A good first step would be to just not have the query argument at all, and to subscribe to ALL TM events.

Then we can discuss what types of TM events you need to listen to, and we can just filter on those message types. Hopefully we don't have to implement our own query-like system, but if we must, it is as simple and fast as possible. Please add me as reviewer for any related work here. If you only want to know when the next block comes through, that's an easy filter to implement -- filter only for EventNewBlockHeader events, see pkg/bft/types/events.go. EventNewBlockHeader should be lighter weight than EventNewBlock which includes the whole block info.
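The filtering idea can be sketched in Go as follows. The event types below are local, hypothetical stand-ins that only borrow the names mentioned above; they are not the real types from pkg/bft/types/events.go, and the channel-based delivery is an assumption made for illustration rather than how a ported Subscribe would necessarily work.

```go
package main

import "fmt"

// Event is a stand-in for whatever a subscription would deliver.
type Event interface{}

// Hypothetical stand-ins; the real TM2 event types carry far more data.
type EventNewBlockHeader struct{ Height int64 }

type EventNewBlock struct {
	Height int64
	NumTxs int
}

type EventTx struct {
	Height int64
	Index  int
}

// filterHeaders forwards only EventNewBlockHeader values and drops the rest.
// The returned channel is closed once the input channel is closed.
func filterHeaders(in <-chan Event) <-chan EventNewBlockHeader {
	out := make(chan EventNewBlockHeader)
	go func() {
		defer close(out)
		for ev := range in {
			if h, ok := ev.(EventNewBlockHeader); ok {
				out <- h
			}
		}
	}()
	return out
}

func main() {
	in := make(chan Event, 3)
	in <- EventNewBlock{Height: 10, NumTxs: 2}
	in <- EventNewBlockHeader{Height: 10}
	in <- EventTx{Height: 10, Index: 0}
	close(in)

	for h := range filterHeaders(in) {
		fmt.Println("new block header at height", h.Height)
	}
}
```

A sync program that only needs to know when the next block arrives could react to each forwarded header by fetching and storing that block, instead of polling on a timer.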

r3v4s commented 12 months ago

Hello @jaekwon @moul

Tested the current sync process with CreateEmptyBlocks = true, but it didn't help in predicting block creation.

Decreasing CreateEmptyBlocksInterval to 5s does create a block every 5 seconds, but I found 1 small issue.

If a new transaction occurs within 5 seconds, a new block is created immediately (which means it moves up BlockInterval time). Is this intended?

If it is intended, what is the purpose of doing this? Wouldn't it be better to create blocks regularly regardless of new transactions?

What do you think?

Testing

Testing Option 1.

CreateEmptyBlocks = false
CreateEmptyBlocksInterval = 5 * time.Second

> when there is a new tx, a new block(that contains the requested tx) gets created
> and right after that(maybe 1 ~ 2s) another new empty block gets created
> when there isn't any new tx, a new empty block gets created regularly every 5 seconds

Testing Option 2.

CreateEmptyBlocks = true
CreateEmptyBlocksInterval = 5 * time.Second

> when there is a new tx, a new block(that contains the requested tx) gets created
> and right after that(maybe 1 ~ 2s) another new empty block gets created
> when there isn't any new tx, a new empty block gets created regularly every 5 seconds
>> it seems that `CreateEmptyBlocks` has no effect when the block interval is a positive value

Testing Option 3.

CreateEmptyBlocks = false
CreateEmptyBlocksInterval = 0 * time.Second

> when there is a new tx, a new block(that contains the requested tx) gets created
> and right after that(maybe 1 ~ 2s) another new empty block gets created
> when there isn't any new tx, wait for the next tx (=> doesn't create any empty block)

Testing Option 4.

CreateEmptyBlocks = true
CreateEmptyBlocksInterval = 0 * time.Second

> regardless of a new tx, the block is created on a 1-second interval

jaekwon commented 11 months ago

Try increasing TimeoutCommit.

from tm2/pkg/bft/consensus/config/config.go:

// Commit returns the amount of time to wait for straggler votes after receiving +2/3 precommits for a single block (ie. a commit).

These comments could be duplicated above ConsensusConfig for better documentation.

BTW if TimeoutCommit is too low, then validators may appear as if they are offline if they are on the edge of the gossip network, or otherwise somehow slower to catch up or broadcast votes. The cosmos hub (gaia) uses the presence of validators in the Commit (which is +2/3 of precommit votes) to determine the liveness of validators. Not a problem for you if you want a large TimeoutCommit though.

Please reassign if this doesn't work or there are other questions.

r3v4s commented 11 months ago

> Try increasing TimeoutCommit.
>
> from tm2/pkg/bft/consensus/config/config.go:
>
> // Commit returns the amount of time to wait for straggler votes after receiving +2/3 precommits for a single block (ie. a commit).
>
> These comments could be duplicated above ConsensusConfig for better documentation.
>
> BTW if TimeoutCommit is too low, then validators may appear as if they are offline if they are on the edge of the gossip network, or otherwise somehow slower to catch up or broadcast votes. The cosmos hub (gaia) uses the presence of validators in the Commit (which is +2/3 of precommit votes) to determine the liveness of validators. Not a problem for you if you want a large TimeoutCommit though.
>
> Please reassign if this doesn't work or there are other questions.

Big thanks for your comment! I think I have resolved the issue in #969. Please take a look.
