hermeznetwork / hermez-node

Hermez node Go implementation
GNU Affero General Public License v3.0

TODO: Improve how StatsHolder and Synchronizer handle concurrently reads and writes #1087

Closed tclemos closed 3 years ago

tclemos commented 3 years ago

Rationale

https://github.com/hermeznetwork/hermez-node/blob/3fc5b4c531394e9d6a9975e2fa798bbd5029e6a4/synchronizer/synchronizer.go#L97-L100

https://github.com/hermeznetwork/hermez-node/blob/3fc5b4c531394e9d6a9975e2fa798bbd5029e6a4/synchronizer/synchronizer.go#L516

Implementation

Improve the way StatsHolder and Synchronizer handle concurrent read and write operations
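
As context for the fix, a common Go pattern for this kind of concurrent access is to guard the shared stats with a sync.RWMutex. The sketch below is illustrative only; the field and method names are hypothetical and do not match the actual StatsHolder in synchronizer/synchronizer.go:

```go
package main

import (
	"fmt"
	"sync"
)

// StatsHolder is a hypothetical sketch of a stats container that
// guards concurrent reads and writes with a sync.RWMutex.
type StatsHolder struct {
	mu            sync.RWMutex
	lastBlockNum  int64
	syncBlocksPct float64
}

// UpdateSync takes the write lock, so readers never observe a
// half-written pair of fields.
func (s *StatsHolder) UpdateSync(blockNum int64, pct float64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.lastBlockNum = blockNum
	s.syncBlocksPct = pct
}

// Snapshot takes the read lock, so many readers can proceed in
// parallel while writes are serialized.
func (s *StatsHolder) Snapshot() (int64, float64) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.lastBlockNum, s.syncBlocksPct
}

func main() {
	var s StatsHolder
	var wg sync.WaitGroup
	// Many concurrent writers; the mutex keeps each update atomic.
	for i := int64(1); i <= 100; i++ {
		wg.Add(1)
		go func(n int64) {
			defer wg.Done()
			s.UpdateSync(n, float64(n))
		}(i)
	}
	wg.Wait()
	num, pct := s.Snapshot()
	fmt.Println(num >= 1 && num <= 100, pct == float64(num))
}
```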

tclemos commented 3 years ago

Currently, the synchronization module starts from the block after the creation of the smart contract and then goes block by block, looking for smart contract logs, in order to synchronize the data from L1 to L2.

The idea is to make better use of the resources geth exposes, like GraphQL, or even to filter a batch of blocks at once instead of one by one, in order to decrease the time needed to synchronize.

tclemos commented 3 years ago

In order to evaluate speed improvements and compare the different executions, we will use these fixed parameters:

Machine size: AWS c4.xlarge
OS: Ubuntu 20.04
Hermez node version: 1.7.0-rc1
Geth version: 1.10.7 running at 127.0.0.1
Ethereum Network: Goerli - Internal Testnet
Start Block: 4417197
Final Block: 5332486
Config:

[Synchronizer]
SyncLoopInterval = "100ms"
StatsUpdateBlockNumDiffThreshold = 1000
StatsUpdateFrequencyDivider = 1000

We will perform at least 3 full synchronizations and see how long it takes from the first block until the final block specified above.

In order to wipe the data in the node we will use the wipedbs command, like so:

heznode wipedbs --mode coord --cfg cfg.buidler.toml

In order to execute the node, we will use the run command along with sync mode, like so:

nohup heznode run --mode sync --cfg cfg.buidler.toml > hermez-node.log 2>&1 &
tclemos commented 3 years ago

1st Attempt: 5 hours, 34 minutes and 44 seconds

...
2021-08-18T12:21:05Z    DEBUG        synchronizer/synchronizer.go:554        Syncing...      {"block": 4417197, "ethLastBlock": {"Num":5341209,"Timestamp":"2021-08-18T12:20:55Z","Hash":"0xf105712e0ebc266c4e1d662d2e845dee127db7ea8951f01df5eaa44a62af031b"}}
2021-08-18T12:21:05Z    DEBUG        synchronizer/synchronizer.go:674        Synced block    {"syncLastBlockNum": 4417197, "syncBlocksPerc": 0.0001082235855989039, "ethLastBlockNum": 5341209}
...
2021-08-18T17:55:49Z    DEBUG        synchronizer/synchronizer.go:554        Syncing...      {"block": 5332486, "ethLastBlock": {"Num":5342547,"Timestamp":"2021-08-18T17:55:41Z","Hash":"0xf08d47d62d5fc297a065d40b568c0faf822ab55d28c5e89a261dfee33dfd0bf5"}}
2021-08-18T17:55:49Z    DEBUG        synchronizer/synchronizer.go:674        Synced block    {"syncLastBlockNum": 5332486, "syncBlocksPerc": 98.91273689659383, "ethLastBlockNum": 5342547}
...
tclemos commented 3 years ago

2nd Attempt: 5 hours, 44 minutes and 47 seconds

...
2021-08-18T18:23:05Z    DEBUG        synchronizer/synchronizer.go:554        Syncing...      {"block": 4417197, "ethLastBlock": {"Num":5342656,"Timestamp":"2021-08-18T18:22:57Z","Hash":"0x1fe6bf009878310a0ba881b4aaaa538e31f4271b738f6ad66131ccbf90be7025"}}
2021-08-18T18:23:05Z    DEBUG        synchronizer/synchronizer.go:674        Synced block    {"syncLastBlockNum": 4417197, "syncBlocksPerc": 0.00010805437296047371, "ethLastBlockNum": 5342656}
...
2021-08-19T00:07:52Z    DEBUG        synchronizer/synchronizer.go:554        Syncing...      {"block": 5332486, "ethLastBlock": {"Num":5344035,"Timestamp":"2021-08-19T00:07:42Z","Hash":"0x0b3923149c5dd3bcf92781d547ab324e6c25d33fd81fc2cab02fbc87bc396904"}}
2021-08-19T00:07:52Z    DEBUG        synchronizer/synchronizer.go:674        Synced block    {"syncLastBlockNum": 5332486, "syncBlocksPerc": 98.75393676787446, "ethLastBlockNum": 5344035}
...
tclemos commented 3 years ago

Geth exposes a GraphQL interface to provide information about the Ethereum network.

You can access it in a browser via the /graphql/ui URL of your geth node.

If it is not available, make sure you started geth with the flags that enable the GraphQL interface, like so:

--graphql --graphql.corsdomain '*' --graphql.vhosts '*'

This is a GraphQL query to get information from multiple blocks at once:

{
  blocks(from: 4417197, to: 4427197) {
    number
    hash
    parent {
      hash
    }
    timestamp
  }
}
jeffprestes commented 3 years ago

If you allow me to make a suggestion, @tclemos, the example below shows you can also refine the query results by rollup contract address and event:

{
  blocks(from: 4417196, to: 5346878) {
    transactions {
      hash
      from {
        address
      }
      logs {
        account {
          address
        }
        topics
      }
    }
  }
  logs(filter: {fromBlock: 4417196, addresses: ["0xf08a226b67a8a9f99ccfcf51c50867bc18a54f53"], topics: [["0x7cae662d4cfa9d9c5575c65f0cc41a858c51ca14ebcbd02a802a62376c3ad238"]]}) {
    transaction {
      hash
      from {
        address
      }
      block {
        number
        timestamp
      }
    }
  }
}
tclemos commented 3 years ago

For sure. The Geth Ethereum client also has a method to query logs by range, like this:

// Assumes the ethereum and common packages from github.com/ethereum/go-ethereum,
// an initialized client (ethclient.Client), and the three contract addresses.
q := ethereum.FilterQuery{
    FromBlock: big.NewInt(fromBlock),
    ToBlock:   big.NewInt(toBlock),
    Addresses: []common.Address{
        rollUpAddress,
        actionAddress,
        withdrawalDelayerAddress,
    },
}

logs, err := client.FilterLogs(ctx, q)

But I couldn't find a method to query blocks by range; this is the reason I'm exploring the GraphQL interface.
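
As an aside, geth's JSON-RPC endpoint accepts batched requests, so a range of blocks can be fetched in a single HTTP round-trip even without GraphQL. This is a stdlib-only sketch under that assumption; buildBlockBatch is a hypothetical helper and the node URL is an assumption:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type rpcReq struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int64         `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

// buildBlockBatch builds one JSON-RPC batch that asks for every block
// in [from, to] via eth_getBlockByNumber, so the whole range can be
// fetched with a single HTTP request.
func buildBlockBatch(from, to int64) []rpcReq {
	batch := make([]rpcReq, 0, to-from+1)
	for n := from; n <= to; n++ {
		batch = append(batch, rpcReq{
			JSONRPC: "2.0",
			ID:      n,
			Method:  "eth_getBlockByNumber",
			// false: return only the header fields, not full transactions.
			Params: []interface{}{fmt.Sprintf("0x%x", n), false},
		})
	}
	return batch
}

func main() {
	body, err := json.Marshal(buildBlockBatch(4417197, 4417296))
	if err != nil {
		panic(err)
	}
	// Assumed endpoint: a local geth JSON-RPC server.
	resp, err := http.Post("http://127.0.0.1:8545", "application/json", bytes.NewReader(body))
	if err != nil {
		fmt.Println("request failed (is geth running?):", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```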

jeffprestes commented 3 years ago

I see... we are talking only about GraphQL. BTW, I've made some adjustments. This query is working perfectly, filtering logs of the NewForge and ForgeBatch events:

{
  logs (filter: {fromBlock: 5307649, topics: [["0xe00040c8a3b0bf905636c26924e90520eafc5003324138236fddee2d34588618", "0x7cae662d4cfa9d9c5575c65f0cc41a858c51ca14ebcbd02a802a62376c3ad238"]]}) {
    account {
      address
    }
    transaction {
        hash
        from {
            address
        }
    }
    topics 
  }
} 

You can try it against your geth node on Goerli and it should work. I've tried it successfully using @mfcastellani's node.

jeffprestes commented 3 years ago

After our conversation, this is a query to filter the ForgeBatch event:

{
  logs(filter: {fromBlock: 4417196, topics: 
  [
    ["0x7cae662d4cfa9d9c5575c65f0cc41a858c51ca14ebcbd02a802a62376c3ad238"]
  ]}) {
    account {
      address
    }
    transaction {
      hash
      from {
        address
      }
      inputData
    }
    topics
  }
}
tclemos commented 3 years ago

3rd Attempt: 5 hours, 40 minutes and 9 seconds

...
2021-08-19T11:15:10Z    DEBUG        synchronizer/synchronizer.go:554        Syncing...      {"block": 4417197, "ethLastBlock": {"Num":5346704,"Timestamp":"2021-08-19T11:15:03Z","Hash":"0x5322e08a12457b480b0bfc06ed0e9c4c9283e793b5af1c83fa37f813b1398dce"}}
2021-08-19T11:15:10Z    DEBUG        synchronizer/synchronizer.go:674        Synced block    {"syncLastBlockNum": 4417197, "syncBlocksPerc": 0.00010758379701949848, "ethLastBlockNum": 5346704}
...
2021-08-19T16:55:19Z    DEBUG        synchronizer/synchronizer.go:554        Syncing...      {"block": 5332486, "ethLastBlock": {"Num":5348064,"Timestamp":"2021-08-19T16:55:04Z","Hash":"0x1dcc45166e8e8b64651c8a58b551cd31bd18c8f8c5c1572d548b5f2d75cf4b63"}}
2021-08-19T16:55:19Z    DEBUG        synchronizer/synchronizer.go:674        Synced block    {"syncLastBlockNum": 5332486, "syncBlocksPerc": 98.32650816227435, "ethLastBlockNum": 5348064}
...
tclemos commented 3 years ago

Summary:

1st Attempt: 20084 seconds (5 hours, 34 minutes, and 44 seconds)
2nd Attempt: 20687 seconds (5 hours, 44 minutes, and 47 seconds)
3rd Attempt: 20409 seconds (5 hours, 40 minutes, and 9 seconds)

Average: 20393 seconds (5 hours, 39 minutes, and 53 seconds)

tclemos commented 3 years ago

Attempt - Synchronizing only blocks with events

4 hours, 49 minutes and 11 seconds

IMPORTANT: pending data validation; we just synchronized the blocks, but we didn't check the integrity of the synchronized data.

Since block 5332486 does not have Hermez smart contract events, we will be using the closest blocks with events before and after it.

The blocks verified here are:

Last block with events before: 5308649
Ref block: 5332486
First block with events after: 5348449

How the code of this version works

Instead of running through all the blocks one by one, we look ahead in the next 10000 blocks for the next blocks in the sequence with events in the RollUp, Auction, and Withdrawal Delayer contracts, so that we sync only blocks containing significant data. Once blocks with events are found, they are stored in memory until we synchronize them all, and then we start the next cycle to find the following ones, until we reach the current block of the network chain. Only blocks with events in the contracts are synchronized; all other blocks are ignored, since they contain no data to be synchronized.
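
The scanning strategy described above can be sketched as pure Go logic. The window size, helper names, and the stand-in for the FilterLogs results are all hypothetical:

```go
package main

import "fmt"

// windowSize: scan this many blocks per filter query, as described above.
const windowSize = 10000

// windows splits [from, last] into [start, end] ranges of at most
// windowSize blocks; each range would become one FilterLogs call
// against the RollUp, Auction, and Withdrawal Delayer contracts.
func windows(from, last int64) [][2]int64 {
	var out [][2]int64
	for start := from; start <= last; start += windowSize {
		end := start + windowSize - 1
		if end > last {
			end = last
		}
		out = append(out, [2]int64{start, end})
	}
	return out
}

// blocksWithEvents is a stand-in for an actual FilterLogs call: it
// returns the blocks inside [start, end] that appear in eventBlocks
// (sorted ascending). Only these blocks would be synchronized.
func blocksWithEvents(start, end int64, eventBlocks []int64) []int64 {
	var out []int64
	for _, b := range eventBlocks {
		if b >= start && b <= end {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	// Hypothetical event blocks; everything else is skipped.
	events := []int64{4417200, 4425000, 4431111}
	for _, w := range windows(4417197, 4437196) {
		fmt.Println(w, blocksWithEvents(w[0], w[1], events))
	}
}
```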

Results

It's possible to see a slight increase in the speed of the synchronization process. We brought the time down from an average of 20393 seconds to 17351 seconds, a reduction of roughly 15% in total synchronization time.

Considerations

Since the Hermez network is not yet running at full capacity, there are a lot of Ethereum blocks we can ignore; the number of ignored blocks tends to decrease over time as adoption of the network grows.

Logs

...
2021-08-20T16:48:38Z    DEBUG        synchronizer/synchronizer.go:561        Syncing...      {"block": 4417197, "ethLastBlock": {"Num":5353797,"Timestamp":"2021-08-20T16:48:21Z","Hash":"0x76a2044a3315a9f7be3f5df0c5d80e3a2892d2518b237e6362c4e730652eac6c"}}
2021-08-20T16:48:38Z    DEBUG        synchronizer/synchronizer.go:681        Synced block    {"syncLastBlockNum": 4417197, "syncBlocksPerc": 0.00010676905106870481, "ethLastBlockNum": 5353797}
...
2021-08-20T21:37:49Z    DEBUG        synchronizer/synchronizer.go:561        Syncing...      {"block": 5308649, "ethLastBlock": {"Num":5354712,"Timestamp":"2021-08-20T20:37:06Z","Hash":"0xbf34567c33deb71428b76f71c1b936f8e943ba00fd68a8f5f649cb021d906cac"}}
2021-08-20T21:37:49Z    DEBUG        synchronizer/synchronizer.go:681        Synced block    {"syncLastBlockNum": 5308649, "syncBlocksPerc": 95.08669718703467, "ethLastBlockNum": 5354712}
...
2021-08-20T21:37:49Z    DEBUG        synchronizer/synchronizer.go:561        Syncing...      {"block": 5348449, "ethLastBlock": {"Num":5354712,"Timestamp":"2021-08-20T20:37:06Z","Hash":"0xbf34567c33deb71428b76f71c1b936f8e943ba00fd68a8f5f649cb021d906cac"}}
2021-08-20T21:37:49Z    DEBUG        synchronizer/synchronizer.go:681        Synced block    {"syncLastBlockNum": 5348449, "syncBlocksPerc": 99.33195806791564, "ethLastBlockNum": 5354712}
...
tclemos commented 3 years ago

Continuing to explore the new strategy to sync blocks, we decided to move to the next step in order to guarantee the synchronized data is valid, which is forging a new batch.

The first attempt to forge a batch with this new version failed; the node was not "triggering" the forging process. The reason behind this problem is that the node was developed to sync the whole chain without skipping any block. Given that, there is a check to define whether the node is synced or out of sync, and different actions are executed according to this check. Synced means the last saved block in the node is equal to the last eth block. Since we are skipping the blocks without events, if the last block in the chain does not contain an event, the node was considered out of sync even though it was actually synced.

Also, when a new block is synchronized, the synchronizer sends a message to the coordinator in order to update the node state. Since we are skipping all the blocks without SC events in L1, if the node was restarted and no event happened since it started, the state would not be updated with the next forgers' data, because no block was being synchronized.

In order to fix this, we must also synchronize the last block in the chain whenever there is no event between the last saved block and the last eth block. It works like this:

In summary, when synchronizing a lot of blocks, like starting a new node or restarting one after a period without syncing, the node is smart enough to skip all the unnecessary blocks in order to save time. After the whole chain is synchronized, it synchronizes every following block to keep the node in the Synced state and allow batches to be forged.
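
The rule described above can be condensed into a small decision function. nextBlockToSync is a hypothetical helper for illustration, not the actual node code:

```go
package main

import "fmt"

// nextBlockToSync sketches the rule above: jump ahead to the next
// block that has contract events, but when no events remain between
// the last saved block and the chain head, sync the head itself so
// the node stays in the Synced state and batches can be forged.
// nextEventBlock < 0 means "no event found in the remaining range".
func nextBlockToSync(lastSynced, lastEthBlock, nextEventBlock int64) int64 {
	if nextEventBlock > lastSynced && nextEventBlock <= lastEthBlock {
		return nextEventBlock
	}
	return lastEthBlock
}

func main() {
	fmt.Println(nextBlockToSync(100, 200, 150)) // event ahead: jump to it -> 150
	fmt.Println(nextBlockToSync(100, 200, -1))  // no events left: sync the head -> 200
}
```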

This is the batch I have forged with the version implementing the explanation above:

[image: screenshot of the forged batch]

tclemos commented 3 years ago

PR merged #1113