status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Run multiple nimbus-eth1 mainnet instances #193

Open tersec opened 1 month ago

tersec commented 1 month ago

Initially, these don't have to have validators attached to them, but function as a fourth backing EL in addition to Nethermind, Erigon, and Geth.

To facilitate syncing, data can be provided through a combination of era file syncing and/or a pre-prepared database synced close to the current mainnet head.

arnetheduck commented 3 weeks ago

10 instances each for mainnet / holesky / sepolia - the database takes 2-3 weeks to create, so we'll pre-seed the nodes with a pre-prepared database copy

each instance needs about 300gb disk for the state - we should also think about setting it up in such a way that they have access to era1/era stores for historical block data (a single copy shared between the nodes)
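The single shared era1/era store could be wired up with symlinks into each instance's data directory. A minimal sketch, assuming per-instance directories under a common root (all paths here are hypothetical; the demo runs under a temp dir instead of the real host paths):

```shell
# Sketch: one shared era store, symlinked into each instance's data dir.
# BASE is a temp dir for the demo; on the hosts it would be something like /docker.
BASE="$(mktemp -d)"
SHARED_ERA="$BASE/era"
mkdir -p "$SHARED_ERA"
for i in 01 02 03; do
  inst="$BASE/nimbus-eth1-mainnet-$i/data"   # hypothetical per-instance layout
  mkdir -p "$inst"
  ln -sfn "$SHARED_ERA" "$inst/era"          # every instance reads the same era files
done
```

This keeps a single on-disk copy of the historical block data while each node sees it at a local path.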

jakubgs commented 3 weeks ago

From a conversation with Jacek we can start with a setup like this and then grow from there:

The priority is deploying nimbus-eth1 nodes on the mainnet network first.

yakimant commented 2 weeks ago

nimbus.mainnet has enough space after the re-sync; I will put it on the /docker volume together with geth.

❯ ansible -i ansible/inventory/test nimbus-mainnet-metal -a 'df -h /data /docker' -f1
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.9T  1.5T  1.4T  52% /data
/dev/sdc        3.5T  1.4T  1.9T  43% /docker
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.9T  1.5T  1.3T  55% /data
/dev/sdc        3.5T  1.4T  1.9T  43% /docker
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.9T  950G  1.8T  35% /data
/dev/sdc        3.5T  1.4T  1.9T  43% /docker
linux-04.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.9T  1.1T  1.8T  38% /data
/dev/sdc        3.5T  1.4T  1.9T  43% /docker
linux-05.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.9T  943G  1.8T  34% /data
/dev/sdc        3.5T  1.4T  1.9T  43% /docker
linux-06.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.9T  946G  1.8T  34% /data
/dev/sdc        3.5T  1.4T  1.9T  43% /docker
linux-07.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.9T  1.1T  1.7T  41% /data
/dev/sdc        3.5T  1.4T  2.0T  41% /docker
yakimant commented 2 weeks ago

nimbus-eth1 is running on linux-01.ih-eu-mda1.nimbus.mainnet, attached to its beacon nodes.

Here is its config template: https://github.com/status-im/infra-role-nimbus-eth1/blob/master/templates/nimbus-eth1.service.j2

Looks like it needs some additional configuration regarding syncing (a prepared database or era files).

Found other config options here: https://github.com/status-im/nimbus-eth1/blob/master/nimbus/config.nim

yakimant commented 2 weeks ago

We have those era files at the host:

❯ ls -1 /data/era/
mainnet-00000-4b363db9.era
...
mainnet-01198-7fa25a94.era

Shall I point nimbus-eth1 at it with --era-dir /data/era? Or do I need to put them in data/shared_mainnet_0/era?

yakimant commented 2 weeks ago

FYI

Beacon node EL stats:

(screenshot: Beacon node EL stats, 2024-08-28 17:27)

Errors in nimbus-eth1 logs (/var/log/service/nimbus-eth1-mainnet-master/service.log):

DBG 2024-08-28 15:25:36.018+00:00 Discovery send failed                      topics="eth p2p discovery" msg="(97) Address family not supported by protocol"
...
ERR 2024-08-28 15:26:39.042+00:00 Unexpected exception in rlpxAccept         topics="eth p2p rlpx" exc=EthP2PError err="Eth handshake for different network"
...
WRN 2024-08-28 15:27:32.303+00:00 Error while handling RLPx message          topics="eth p2p rlpx" peer=Node[37.24.131.128:30306] msg=newBlockHashes err="block announcements disallowed"
...
ERR 2024-08-28 15:28:23.082+00:00 Unexpected exception in rlpxAccept         topics="eth p2p rlpx" exc=EthP2PError err="Eth handshake for different network"
...
WRN 2024-08-28 15:28:29.446+00:00 Error while handling RLPx message          topics="eth p2p rlpx" peer=Node[136.244.57.56:30345] msg=newBlock err="block broadcasts disallowed"

Metrics (curl -sSf http://0:9401/metrics | grep -v '#' | sort):

discv4_routing_table_nodes 8307.0
discv4_routing_table_nodes_created 1724854339.0
nec_import_block_number 0.0
nec_import_block_number_created 1724854339.0
nec_imported_blocks_created 1724854339.0
nec_imported_blocks_total 0.0
nec_imported_gas_created 1724854339.0
nec_imported_gas_total 0.0
nec_imported_transactions_created 1724854339.0
nec_imported_transactions_total 0.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[desc_identifiers.RootedVertexID, desc_identifiers.HashKey]"} 2097184.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[desc_identifiers.RootedVertexID, desc_structural.VertexRef]"} 1048608.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[desc_identifiers.VertexID, KeyedQueueItem[desc_identifiers.VertexID, desc_identifiers.HashKey]]"} 1179680.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[eth_types.EthAddress, chain_config.GenesisAccount]"} 1714336.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[eth_types.Hash256, desc_structural.VertexRef]"} 1048608.0
nim_gc_heap_instance_occupied_bytes{type_name="Node"} 6927976.0
nim_gc_heap_instance_occupied_bytes{type_name="OrderedKeyValuePairSeq[kademlia.TimeKey, system.int64]"} 1310752.0
nim_gc_heap_instance_occupied_bytes{type_name="seq[byte]"} 10073653.0
nim_gc_heap_instance_occupied_bytes{type_name="seq[OutstandingRequest]"} 946176.0
nim_gc_heap_instance_occupied_bytes{type_name="VertexRef"} 3150992.0
nim_gc_heap_instance_occupied_summed_bytes 34260237.0
nim_gc_mem_bytes_created{thread_id="3337631"} 1724854350.0
nim_gc_mem_bytes{thread_id="3337631"} 81338368.0
nim_gc_mem_occupied_bytes_created{thread_id="3337631"} 1724854350.0
nim_gc_mem_occupied_bytes{thread_id="3337631"} 38001264.0
process_cpu_seconds_total 97.06
process_max_fds 1024.0
process_open_fds 56.0
process_resident_memory_bytes 137035776.0
process_start_time_seconds 1724854339.4
process_virtual_memory_bytes 1152454656.0
rlpx_accept_failure_created{reason=""} 1724854345.0
rlpx_accept_failure_created{reason="AlreadyConnected"} 1724854469.0
rlpx_accept_failure_created{reason="EthP2PError"} 1724854345.0
rlpx_accept_failure_created{reason="MessageTimeout"} 1724854975.0
rlpx_accept_failure_created{reason="P2PInternalError"} 1724855703.0
rlpx_accept_failure_created{reason="UselessPeerError"} 1724854459.0
rlpx_accept_failure_total{reason=""} 298.0
rlpx_accept_failure_total{reason="AlreadyConnected"} 119.0
rlpx_accept_failure_total{reason="EthP2PError"} 131.0
rlpx_accept_failure_total{reason="MessageTimeout"} 4.0
rlpx_accept_failure_total{reason="P2PInternalError"} 1.0
rlpx_accept_failure_total{reason="UselessPeerError"} 43.0
rlpx_accept_success_created 1724854339.0
rlpx_accept_success_total 117.0
rlpx_connected_peers 17.0
rlpx_connected_peers_created 1724854339.0
rlpx_connect_failure_created{reason=""} 1724854418.0
rlpx_connect_failure_created{reason="P2PHandshakeError"} 1724854418.0
rlpx_connect_failure_created{reason="ProtocolError"} 1724854418.0
rlpx_connect_failure_created{reason="RlpxHandshakeTransportError"} 1724854418.0
rlpx_connect_failure_created{reason="TransportConnectError"} 1724854418.0
rlpx_connect_failure_created{reason="UselessRlpxPeerError"} 1724854418.0
rlpx_connect_failure_total{reason=""} 37480.0
rlpx_connect_failure_total{reason="P2PHandshakeError"} 2021.0
rlpx_connect_failure_total{reason="ProtocolError"} 1465.0
rlpx_connect_failure_total{reason="RlpxHandshakeTransportError"} 33292.0
rlpx_connect_failure_total{reason="TransportConnectError"} 546.0
rlpx_connect_failure_total{reason="UselessRlpxPeerError"} 156.0
rlpx_connect_success_created 1724854339.0
rlpx_connect_success_total 180.0
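For a quick health check, the relevant counters can be pulled out of the metrics dump with awk. A sketch over sample lines from the output above (on the host, the input would come from `curl -sSf http://0:9401/metrics` instead of the inline string):

```shell
# Extract import progress and peer count from a metrics dump.
# METRICS holds sample lines copied from the output above.
METRICS='nec_import_block_number 0.0
nec_imported_blocks_total 0.0
rlpx_connected_peers 17.0'
BLOCK=$(printf '%s\n' "$METRICS" | awk '/^nec_import_block_number /{print $2}')
PEERS=$(printf '%s\n' "$METRICS" | awk '/^rlpx_connected_peers /{print $2}')
echo "imported up to block $BLOCK, $PEERS peers connected"
```

A block number stuck at 0 despite connected peers confirms that no blocks have been imported yet.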
arnetheduck commented 2 weeks ago

Errors in nimbus-eth1 logs:

cc @mjfh can you take a look at this?

see https://github.com/status-im/nimbus-eth2/blob/unstable/docs/logging.md for our logging levels. In particular, remote nodes doing strange things should never result in any logs above debug level. From the point of view of Nimbus, it is "normal" for remote nodes to misbehave, and we should have logic in place that deals with the misbehavior rather than raising the issue to the user via logs. These are expected conditions (nodes that do strange things exist), so they are not errors, warnings, or even info.

yakimant commented 2 weeks ago

Exporting era1 files can be done like this:

sudo geth --datadir=/docker/geth-mainnet/node/data  --mainnet export-history /docker/era1 0 15537393

where 15537393 is the last block before the merge.
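As a sanity check on that export range: assuming 8192 blocks per era1 file (the era file granularity), the pre-merge range 0..15537393 should produce this many files:

```shell
# Rough count of mainnet era1 files covering blocks 0..15537393,
# assuming 8192 blocks per era1 file.
LAST_PREMERGE_BLOCK=15537393
BLOCKS_PER_ERA=8192
FILES=$(( LAST_PREMERGE_BLOCK / BLOCKS_PER_ERA + 1 ))
echo "$FILES"   # → 1897
```

That matches file names running from mainnet-00000 up to mainnet-01896.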

yakimant commented 6 days ago

Shortcut for era1 files suggested by Jacek: https://era1.ethportal.net

Downloaded to: linux-01.ih-eu-mda1.nimbus.mainnet:/docker/era1

Checksums match the checksum file they provide.

yakimant commented 3 days ago

I guess the import from era files should be done like this:

/docker/nimbus-eth1-mainnet-master/repo/build/nimbus import --era1-dir=/docker/era1 --era-dir=/data/era

With current speed it should take ~1h to import.

yakimant commented 3 days ago

RPC API doesn't show much:

❯ ./rpc.sh eth_syncing
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "startingBlock": "0x0",
    "currentBlock": "0x0",
    "highestBlock": "0x0"
  }
}
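`rpc.sh` is presumably a thin JSON-RPC wrapper; a hedged sketch of the equivalent raw call (the endpoint and port are assumptions, to be adjusted to the node's configured RPC settings):

```shell
# Build the eth_syncing JSON-RPC request (what rpc.sh presumably sends).
method="eth_syncing"
payload=$(printf '{"jsonrpc":"2.0","method":"%s","params":[],"id":1}' "$method")
echo "$payload"
# Against a live node (hypothetical endpoint):
#   curl -sSf -X POST -H 'Content-Type: application/json' \
#     --data "$payload" http://127.0.0.1:8545
```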
tersec commented 3 days ago

Syntactically, this is a valid, if minimalistic, response: https://ethereum.org/en/developers/docs/apis/json-rpc/#eth_syncing

https://github.com/status-im/nimbus-eth1/blob/178d77ab310a79f3fa3a350d3546b607145a6aab/nimbus/core/chain/forked_chain.nim#L356-365 sets highestBlock:

proc setHead(c: ForkedChainRef,
             headHash: Hash256,
             number: BlockNumber) =
  # TODO: db.setHead should not read from db anymore
  # all canonical chain marking
  # should be done from here.
  discard c.db.setHead(headHash)

  # update global syncHighest
  c.com.syncHighest = number

but https://github.com/status-im/nimbus-eth1/blob/master/nimbus/nimbus_import.nim never calls setHead(...).

startingBlock is arguably correct.

currentBlock is internally `syncCurrent`, and is updated from `nimbus import` by its persistBlocks(...) call, but these `syncCurrent`/`syncHighest`/`syncStart` variables basically only reflect the syncing happening at that time, not anything the `nimbus import` command did per se. When Nimbus is run after the `nimbus import`, those are simply never changed from their defaults, because no syncing is happening.

However, it reports syncing because

  server.rpc("eth_syncing") do() -> SyncingStatus:
    ## Returns SyncObject or false when not syncing.
    # TODO: make sure we are not syncing
    # when we reach the recent block
    let numPeers = node.peerPool.connectedNodes.len
    if numPeers > 0:
      var sync = SyncObject(
        startingBlock: w3Qty com.syncStart,
        currentBlock : w3Qty com.syncCurrent,
        highestBlock : w3Qty com.syncHighest
      )
      result = SyncingStatus(syncing: true, syncObject: sync)
    else:
      result = SyncingStatus(syncing: false)

which isn't really correct. Having peers does not imply syncing.

So the issue is basically that it's not syncing, but falsely showing that it is syncing.

That it's not syncing when connected to an EL is itself a bug, but right now it's an expected, known issue being addressed. I'm not sure I've seen the falsely-showing-syncing problem reported before.

tersec commented 3 days ago

https://github.com/status-im/nimbus-eth1/issues/2618

tersec commented 3 days ago

https://github.com/status-im/nimbus-eth1/pull/2619

yakimant commented 2 days ago

Progress