tersec opened 1 month ago
10 instances each for mainnet / holesky / sepolia - the database takes 2-3 weeks to create, so we'll pre-seed the nodes with a pre-prepared database copy
each instance needs about 300 GB of disk for the state - we should also think about setting things up so that they have access to era1/era stores for historical block data (a single copy shared between the nodes)
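A minimal pre-seeding sketch (host names, paths, and the service name here are all illustrative, not our actual inventory): stop the seeded source node so the database is quiescent, ship the copy, then restart both sides.
❯ ssh seed-host 'sudo systemctl stop nimbus-eth1-mainnet'    # service name assumed
❯ rsync -a --info=progress2 seed-host:/data/nimbus-eth1/db/ /data/nimbus-eth1/db/    # ~300 GB of state
❯ ssh seed-host 'sudo systemctl start nimbus-eth1-mainnet'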
From a conversation with Jacek we can start with a setup like this and then grow from there:
mainnet
- Extend storage on all 7 hosts and add one nimbus-eth1 node on each host, attached to all BNs on it.
sepolia
- Extend storage on the host and add four nimbus-eth1 nodes, one for each of the BNs on it.
holesky
- Replace Erigon EL nodes with nimbus-eth1 nodes on all 10 erigon-01 hosts.
The priority is on deploying nimbus-eth1 nodes on the mainnet network first.
nimbus.mainnet has enough space after re-sync; I will put it on the /docker volume together with geth.
❯ ansible -i ansible/inventory/test nimbus-mainnet-metal -a 'df -h /data /docker' -f1
linux-01.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 2.9T 1.5T 1.4T 52% /data
/dev/sdc 3.5T 1.4T 1.9T 43% /docker
linux-02.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 2.9T 1.5T 1.3T 55% /data
/dev/sdc 3.5T 1.4T 1.9T 43% /docker
linux-03.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 2.9T 950G 1.8T 35% /data
/dev/sdc 3.5T 1.4T 1.9T 43% /docker
linux-04.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 2.9T 1.1T 1.8T 38% /data
/dev/sdc 3.5T 1.4T 1.9T 43% /docker
linux-05.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 2.9T 943G 1.8T 34% /data
/dev/sdc 3.5T 1.4T 1.9T 43% /docker
linux-06.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 2.9T 946G 1.8T 34% /data
/dev/sdc 3.5T 1.4T 1.9T 43% /docker
linux-07.ih-eu-mda1.nimbus.mainnet | CHANGED | rc=0 >>
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 2.9T 1.1T 1.7T 41% /data
/dev/sdc 3.5T 1.4T 2.0T 41% /docker
nimbus-eth1 is running on linux-01.ih-eu-mda1.nimbus.mainnet, attached to its beacon nodes.
Here is its config template: https://github.com/status-im/infra-role-nimbus-eth1/blob/master/templates/nimbus-eth1.service.j2
Looks like it needs some additional configuration with regard to syncing (a prepared database or era files).
Found other config options here: https://github.com/status-im/nimbus-eth1/blob/master/nimbus/config.nim
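Rather than guessing from config.nim, the era-related flags the binary actually accepts can be listed from its help output (assuming the usual --help behaviour; the build path is the one used for the import command later in this thread):
❯ /docker/nimbus-eth1-mainnet-master/repo/build/nimbus --help | grep -i era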
We have these era files on the host:
❯ ls -1 /data/era/
mainnet-00000-4b363db9.era
...
mainnet-01198-7fa25a94.era
Shall I point nimbus-eth1 to them with --era-dir /data/era?
Or do I need to put them in data/shared_mainnet_0/era?
FYI, beacon node EL stats:
Errors in nimbus-eth1 logs (/var/log/service/nimbus-eth1-mainnet-master/service.log):
DBG 2024-08-28 15:25:36.018+00:00 Discovery send failed topics="eth p2p discovery" msg="(97) Address family not supported by protocol"
...
ERR 2024-08-28 15:26:39.042+00:00 Unexpected exception in rlpxAccept topics="eth p2p rlpx" exc=EthP2PError err="Eth handshake for different network"
...
WRN 2024-08-28 15:27:32.303+00:00 Error while handling RLPx message topics="eth p2p rlpx" peer=Node[37.24.131.128:30306] msg=newBlockHashes err="block announcements disallowed"
...
ERR 2024-08-28 15:28:23.082+00:00 Unexpected exception in rlpxAccept topics="eth p2p rlpx" exc=EthP2PError err="Eth handshake for different network"
...
WRN 2024-08-28 15:28:29.446+00:00 Error while handling RLPx message topics="eth p2p rlpx" peer=Node[136.244.57.56:30345] msg=newBlock err="block broadcasts disallowed"
Metrics (curl -sSf http://0:9401/metrics | grep -v '#' | sort):
discv4_routing_table_nodes 8307.0
discv4_routing_table_nodes_created 1724854339.0
nec_import_block_number 0.0
nec_import_block_number_created 1724854339.0
nec_imported_blocks_created 1724854339.0
nec_imported_blocks_total 0.0
nec_imported_gas_created 1724854339.0
nec_imported_gas_total 0.0
nec_imported_transactions_created 1724854339.0
nec_imported_transactions_total 0.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[desc_identifiers.RootedVertexID, desc_identifiers.HashKey]"} 2097184.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[desc_identifiers.RootedVertexID, desc_structural.VertexRef]"} 1048608.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[desc_identifiers.VertexID, KeyedQueueItem[desc_identifiers.VertexID, desc_identifiers.HashKey]]"} 1179680.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[eth_types.EthAddress, chain_config.GenesisAccount]"} 1714336.0
nim_gc_heap_instance_occupied_bytes{type_name="KeyValuePairSeq[eth_types.Hash256, desc_structural.VertexRef]"} 1048608.0
nim_gc_heap_instance_occupied_bytes{type_name="Node"} 6927976.0
nim_gc_heap_instance_occupied_bytes{type_name="OrderedKeyValuePairSeq[kademlia.TimeKey, system.int64]"} 1310752.0
nim_gc_heap_instance_occupied_bytes{type_name="seq[byte]"} 10073653.0
nim_gc_heap_instance_occupied_bytes{type_name="seq[OutstandingRequest]"} 946176.0
nim_gc_heap_instance_occupied_bytes{type_name="VertexRef"} 3150992.0
nim_gc_heap_instance_occupied_summed_bytes 34260237.0
nim_gc_mem_bytes_created{thread_id="3337631"} 1724854350.0
nim_gc_mem_bytes{thread_id="3337631"} 81338368.0
nim_gc_mem_occupied_bytes_created{thread_id="3337631"} 1724854350.0
nim_gc_mem_occupied_bytes{thread_id="3337631"} 38001264.0
process_cpu_seconds_total 97.06
process_max_fds 1024.0
process_open_fds 56.0
process_resident_memory_bytes 137035776.0
process_start_time_seconds 1724854339.4
process_virtual_memory_bytes 1152454656.0
rlpx_accept_failure_created{reason=""} 1724854345.0
rlpx_accept_failure_created{reason="AlreadyConnected"} 1724854469.0
rlpx_accept_failure_created{reason="EthP2PError"} 1724854345.0
rlpx_accept_failure_created{reason="MessageTimeout"} 1724854975.0
rlpx_accept_failure_created{reason="P2PInternalError"} 1724855703.0
rlpx_accept_failure_created{reason="UselessPeerError"} 1724854459.0
rlpx_accept_failure_total{reason=""} 298.0
rlpx_accept_failure_total{reason="AlreadyConnected"} 119.0
rlpx_accept_failure_total{reason="EthP2PError"} 131.0
rlpx_accept_failure_total{reason="MessageTimeout"} 4.0
rlpx_accept_failure_total{reason="P2PInternalError"} 1.0
rlpx_accept_failure_total{reason="UselessPeerError"} 43.0
rlpx_accept_success_created 1724854339.0
rlpx_accept_success_total 117.0
rlpx_connected_peers 17.0
rlpx_connected_peers_created 1724854339.0
rlpx_connect_failure_created{reason=""} 1724854418.0
rlpx_connect_failure_created{reason="P2PHandshakeError"} 1724854418.0
rlpx_connect_failure_created{reason="ProtocolError"} 1724854418.0
rlpx_connect_failure_created{reason="RlpxHandshakeTransportError"} 1724854418.0
rlpx_connect_failure_created{reason="TransportConnectError"} 1724854418.0
rlpx_connect_failure_created{reason="UselessRlpxPeerError"} 1724854418.0
rlpx_connect_failure_total{reason=""} 37480.0
rlpx_connect_failure_total{reason="P2PHandshakeError"} 2021.0
rlpx_connect_failure_total{reason="ProtocolError"} 1465.0
rlpx_connect_failure_total{reason="RlpxHandshakeTransportError"} 33292.0
rlpx_connect_failure_total{reason="TransportConnectError"} 546.0
rlpx_connect_failure_total{reason="UselessRlpxPeerError"} 156.0
rlpx_connect_success_created 1724854339.0
rlpx_connect_success_total 180.0
> Errors in nimbus-eth1 logs:
cc @mjfh can you take a look at this?
See https://github.com/status-im/nimbus-eth2/blob/unstable/docs/logging.md for our logging levels. In particular, remote nodes doing strange things should never result in any logs above debug level: from the point of view of Nimbus, it is "normal" for remote nodes to misbehave, and we should have logic in place that deals with the misbehavior rather than raising the issue to the user via logs. These are expected conditions (there exist nodes that do strange things), so they are not errors, warnings, or even info.
Exporting era1 can be done like this:
sudo geth --datadir=/docker/geth-mainnet/node/data --mainnet export-history /docker/era1 0 15537393
where 15537393 is the last block before the merge.
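A quick sanity check on that range (assuming the standard era1 layout of 8192 blocks per file): blocks 0..15537393 are 15537394 blocks, so the export should produce ceil(15537394 / 8192) = 1897 files, mainnet-00000 through mainnet-01896, with the last one only partially full.
❯ echo $(( (15537394 + 8191) / 8192 ))
1897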
See also:
Shortcut for era1 files suggested by Jacek: https://era1.ethportal.net
Downloaded to: linux-01.ih-eu-mda1.nimbus.mainnet:/docker/era1
Checksums match the ones they provide.
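For the record, a verification sketch; the checksum file name is an assumption, adjust it to whatever era1.ethportal.net actually publishes:
❯ cd /docker/era1 && sha256sum --check --quiet checksums.txt    # prints only mismatches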
Import from era files should be done like this, I guess:
/docker/nimbus-eth1-mainnet-master/repo/build/nimbus import --era1-dir=/docker/era1 --era-dir=/data/era
At the current speed it should take ~1h to import.
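The nec_import_* metrics quoted above all sit at zero before the import starts, so watching them move is a cheap progress check (same metrics endpoint as earlier):
❯ watch -n 60 "curl -sSf http://0:9401/metrics | grep '^nec_import'"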
RPC API doesn't show much:
❯ ./rpc.sh eth_syncing
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": {
    "startingBlock": "0x0",
    "currentBlock": "0x0",
    "highestBlock": "0x0"
  }
}
Syntactically, this is a valid, if minimalistic, response: https://ethereum.org/en/developers/docs/apis/json-rpc/#eth_syncing
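For comparison, once a node no longer considers itself syncing, the spec says the result should simply be false, i.e. we'd expect:
❯ ./rpc.sh eth_syncing
{
  "jsonrpc": "2.0",
  "id": 1,
  "result": false
}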
https://github.com/status-im/nimbus-eth1/blob/178d77ab310a79f3fa3a350d3546b607145a6aab/nimbus/core/chain/forked_chain.nim#L356-365 sets highestBlock:
proc setHead(c: ForkedChainRef,
             headHash: Hash256,
             number: BlockNumber) =
  # TODO: db.setHead should not read from db anymore
  # all canonical chain marking
  # should be done from here.
  discard c.db.setHead(headHash)
  # update global syncHighest
  c.com.syncHighest = number
but https://github.com/status-im/nimbus-eth1/blob/master/nimbus/nimbus_import.nim never calls setHead(...).
startingBlock is arguably correct.
currentBlock is internally syncCurrent, and is updated from nimbus import by its persistBlocks(...) call, but these syncCurrent/syncHighest/syncStart variables basically only reflect the syncing happening at that time, not anything the nimbus import command did per se. When Nimbus is run after the nimbus import, those are simply never changed from their defaults, because no syncing is happening.
However, it reports syncing because:
server.rpc("eth_syncing") do() -> SyncingStatus:
  ## Returns SyncObject or false when not syncing.
  # TODO: make sure we are not syncing
  # when we reach the recent block
  let numPeers = node.peerPool.connectedNodes.len
  if numPeers > 0:
    var sync = SyncObject(
      startingBlock: w3Qty com.syncStart,
      currentBlock : w3Qty com.syncCurrent,
      highestBlock : w3Qty com.syncHighest
    )
    result = SyncingStatus(syncing: true, syncObject: sync)
  else:
    result = SyncingStatus(syncing: false)
which isn't really correct: having peers does not imply syncing.
So the issue is basically that it's not syncing, but it's falsely showing that it is syncing.
That it's not syncing when connected to an EL is itself a bug, but right now it's an expected, known issue being addressed. I'm not sure I've seen the falsely-showing-syncing part reported before.
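The peer coupling is easy to see from the RPC side (assuming the standard net_peerCount method is exposed):
❯ ./rpc.sh net_peerCount    # e.g. "result": "0x11" (17 peers, matching rlpx_connected_peers in the earlier metrics snapshot)
❯ ./rpc.sh eth_syncing      # reports syncing: true, purely because peers > 0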
Progress
Initially, these don't have to have validators attached to them, but can function as a fourth backing EL in addition to Nethermind, Erigon, and Geth.
Syncing can be facilitated by a combination of era file syncing and/or a pre-prepared database synced close to the current mainnet head.