cc @jakubgs
What network is this intended for? Mainnet?
Our current nodes run on something called testnet0:
https://github.com/status-im/infra-nimbus/blob/9aa1a6cb2e77c483ca01121b1dab7eebeb6b3148/ansible/group_vars/nimbus.fluffy.yml#L3-L5
Appears this was changed by you:
infra-nimbus#e0140a14
- nimbus.fluffy: drop bootstrap nodes, use network flag
So it's a testnet I guess.
The fleet naming has been resolved via PRs from Deme:
Some notes after discussing this with Kim:
- portal_bridge is a long-running service that will talk to two services and use era1 files:
  - An EL node, accessed via the --web3-url flag.
  - A fluffy node run with --storage-capacity:0, which the portal bridge will use via --rpc-address.
  - .era1 files from era1.ethportal.net, which are an archive of EL blocks from before the merge.
- The EL node will come from the nimbus.mainnet fleet. Multi-EL might be supported later.
- Running on the nimbus.fluffy hosts is fine but we might have to migrate later.
- The era1 files are static and are not expected to change.
Looks like I will need an extra SSD for the era1 files.
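Putting these notes together, a minimal sketch of how the pieces would be wired (the era1 path, port, and web3 URL here are placeholders, not the final config):
# Fluffy node with storage disabled, exposing JSON-RPC for the bridge:
./build/fluffy --rpc --metrics --storage-capacity:0
# Portal bridge reading local era1 files, talking to the EL node and the fluffy node:
./build/portal_bridge history --era1-dir:/path/to/era1/ \
  --web3-url:http://<el-node>:8545 --rpc-address:127.0.0.1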
Created a ticket to get an extra 500 GB SSD: https://client.innovahosting.net/viewticket.php?tid=428306&c=tkxOauXp
Indeed, it is about 458 GB in total:
> c https://era1.ethportal.net/ | awk -F'[<> ]' '/kB<\/td>/{count = count + $7}END{print count}'
458497023
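For the record, the same check spelled out, assuming c is a shell alias for curl; 458,497,023 kB is roughly 458 GB (decimal), i.e. about 427 GiB, which matches the ~428G that df reports later once everything is downloaded:
curl -s https://era1.ethportal.net/ \
  | awk -F'[<> ]' '/kB<\/td>/{total += $7} END{printf "%d kB (~%.0f GB)\n", total, total / 1e6}'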
Support responded:
I have sent an invoice for an 800GB SAS SSD, as we didn't have 500 GB.
Ah well.
Created separate repo for portal bridge:
infra-repos#93a061f2
- add infra-role-portal-bridge repo
Got the drive:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 physicaldrive all show
Smart Array P440ar in Slot 0 (Embedded)
Array A
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)
Array B
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)
Unassigned
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 800 GB, OK)
Created a logical volume for it:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 create type=ld drives=2I:1:6
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive all show
Smart Array P440ar in Slot 0 (Embedded)
Array A
logicaldrive 1 (372.58 GB, RAID 0, OK)
Array B
logicaldrive 2 (1.46 TB, RAID 0, OK)
Array C
logicaldrive 3 (745.19 GB, RAID 0, OK)
Mounted:
infra-nimbus#fc21ebbc
- fluffy: mount second volume under /era
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % df -h /data /era
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 1.5T 1.2T 229G 84% /data
/dev/sdc 733G 28K 696G 1% /era
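For reference, the usual steps behind such a mount, as a sketch of what the infra-nimbus change presumably automates (the ext4 filesystem is an assumption; /dev/sdc matches the df output above):
sudo mkfs.ext4 /dev/sdc                                            # format the new 800 GB logical drive
sudo mkdir -p /era && sudo mount /dev/sdc /era                     # mount it under /era
echo '/dev/sdc /era ext4 defaults 0 2' | sudo tee -a /etc/fstab    # persist across reboots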
Started downloading the era1 files in a tmux session:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/era % ERA1_URL=https://era1.ethportal.net/
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/era % FILES=$(c "${ERA1_URL}" | awk -F'[<>]' '/<td><p /{print $5}')
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/era % for FILE in $(echo $FILES); do wget ${ERA1_URL}${FILE}; done
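In case the transfer gets interrupted, a resumable variant of the same loop (again assuming c is an alias for curl), with a simple count check at the end:
ERA1_URL=https://era1.ethportal.net/
FILES=$(curl -s "${ERA1_URL}" | awk -F'[<>]' '/<td><p /{print $5}')
for FILE in ${FILES}; do
  wget -c "${ERA1_URL}${FILE}"    # -c resumes partial downloads instead of starting over
done
echo "local: $(ls -1 *.era1 2>/dev/null | wc -l) files, listed upstream: $(echo "${FILES}" | wc -l)"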
Have some of the Ansible setup done:
infra-role-portal-bridge#3643347f
- add role metadata
infra-role-portal-bridge#bfa508d2
- add initial version of role config
Needs some polish.
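A minimal sketch of the command line the role is meant to render, combining the flags from the issue description with the /era mount from above (the actual defaults and service wrapper may differ):
./build/portal_bridge history \
  --latest:true --backfill:true --audit:true \
  --era1-dir:/era \
  --web3-url:${WEB3_URL}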
Requested extra storage for both hosts from Innova: https://client.innovahosting.net/viewticket.php?tid=142244&c=VBiP9aEZ
I got more storage:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 physicaldrive all show
Smart Array P440ar in Slot 0 (Embedded)
Array A
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)
Array B
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)
Array C
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 800 GB, OK)
Unassigned
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS SSD, 1.6 TB, OK)
Will attempt to perform a migration without losing data.
Combined two 1.6 TB volumes into one RAID0:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive 2 delete
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 create type=ld drives=1I:1:1,2I:1:7 raid=0
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 physicaldrive all show
Smart Array P440ar in Slot 0 (Embedded)
Array A
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)
Array B
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)
physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS SSD, 1.6 TB, OK)
Array C
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 800 GB, OK)
Restoring data, will restart nodes in the morning.
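For the record, the rough sequence for the in-place migration (a sketch; where the data was staged during the rebuild is not shown here, so the backup target is a placeholder, and the filesystem type is assumed):
sudo rsync -a /data/ <backup-target>/                                 # 1. stage the data somewhere with enough space
sudo ssacli ctrl slot=0 logicaldrive 2 delete                         # 2. drop the old 1.6 TB logical drive
sudo ssacli ctrl slot=0 create type=ld drives=1I:1:1,2I:1:7 raid=0    # 3. recreate as RAID0 across both drives
sudo mkfs.ext4 /dev/sdb && sudo mount /dev/sdb /data                  # 4. re-format and re-mount
sudo rsync -a <backup-target>/ /data/                                 # 5. restore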
Migration for metal-01.ih-eu-mda1.nimbus.fluffy complete:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/etc/consul % df -h / /data /era
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 366G 338G 9.6G 98% /
/dev/sdb 2.9T 1.2T 1.7T 42% /data
/dev/sdc 733G 428G 269G 62% /era
I've combined the drives and started restoring the data on metal-02:
jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive 2 delete
jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 create type=ld drives=1I:1:1,2I:1:7 raid=0
Warning: SSD Over Provisioning Optimization will be performed on the physical
drives in this array. This process may take a long time and cause this
application to appear unresponsive.
jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive all show
Smart Array P440ar in Slot 0 (Embedded)
Array A
logicaldrive 1 (372.58 GB, RAID 0, OK)
Array B
logicaldrive 2 (2.91 TB, RAID 0, OK)
jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/sdb 2.9T 28K 2.8T 1% /data
All nodes are back online with new storage on both hosts.
Some fixes to the service:
infra-role-portal-bridge#439c8751
- build: fix commit variable usage
infra-role-portal-bridge#cf1b3286
- service: fix format and order of portal_bridge flags
infra-role-portal-bridge#e5420ebc
- defaults: fix backfill default to be false
infra-role-portal-bridge#d1772970
- service: fix variable for latest flag
And configuration of the EL node:
infra-nimbus#5fc1d1f4
- requirements: include porta-bridge service fixes
infra-nimbus#17d767c5
- fluffy: deploy portal-bridge instance
Now we need a fluffy node to access the RPC port of the portal-bridge node. I assume that's the issue that causes these errors:
Failed to gossip receipts error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=10292447
We are using the Geth EL node from linux-02 in nimbus.mainnet via VPN:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % s cat nimbus-portal-bridge | grep url
--web3-url=https://linux-02.ih-eu-mda1.nimbus.mainnet.wg:8545 \
Which is accessible:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo nmap -Pn -p8545 linux-02.ih-eu-mda1.nimbus.mainnet.wg
Starting Nmap 7.80 ( https://nmap.org ) at 2024-06-28 12:01 UTC
Nmap scan report for linux-02.ih-eu-mda1.nimbus.mainnet.wg (10.14.0.115)
Host is up (0.00034s latency).
PORT STATE SERVICE
8545/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.13 seconds
So I'm not sure why this error would happen:
JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host
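Since nmap only proves the TCP port is open, a more direct check would be to POST an actual JSON-RPC request to the endpoint; any JSON response at all rules out a connection-level problem (web3_clientVersion is a standard method Geth answers):
curl -s -X POST http://linux-02.ih-eu-mda1.nimbus.mainnet.wg:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"web3_clientVersion","params":[],"id":1}'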
Turns out these --rpc-* flags are not for listening but for connecting to the fluffy node we will be running:
portal_bridge_44deff9b [OPTIONS]... command
The following options are available:
--log-level Sets the log level [=INFO].
--rpc-address Listening address of the Portal JSON-RPC server [=127.0.0.1].
--rpc-port Listening port of the Portal JSON-RPC server [=8545].
Turns out I accidentally used HTTPS and not HTTP for --web3-url, which resulted in errors like this:
Failed to send POST Request with JSON-RPC: Could not connect to remote host, reason:
(UnsupportedVersion) Incoming protocol or record version is unsupported (code: 3)"
So I fixed that:
infra-nimbus#a9918989
- portal-bridge: move config to separate vars file
infra-nimbus#e2c8ce01
- portal-bridge: fix web3 URL to not use HTTPS
But now the errors are:
ERR 2024-07-02 08:49:01.827+00:00 Failed to gossip block header error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=12900810
ERR 2024-07-02 08:49:01.829+00:00 Failed to gossip block body error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=12900810
ERR 2024-07-02 08:49:01.832+00:00 Failed to gossip receipts error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=12900810
But port is clearly open:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % nmap -Pn -p8545 linux-02.ih-eu-mda1.nimbus.mainnet.wg
Starting Nmap 7.80 ( https://nmap.org ) at 2024-07-02 08:49 UTC
Nmap scan report for linux-02.ih-eu-mda1.nimbus.mainnet.wg (10.14.0.115)
Host is up (0.00037s latency).
PORT STATE SERVICE
8545/tcp open unknown
Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds
But this is probably due to a wrong value for rpc-address and rpc-port.
@jakubgs This error is related to the JSON-RPC interface of the Fluffy node. The web3 interface appears to be working fine now.
By the way, I have just merged https://github.com/status-im/nimbus-eth1/pull/2437 which changes the CLI option to --portal-rpc-url, similar to --web3-url.
This is the value that gets set for the Fluffy node by the options:
--rpc-port HTTP port for the JSON-RPC server [=8545].
--rpc-address Listening address of the RPC server [=127.0.0.1].
Thanks, I am deploying a fluffy node right now. Thanks for updating the flags, will use the new format.
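To line the two sides up: the fluffy node listens on --rpc-address/--rpc-port, and the bridge points at that same address via the new --portal-rpc-url. A rough sketch with the values used on this host (port 19900 as deployed here, the rest assembled from the flags already mentioned in this thread):
# fluffy: the Portal JSON-RPC server the bridge will talk to
./build/fluffy --rpc --rpc-address:127.0.0.1 --rpc-port:19900 --storage-capacity:0 --metrics
# portal_bridge: connects to fluffy via --portal-rpc-url and to the EL node via --web3-url
./build/portal_bridge history --portal-rpc-url:http://127.0.0.1:19900 \
  --web3-url:http://linux-02.ih-eu-mda1.nimbus.mainnet.wg:8545 --era1-dir:/era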
Done:
infra-role-portal-bridge#4f72fd21
- service: use new --portal-rpc-url flag
infra-role-portal-bridge#4fbac9de
- consul: change check to script from tcp
infra-nimbus#95def753
- portal: add open-ports to expose metrics
infra-nimbus#99163a81
- portal: open listening port for fluffy node
infra-nimbus#bedcd886
- portal: use new portal-rpc-url flag
It works:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/etc/consul % s cat nimbus-portal-bridge-history | grep portal-rpc-url
--portal-rpc-url=http://127.0.0:19900 \
The fluffy node appears to be running fine:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % tail -n5 /var/log/service/nimbus-portal-bridge-fluffy/service.log
INF 2024-07-02 14:26:51.175+00:00 Database pruning attempt resulted in no content deleted
INF 2024-07-02 14:26:51.175+00:00 Received offered content validated successfully topics="portal_hist" contentKey=023e1cd20eed09607e98c88827136f6e5d556a9ea6b63614291c8b83959b63d62e
INF 2024-07-02 14:27:39.023+00:00 History network status topics="portal_hist" radius=0% dbSize=3973kb routingTableNodes=79
INF 2024-07-02 14:28:39.024+00:00 History network status topics="portal_hist" radius=0% dbSize=3973kb routingTableNodes=79
INF 2024-07-02 14:29:39.025+00:00 History network status topics="portal_hist" radius=0% dbSize=3973kb routingTableNodes=79
Not sure how to confirm its health. I don't see anything weird on the dashboard, but then again I'm not sure what to look for: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&refresh=5s&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-portal-bridge-fluffy
But despite the node being up and listening on port 19900:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo netstat -lpnt | grep 19900
tcp 0 0 127.0.0.1:19900 0.0.0.0:* LISTEN 1453731/fluffy
The portal-bridge history node is still throwing errors about the RPC connection:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % tail -n3 /var/log/service/nimbus-portal-bridge-history/service.log
ERR 2024-07-02 14:34:58.215+00:00 Failed to gossip block header error="JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockHash=7720492a79ee29f955e4e68a11dd2a6ff8f22d8909e146ebd5b3ca8de70c62fb blockNumber=4899534
ERR 2024-07-02 14:34:58.216+00:00 Failed to gossip block body error="JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockHash=7720492a79ee29f955e4e68a11dd2a6ff8f22d8909e146ebd5b3ca8de70c62fb blockNumber=4899534
ERR 2024-07-02 14:34:58.217+00:00 Failed to gossip receipts error="JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockHash=7720492a79ee29f955e4e68a11dd2a6ff8f22d8909e146ebd5b3ca8de70c62fb blockNumber=4899534
Specifically with:
JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host
Okay, I see the issue, I ate the last .1 in the address:
--portal-rpc-url=http://127.0.0:19900
Fixed:
infra-nimbus#d5ad598a
- portal: fix portal RPC address to fluffy node
And it's running:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % tail -n4 /var/log/service/nimbus-portal-bridge-history/service.log
WRN 2024-07-02 14:39:30.001+00:00 Block gossip took longer than slot interval
INF 2024-07-02 14:39:31.910+00:00 Retrieved block header from Portal network blockHash=b51e1a8db0222b1f7f79b29168f95b69f0728f2036bd1f8e3e4baf017c5e0207 blockNumber=4477800
INF 2024-07-02 14:39:31.934+00:00 Retrieved block body from Portal network blockNumber=4477800
INF 2024-07-02 14:39:32.978+00:00 Retrieved block receipts from Portal network blockNumber=447780
The setup can be seen in the ansible/vars/portal-bridge.yml file:
https://github.com/status-im/infra-nimbus/blob/d5ad598a3ace9b6e986ee3fb9b6266e6eb9269a9/ansible/vars/portal-bridge.yml#L2-L26
There are 3 nodes involved:
- geth-mainnet-node @linux-02.ih-eu-mda1.nimbus.mainnet - the Geth EL node used via the web3-url flag.
- nimbus-portal-bridge-fluffy-metrics @metal-01.ih-eu-mda1.nimbus.fluffy - the fluffy node used via portal-rpc-url.
- nimbus-portal-bridge-history @metal-01.ih-eu-mda1.nimbus.fluffy - the portal bridge node using the above nodes.
The metrics for the fluffy node can be found in Grafana: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-portal-bridge-fluffy
The logs show the portal bridge node is gossiping:
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % grep gossip /var/log/service/nimbus-portal-bridge-history/service.log | tail -n5
INF 2024-07-02 14:40:44.395+00:00 Block body gossiped peers=8 contentKey=01409746073717fe48bf995ded2ddda19b2b8449a31d92ed3a82c6a5ca4e512524
INF 2024-07-02 14:40:45.456+00:00 Receipts gossiped peers=8 contentKey=02409746073717fe48bf995ded2ddda19b2b8449a31d92ed3a82c6a5ca4e512524
INF 2024-07-02 14:40:53.756+00:00 Block header gossiped peers=8 contentKey=006dd1c2a3e825b5be39a4ca85f36b09655969507d7593b51677ea05a294443aab
INF 2024-07-02 14:40:56.516+00:00 Block body gossiped peers=8 contentKey=016dd1c2a3e825b5be39a4ca85f36b09655969507d7593b51677ea05a294443aab
INF 2024-07-02 14:40:56.989+00:00 Receipts gossiped peers=8 contentKey=026dd1c2a3e825b5be39a4ca85f36b09655969507d7593b51677ea05a294443aab
I consider this task completed.
In order to gossip Ethereum chain history data into the Portal network we need to run the new fluffy portal_bridge on our infra.
Brief documentation of the portal_bridge is here: https://fluffy.guide/history-content-bridging.html#seeding-history-data-with-the-portal_bridge
We basically want to set up:
./build/fluffy --metrics --rpc --storage-capacity:0
./build/portal_bridge history --latest:true --backfill:true --audit:true --era1-dir:/somedir/era1/ --web3-url:${WEB3_URL}
portal_bridge also needs file access to all the era1 files. These can be found here: https://era1.ethportal.net/
The era1 files take about 428 GB of space. The portal_bridge itself does not require any additional storage space. And the fluffy node will store practically nothing in its database with storage capacity set to 0.
It would be good to have access to the metrics of the Fluffy node to see the gossip stats.
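A quick pre-flight check before starting the bridge would be to confirm the local era1 set is complete (the /somedir/era1/ path is just the placeholder from the commands above):
du -sh /somedir/era1/                # should come out to roughly 428G once everything is downloaded
ls -1 /somedir/era1/*.era1 | wc -l   # number of era1 files present locally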