status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Set up a fluffy history bridge #182

Closed: kdeme closed this issue 3 days ago

kdeme commented 3 months ago

In order to gossip Ethereum chain history data into the Portal network we need to run the new fluffy portal_bridge on our infra.

Brief documentation of the portal_bridge can be found here: https://fluffy.guide/history-content-bridging.html#seeding-history-data-with-the-portal_bridge

We basically want to set up:

- a portal_bridge instance that gossips Ethereum chain history data from the era1 files into the Portal network,
- a fluffy node for the bridge to connect to.

The era1 files take about 428 GB of space. The portal_bridge itself does not require any additional storage space. And the fluffy node will store practically nothing in its database with its storage capacity set to 0.

It would be good to have access to the metrics of the Fluffy node to see the gossip stats.

kdeme commented 3 months ago

cc @jakubgs

jakubgs commented 1 month ago

What network is this intended for? Mainnet?

jakubgs commented 1 month ago

Our current nodes run on something called testnet0: https://github.com/status-im/infra-nimbus/blob/9aa1a6cb2e77c483ca01121b1dab7eebeb6b3148/ansible/group_vars/nimbus.fluffy.yml#L3-L5

Appears this was changed by you:

So it's a testnet I guess.

jakubgs commented 1 month ago

The fleet naming has been resolved via PRs from Deme:

jakubgs commented 1 month ago

Some notes after discussing this with Kim:

Looks like I will need an extra SSD for the era1 files.

jakubgs commented 1 month ago

Created a ticket to get an extra 500 GB SSD: https://client.innovahosting.net/viewticket.php?tid=428306&c=tkxOauXp

jakubgs commented 1 month ago

Indeed, it is about 458 GB in total:

 > c https://era1.ethportal.net/ | awk -F'[<> ]' '/kB<\/td>/{count = count + $7}END{print count}' 
458497023
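
For reference, a stand-alone equivalent of that one-liner (`c` is a local alias, presumably for curl; the awk sums the per-file sizes, which the index page lists in kB):

# Sum the "<N> kB" table cells from the era1 index and print the total in GB.
curl -s https://era1.ethportal.net/ \
  | awk -F'[<> ]' '/kB<\/td>/ { total += $7 } END { printf "%.1f GB\n", total / 1e6 }'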
jakubgs commented 1 month ago

Support responded:

I have sent an invoice for an 800GB SAS SSD, as we didn't have 500 GB.

Ah well.

jakubgs commented 1 month ago

Created separate repo for portal bridge:

jakubgs commented 1 month ago

Got the drive:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 physicaldrive all show 

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)

   Array B

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)

   Unassigned

      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 800 GB, OK)

Created a logical volume for it:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 create type=ld drives=2I:1:6
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      logicaldrive 1 (372.58 GB, RAID 0, OK)

   Array B

      logicaldrive 2 (1.46 TB, RAID 0, OK)

   Array C

      logicaldrive 3 (745.19 GB, RAID 0, OK)
jakubgs commented 1 month ago

Mounted:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % df -h /data /era
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        1.5T  1.2T  229G  84% /data
/dev/sdc        733G   28K  696G   1% /era
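
The steps in between were roughly the usual ones (a sketch; assumes the new logical drive showed up as /dev/sdc and ext4 was used, which matches the df output above):

# Confirm which block device the new logical drive got.
lsblk
# Create a filesystem, mount it, and persist the mount across reboots.
sudo mkfs.ext4 -L era /dev/sdc
sudo mkdir -p /era
sudo mount /dev/sdc /era
echo 'LABEL=era /era ext4 defaults,noatime 0 2' | sudo tee -a /etc/fstab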
jakubgs commented 1 month ago

Started downloading the era1 files in a tmux session:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/era % ERA1_URL=https://era1.ethportal.net/
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/era % FILES=$(c "${ERA1_URL}" | awk -F'[<>]' '/<td><p /{print $5}')      
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/era % for FILE in $(echo $FILES); do wget ${ERA1_URL}${FILE}; done
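
A slightly more defensive variant of the same loop (a sketch; wget -c resumes partial files if the tmux session or the connection drops):

ERA1_URL=https://era1.ethportal.net/
# List the era1 file names from the index page (same awk as above).
curl -s "${ERA1_URL}" \
  | awk -F'[<>]' '/<td><p /{print $5}' \
  | while read -r FILE; do
      wget -c -nv "${ERA1_URL}${FILE}"
    done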
jakubgs commented 1 month ago

Have some of the Ansible setup done:

Needs some polish.

jakubgs commented 4 weeks ago

Requested extra storage for both hosts from Innova: https://client.innovahosting.net/viewticket.php?tid=142244&c=VBiP9aEZ

jakubgs commented 3 weeks ago

I got more storage:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 physicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)

   Array B

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)

   Array C

      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 800 GB, OK)

   Unassigned

      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS SSD, 1.6 TB, OK)

Will attempt to perform a migration without losing data.

jakubgs commented 3 weeks ago

Combined the two 1.6 TB drives into one RAID0 logical drive:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive 2 delete
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 create type=ld drives=1I:1:1,2I:1:7 raid=0
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 physicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)

   Array B

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)
      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS SSD, 1.6 TB, OK)

   Array C

      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 800 GB, OK)

Restoring data, will restart nodes in the morning.
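
The filesystem side of that migration, sketched (assumptions: the data was staged somewhere with enough free space beforehand, which is not shown here, and the new 2.9 TB logical drive comes up as /dev/sdb again):

# Services writing to /data were stopped first (unit names omitted).
# Recreate a filesystem on the new, larger logical drive and mount it.
sudo mkfs.ext4 -L data /dev/sdb
sudo mount /dev/sdb /data
# Restore the staged copy; /path/to/backup/ is a placeholder.
sudo rsync -aHAX --info=progress2 /path/to/backup/ /data/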

jakubgs commented 3 weeks ago

Migration for metal-01.ih-eu-mda1.nimbus.fluffy complete:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/etc/consul % df -h / /data /era
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       366G  338G  9.6G  98% /
/dev/sdb        2.9T  1.2T  1.7T  42% /data
/dev/sdc        733G  428G  269G  62% /era
jakubgs commented 3 weeks ago

I've combined the drives and started restoring the data on metal-02:

jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive 2 delete
jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 create type=ld drives=1I:1:1,2I:1:7 raid=0

Warning: SSD Over Provisioning Optimization will be performed on the physical
         drives in this array. This process may take a long time and cause this
         application to appear unresponsive.

jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      logicaldrive 1 (372.58 GB, RAID 0, OK)

   Array B

      logicaldrive 2 (2.91 TB, RAID 0, OK)

jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.9T   28K  2.8T   1% /data
jakubgs commented 3 weeks ago

All nodes are back online with new storage on both hosts.

jakubgs commented 1 week ago

Some fixes to the service:

And configuration of the EL node:

Now we need a fluffy node to access the RPC port of the portal-bridge node. I assume that's the issue that causes these errors:

Failed to gossip receipts  error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=10292447
jakubgs commented 1 week ago

We are using the Geth EL node on linux-02 from nimbus.mainnet via VPN:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % s cat nimbus-portal-bridge | grep url
  --web3-url=https://linux-02.ih-eu-mda1.nimbus.mainnet.wg:8545 \

Which is accessible:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo nmap -Pn -p8545 linux-02.ih-eu-mda1.nimbus.mainnet.wg
Starting Nmap 7.80 ( https://nmap.org ) at 2024-06-28 12:01 UTC
Nmap scan report for linux-02.ih-eu-mda1.nimbus.mainnet.wg (10.14.0.115)
Host is up (0.00034s latency).

PORT     STATE SERVICE
8545/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.13 seconds

So I'm not sure why this error would happen:

JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host
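
nmap only proves the TCP port is open; a quick way to check whether the endpoint actually answers JSON-RPC is a plain eth_blockNumber call (a sketch):

# Swap http:// for https:// to see whether the endpoint terminates TLS at all.
curl -s -X POST http://linux-02.ih-eu-mda1.nimbus.mainnet.wg:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'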
jakubgs commented 1 week ago

Turns out these --rpc-* flags are not for listening but for connecting to the fluffy node we will be running:

portal_bridge_44deff9b [OPTIONS]... command

The following options are available:

 --log-level        Sets the log level [=INFO].
 --rpc-address      Listening address of the Portal JSON-RPC server [=127.0.0.1].
 --rpc-port         Listening port of the Portal JSON-RPC server [=8545].
jakubgs commented 3 days ago

Turns out I accidentally used HTTPS and not HTTP for --web3-url, which resulted in errors like this:

Failed to send POST Request with JSON-RPC: Could not connect to remote host, reason:
    (UnsupportedVersion) Incoming protocol or record version is unsupported (code: 3)"

So I fixed that:

But now the errors are:

ERR 2024-07-02 08:49:01.827+00:00 Failed to gossip block header              error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=12900810
ERR 2024-07-02 08:49:01.829+00:00 Failed to gossip block body                error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=12900810
ERR 2024-07-02 08:49:01.832+00:00 Failed to gossip receipts                  error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=12900810

But the port is clearly open:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % nmap -Pn -p8545 linux-02.ih-eu-mda1.nimbus.mainnet.wg     
Starting Nmap 7.80 ( https://nmap.org ) at 2024-07-02 08:49 UTC
Nmap scan report for linux-02.ih-eu-mda1.nimbus.mainnet.wg (10.14.0.115)
Host is up (0.00037s latency).

PORT     STATE SERVICE
8545/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds

But this is probably due to a wrong value for rpc-address and rpc-port.

kdeme commented 3 days ago

@jakubgs This error is related to the JSON-RPC interface of the Fluffy node. The web3 interface appears to be working fine now.

By the way, I have just merged https://github.com/status-im/nimbus-eth1/pull/2437 which changes the CLI option to --portal-rpc-url, similar to --web3-url.

This is the value that gets set for the Fluffy node by the options:

 --rpc-port                HTTP port for the JSON-RPC server [=8545].
 --rpc-address             Listening address of the RPC server [=127.0.0.1].
jakubgs commented 3 days ago

Thanks, I am deploying a fluffy node right now. And thanks for updating the flags; I will use the new format.

jakubgs commented 3 days ago

Done:

It works:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/etc/consul % s cat nimbus-portal-bridge-history | grep portal-rpc-url
  --portal-rpc-url=http://127.0.0:19900 \
jakubgs commented 3 days ago

The fluffy node appears to be running fine:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % tail -n5 /var/log/service/nimbus-portal-bridge-fluffy/service.log
INF 2024-07-02 14:26:51.175+00:00 Database pruning attempt resulted in no content deleted
INF 2024-07-02 14:26:51.175+00:00 Received offered content validated successfully topics="portal_hist" contentKey=023e1cd20eed09607e98c88827136f6e5d556a9ea6b63614291c8b83959b63d62e
INF 2024-07-02 14:27:39.023+00:00 History network status                     topics="portal_hist" radius=0% dbSize=3973kb routingTableNodes=79
INF 2024-07-02 14:28:39.024+00:00 History network status                     topics="portal_hist" radius=0% dbSize=3973kb routingTableNodes=79
INF 2024-07-02 14:29:39.025+00:00 History network status                     topics="portal_hist" radius=0% dbSize=3973kb routingTableNodes=79

Not sure how to confirm its health. I don't see anything weird on the dashboard, but then again I'm not sure what to look for: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&refresh=5s&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-portal-bridge-fluffy
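
One direct check (a sketch; assumes the standard Portal JSON-RPC methods are exposed on the local endpoint):

# Ask the fluffy node for its history-network routing table; a non-empty
# node list means it is participating in the network.
curl -s -X POST http://127.0.0.1:19900 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"portal_historyRoutingTableInfo","params":[],"id":1}'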

But despite the node being up and listening on port 19900:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo netstat -lpnt | grep 19900
tcp        0      0 127.0.0.1:19900         0.0.0.0:*               LISTEN      1453731/fluffy 

The portal-bridge history node is still throwing errors about RPC connection:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % tail -n3 /var/log/service/nimbus-portal-bridge-history/service.log
ERR 2024-07-02 14:34:58.215+00:00 Failed to gossip block header              error="JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockHash=7720492a79ee29f955e4e68a11dd2a6ff8f22d8909e146ebd5b3ca8de70c62fb blockNumber=4899534
ERR 2024-07-02 14:34:58.216+00:00 Failed to gossip block body                error="JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockHash=7720492a79ee29f955e4e68a11dd2a6ff8f22d8909e146ebd5b3ca8de70c62fb blockNumber=4899534
ERR 2024-07-02 14:34:58.217+00:00 Failed to gossip receipts                  error="JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockHash=7720492a79ee29f955e4e68a11dd2a6ff8f22d8909e146ebd5b3ca8de70c62fb blockNumber=4899534

Specifically with:

JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host
jakubgs commented 3 days ago

Okay, I see the issue, I ate the last .1 in the address:

--portal-rpc-url=http://127.0.0:19900

Fixed:

And it's running:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % tail -n4 /var/log/service/nimbus-portal-bridge-history/service.log
WRN 2024-07-02 14:39:30.001+00:00 Block gossip took longer than slot interval
INF 2024-07-02 14:39:31.910+00:00 Retrieved block header from Portal network blockHash=b51e1a8db0222b1f7f79b29168f95b69f0728f2036bd1f8e3e4baf017c5e0207 blockNumber=4477800
INF 2024-07-02 14:39:31.934+00:00 Retrieved block body from Portal network   blockNumber=4477800
INF 2024-07-02 14:39:32.978+00:00 Retrieved block receipts from Portal network blockNumber=447780
jakubgs commented 3 days ago

The setup can be seen in the ansible/vars/portal-bridge.yml file: https://github.com/status-im/infra-nimbus/blob/d5ad598a3ace9b6e986ee3fb9b6266e6eb9269a9/ansible/vars/portal-bridge.yml#L2-L26

There are 3 nodes involved:

- the Geth EL node on linux-02.ih-eu-mda1.nimbus.mainnet, reached over the VPN via --web3-url,
- the nimbus-portal-bridge-history node that does the gossiping,
- the nimbus-portal-bridge-fluffy node that provides the Portal JSON-RPC endpoint (--portal-rpc-url).

The metrics for the fluffy node can be found in Grafana: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-portal-bridge-fluffy

The logs show the portal bridge node is gossiping:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % grep gossip /var/log/service/nimbus-portal-bridge-history/service.log | tail -n5
INF 2024-07-02 14:40:44.395+00:00 Block body gossiped                        peers=8 contentKey=01409746073717fe48bf995ded2ddda19b2b8449a31d92ed3a82c6a5ca4e512524
INF 2024-07-02 14:40:45.456+00:00 Receipts gossiped                          peers=8 contentKey=02409746073717fe48bf995ded2ddda19b2b8449a31d92ed3a82c6a5ca4e512524
INF 2024-07-02 14:40:53.756+00:00 Block header gossiped                      peers=8 contentKey=006dd1c2a3e825b5be39a4ca85f36b09655969507d7593b51677ea05a294443aab
INF 2024-07-02 14:40:56.516+00:00 Block body gossiped                        peers=8 contentKey=016dd1c2a3e825b5be39a4ca85f36b09655969507d7593b51677ea05a294443aab
INF 2024-07-02 14:40:56.989+00:00 Receipts gossiped                          peers=8 contentKey=026dd1c2a3e825b5be39a4ca85f36b09655969507d7593b51677ea05a294443aab
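
A rough way to keep an eye on the gossip success/failure ratio over time (a sketch, using the log strings seen above):

# Successful gossips vs. failures in the current bridge log.
grep -c 'gossiped' /var/log/service/nimbus-portal-bridge-history/service.log
grep -c 'Failed to gossip' /var/log/service/nimbus-portal-bridge-history/service.log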

I consider this task completed.