status-im / infra-nimbus

Infrastructure for Nimbus cluster
https://nimbus.team

Set up a fluffy history bridge #182

Closed: kdeme closed this issue 3 days ago

kdeme commented 3 months ago

In order to gossip Ethereum chain history data into the Portal network we need to run the new fluffy portal_bridge on our infra.

Brief documentation of the portal_bridge can be found here: https://fluffy.guide/history-content-bridging.html#seeding-history-data-with-the-portal_bridge

We basically want to set up:

- a portal_bridge instance that gossips Ethereum chain history data from the era1 files into the Portal network,
- a fluffy node for the bridge to connect to.

The era1 files take about 428 GB of space. The portal_bridge itself does not require any additional storage space. And the fluffy node will store practically nothing in its database with its storage capacity set to 0.

It would be good to have access to the metrics of the Fluffy node to see the gossip stats.

kdeme commented 3 months ago

cc @jakubgs

jakubgs commented 1 month ago

What network is this intended for? Mainnet?

jakubgs commented 1 month ago

Our current nodes run on something called testnet0: https://github.com/status-im/infra-nimbus/blob/9aa1a6cb2e77c483ca01121b1dab7eebeb6b3148/ansible/group_vars/nimbus.fluffy.yml#L3-L5

Appears this was changed by you:

So it's a testnet I guess.

jakubgs commented 1 month ago

The fleet naming has been resolved via PRs from Deme:

jakubgs commented 1 month ago

Some notes after discussing this with Kim:

Looks like I will need an extra SSD for the era1 files.

jakubgs commented 1 month ago

Created a ticket to get an extra 500 GB SSD: https://client.innovahosting.net/viewticket.php?tid=428306&c=tkxOauXp

jakubgs commented 1 month ago

Indeed, it is about 458 GB in total:

 > c https://era1.ethportal.net/ | awk -F'[<> ]' '/kB<\/td>/{count = count + $7}END{print count}' 
458497023
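
For reference, a stand-alone equivalent of that one-liner (`c` is a local alias, presumably for curl; the awk sums the per-file sizes, which the index page lists in kB):

# Sum the "<N> kB" table cells from the era1 index and print the total in GB.
curl -s https://era1.ethportal.net/ \
  | awk -F'[<> ]' '/kB<\/td>/ { total += $7 } END { printf "%.1f GB\n", total / 1e6 }'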
jakubgs commented 1 month ago

Support responded:

I have sent an invoice for an 800GB SAS SSD, as we didn't have 500 GB.

Ah well.

jakubgs commented 1 month ago

Created separate repo for portal bridge:

jakubgs commented 1 month ago

Got the drive:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 physicaldrive all show 

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)

   Array B

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)

   Unassigned

      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 800 GB, OK)

Created a logical volume for it:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 create type=ld drives=2I:1:6
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      logicaldrive 1 (372.58 GB, RAID 0, OK)

   Array B

      logicaldrive 2 (1.46 TB, RAID 0, OK)

   Array C

      logicaldrive 3 (745.19 GB, RAID 0, OK)
jakubgs commented 1 month ago

Mounted:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % df -h /data /era
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        1.5T  1.2T  229G  84% /data
/dev/sdc        733G   28K  696G   1% /era
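
The steps in between were roughly the usual ones (a sketch; assumes the new logical drive showed up as /dev/sdc and ext4 was used, which matches the df output above):

# Confirm which block device the new logical drive got.
lsblk
# Create a filesystem, mount it, and persist the mount across reboots.
sudo mkfs.ext4 -L era /dev/sdc
sudo mkdir -p /era
sudo mount /dev/sdc /era
echo 'LABEL=era /era ext4 defaults,noatime 0 2' | sudo tee -a /etc/fstab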
jakubgs commented 1 month ago

Started downloading the era1 files in a tmux session:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/era % ERA1_URL=https://era1.ethportal.net/
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/era % FILES=$(c "${ERA1_URL}" | awk -F'[<>]' '/<td><p /{print $5}')      
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/era % for FILE in $(echo $FILES); do wget ${ERA1_URL}${FILE}; done
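
A slightly more defensive variant of the same loop (a sketch; wget -c resumes partial files if the tmux session or the connection drops):

ERA1_URL=https://era1.ethportal.net/
# List the era1 file names from the index page (same awk as above).
curl -s "${ERA1_URL}" \
  | awk -F'[<>]' '/<td><p /{print $5}' \
  | while read -r FILE; do
      wget -c -nv "${ERA1_URL}${FILE}"
    done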
jakubgs commented 1 month ago

Have some of the Ansible setup done:

Needs some polish.

jakubgs commented 4 weeks ago

Requested extra storage for both hosts from Innova: https://client.innovahosting.net/viewticket.php?tid=142244&c=VBiP9aEZ

jakubgs commented 3 weeks ago

I got more storage:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 physicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)

   Array B

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)

   Array C

      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 800 GB, OK)

   Unassigned

      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS SSD, 1.6 TB, OK)

Will attempt to perform a migration without losing data.

jakubgs commented 3 weeks ago

Combined the two 1.6 TB drives into one RAID0 logical drive:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive 2 delete
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 create type=ld drives=1I:1:1,2I:1:7 raid=0
jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 physicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS SSD, 400 GB, OK)

   Array B

      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS SSD, 1.6 TB, OK)
      physicaldrive 2I:1:7 (port 2I:box 1:bay 7, SAS SSD, 1.6 TB, OK)

   Array C

      physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS SSD, 800 GB, OK)

Restoring data, will restart nodes in the morning.
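
The filesystem side of that migration, sketched (assumptions: the data was staged somewhere with enough free space beforehand, which is not shown here, and the new 2.9 TB logical drive comes up as /dev/sdb again):

# Services writing to /data were stopped first (unit names omitted).
# Recreate a filesystem on the new, larger logical drive and mount it.
sudo mkfs.ext4 -L data /dev/sdb
sudo mount /dev/sdb /data
# Restore the staged copy; /path/to/backup/ is a placeholder.
sudo rsync -aHAX --info=progress2 /path/to/backup/ /data/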

jakubgs commented 3 weeks ago

Migration for metal-01.ih-eu-mda1.nimbus.fluffy complete:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/etc/consul % df -h / /data /era
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       366G  338G  9.6G  98% /
/dev/sdb        2.9T  1.2T  1.7T  42% /data
/dev/sdc        733G  428G  269G  62% /era
jakubgs commented 3 weeks ago

I've combined the drives and started restoring the data on metal-02:

jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive 2 delete
jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 create type=ld drives=1I:1:1,2I:1:7 raid=0

Warning: SSD Over Provisioning Optimization will be performed on the physical
         drives in this array. This process may take a long time and cause this
         application to appear unresponsive.

jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % sudo ssacli ctrl slot=0 logicaldrive all show

Smart Array P440ar in Slot 0 (Embedded)

   Array A

      logicaldrive 1 (372.58 GB, RAID 0, OK)

   Array B

      logicaldrive 2 (2.91 TB, RAID 0, OK)

jakubgs@metal-02.ih-eu-mda1.nimbus.fluffy:~ % df -h /data
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdb        2.9T   28K  2.8T   1% /data
jakubgs commented 3 weeks ago

All nodes are back online with new storage on both hosts.

jakubgs commented 1 week ago

Some fixes to the service:

And configuration of the EL node:

Now we need a fluffy node to access the RPC port of the portal-bridge node. I assume that's the issue that causes these errors:

Failed to gossip receipts  error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=10292447
jakubgs commented 1 week ago

We are using the Geth EL node on linux-02 from nimbus.mainnet via VPN:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % s cat nimbus-portal-bridge | grep url
  --web3-url=https://linux-02.ih-eu-mda1.nimbus.mainnet.wg:8545 \

Which is accessible:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo nmap -Pn -p8545 linux-02.ih-eu-mda1.nimbus.mainnet.wg
Starting Nmap 7.80 ( https://nmap.org ) at 2024-06-28 12:01 UTC
Nmap scan report for linux-02.ih-eu-mda1.nimbus.mainnet.wg (10.14.0.115)
Host is up (0.00034s latency).

PORT     STATE SERVICE
8545/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.13 seconds

So I'm not sure why this error would happen:

JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host
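
nmap only proves the TCP port is open; a quick way to check whether the endpoint actually answers JSON-RPC is a plain eth_blockNumber call (a sketch):

# Swap http:// for https:// to see whether the endpoint terminates TLS at all.
curl -s -X POST http://linux-02.ih-eu-mda1.nimbus.mainnet.wg:8545 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}'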
jakubgs commented 1 week ago

Turns out these --rpc-* flags are not for listening but for connecting to the fluffy node we will be running:

portal_bridge_44deff9b [OPTIONS]... command

The following options are available:

 --log-level        Sets the log level [=INFO].
 --rpc-address      Listening address of the Portal JSON-RPC server [=127.0.0.1].
 --rpc-port         Listening port of the Portal JSON-RPC server [=8545].
jakubgs commented 3 days ago

Turns out I accidentally used HTTPS and not HTTP for --web3-url, which resulted in errors like this:

Failed to send POST Request with JSON-RPC: Could not connect to remote host, reason:
    (UnsupportedVersion) Incoming protocol or record version is unsupported (code: 3)"

So I fixed that:

But now the errors are:

ERR 2024-07-02 08:49:01.827+00:00 Failed to gossip block header              error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=12900810
ERR 2024-07-02 08:49:01.829+00:00 Failed to gossip block body                error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=12900810
ERR 2024-07-02 08:49:01.832+00:00 Failed to gossip receipts                  error="JSON-RPC error: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockNumber=12900810

But the port is clearly open:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % nmap -Pn -p8545 linux-02.ih-eu-mda1.nimbus.mainnet.wg     
Starting Nmap 7.80 ( https://nmap.org ) at 2024-07-02 08:49 UTC
Nmap scan report for linux-02.ih-eu-mda1.nimbus.mainnet.wg (10.14.0.115)
Host is up (0.00037s latency).

PORT     STATE SERVICE
8545/tcp open  unknown

Nmap done: 1 IP address (1 host up) scanned in 0.03 seconds

But this is probably due to a wrong value for rpc-address and rpc-port.

kdeme commented 3 days ago

@jakubgs This error is related to the JSON-RPC interface of the Fluffy node. The web3 interface appears to be working fine now.

By the way, I have just merged https://github.com/status-im/nimbus-eth1/pull/2437 which changes the CLI option to --portal-rpc-url, similar to --web3-url.

This is the value that gets set for the Fluffy node by the options:

 --rpc-port                HTTP port for the JSON-RPC server [=8545].
 --rpc-address             Listening address of the RPC server [=127.0.0.1].
jakubgs commented 3 days ago

Thanks, I am deploying a fluffy node right now. And thanks for updating the flags; I will use the new format.

jakubgs commented 3 days ago

Done:

It works:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:/etc/consul % s cat nimbus-portal-bridge-history | grep portal-rpc-url
  --portal-rpc-url=http://127.0.0:19900 \
jakubgs commented 3 days ago

The fluffy node appears to be running fine:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % tail -n5 /var/log/service/nimbus-portal-bridge-fluffy/service.log
INF 2024-07-02 14:26:51.175+00:00 Database pruning attempt resulted in no content deleted
INF 2024-07-02 14:26:51.175+00:00 Received offered content validated successfully topics="portal_hist" contentKey=023e1cd20eed09607e98c88827136f6e5d556a9ea6b63614291c8b83959b63d62e
INF 2024-07-02 14:27:39.023+00:00 History network status                     topics="portal_hist" radius=0% dbSize=3973kb routingTableNodes=79
INF 2024-07-02 14:28:39.024+00:00 History network status                     topics="portal_hist" radius=0% dbSize=3973kb routingTableNodes=79
INF 2024-07-02 14:29:39.025+00:00 History network status                     topics="portal_hist" radius=0% dbSize=3973kb routingTableNodes=79

Not sure how to confirm its health. I don't see anything weird on the dashboard, but then again I'm not sure what to look for: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&refresh=5s&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-portal-bridge-fluffy
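
One direct check (a sketch; assumes the standard Portal JSON-RPC methods are exposed on the local endpoint):

# Ask the fluffy node for its history-network routing table; a non-empty
# node list means it is participating in the network.
curl -s -X POST http://127.0.0.1:19900 \
  -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","method":"portal_historyRoutingTableInfo","params":[],"id":1}'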

But despite the node being up and listening on port 19900:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % sudo netstat -lpnt | grep 19900
tcp        0      0 127.0.0.1:19900         0.0.0.0:*               LISTEN      1453731/fluffy 

The portal-bridge history node is still throwing errors about RPC connection:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % tail -n3 /var/log/service/nimbus-portal-bridge-history/service.log
ERR 2024-07-02 14:34:58.215+00:00 Failed to gossip block header              error="JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockHash=7720492a79ee29f955e4e68a11dd2a6ff8f22d8909e146ebd5b3ca8de70c62fb blockNumber=4899534
ERR 2024-07-02 14:34:58.216+00:00 Failed to gossip block body                error="JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockHash=7720492a79ee29f955e4e68a11dd2a6ff8f22d8909e146ebd5b3ca8de70c62fb blockNumber=4899534
ERR 2024-07-02 14:34:58.217+00:00 Failed to gossip receipts                  error="JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host" blockHash=7720492a79ee29f955e4e68a11dd2a6ff8f22d8909e146ebd5b3ca8de70c62fb blockNumber=4899534

Specifically with:

JSON-RPC portal_historyGossip failed: Failed to send POST Request with JSON-RPC: Could not connect to remote host
jakubgs commented 3 days ago

Okay, I see the issue, I ate the last .1 in the address:

--portal-rpc-url=http://127.0.0:19900

Fixed:

And it's running:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % tail -n4 /var/log/service/nimbus-portal-bridge-history/service.log
WRN 2024-07-02 14:39:30.001+00:00 Block gossip took longer than slot interval
INF 2024-07-02 14:39:31.910+00:00 Retrieved block header from Portal network blockHash=b51e1a8db0222b1f7f79b29168f95b69f0728f2036bd1f8e3e4baf017c5e0207 blockNumber=4477800
INF 2024-07-02 14:39:31.934+00:00 Retrieved block body from Portal network   blockNumber=4477800
INF 2024-07-02 14:39:32.978+00:00 Retrieved block receipts from Portal network blockNumber=447780
jakubgs commented 3 days ago

The setup can be seen in the ansible/vars/portal-bridge.yml file: https://github.com/status-im/infra-nimbus/blob/d5ad598a3ace9b6e986ee3fb9b6266e6eb9269a9/ansible/vars/portal-bridge.yml#L2-L26

There are 3 nodes involved:

- the Geth EL node on linux-02.ih-eu-mda1.nimbus.mainnet, reached over the VPN via --web3-url,
- the nimbus-portal-bridge-history node that does the gossiping,
- the nimbus-portal-bridge-fluffy node that provides the Portal JSON-RPC endpoint (--portal-rpc-url).

The metrics for the fluffy node can be found in Grafana: https://metrics.status.im/d/iWQQPuPnkadsf/nimbus-fluffy-dashboard?orgId=1&var-instance=metal-01.ih-eu-mda1.nimbus.fluffy&var-container=nimbus-portal-bridge-fluffy

The logs show the portal bridge node is gossiping:

jakubgs@metal-01.ih-eu-mda1.nimbus.fluffy:~ % grep gossip /var/log/service/nimbus-portal-bridge-history/service.log | tail -n5
INF 2024-07-02 14:40:44.395+00:00 Block body gossiped                        peers=8 contentKey=01409746073717fe48bf995ded2ddda19b2b8449a31d92ed3a82c6a5ca4e512524
INF 2024-07-02 14:40:45.456+00:00 Receipts gossiped                          peers=8 contentKey=02409746073717fe48bf995ded2ddda19b2b8449a31d92ed3a82c6a5ca4e512524
INF 2024-07-02 14:40:53.756+00:00 Block header gossiped                      peers=8 contentKey=006dd1c2a3e825b5be39a4ca85f36b09655969507d7593b51677ea05a294443aab
INF 2024-07-02 14:40:56.516+00:00 Block body gossiped                        peers=8 contentKey=016dd1c2a3e825b5be39a4ca85f36b09655969507d7593b51677ea05a294443aab
INF 2024-07-02 14:40:56.989+00:00 Receipts gossiped                          peers=8 contentKey=026dd1c2a3e825b5be39a4ca85f36b09655969507d7593b51677ea05a294443aab
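
A rough way to keep an eye on the gossip success/failure ratio over time (a sketch, using the log strings seen above):

# Successful gossips vs. failures in the current bridge log.
grep -c 'gossiped' /var/log/service/nimbus-portal-bridge-history/service.log
grep -c 'Failed to gossip' /var/log/service/nimbus-portal-bridge-history/service.log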

I consider this task completed.