status-im / infra-status-legacy

Infrastructure for old Status fleet

Waku v1 <> v2 bridge deployment #4

Closed: jm-clius closed this issue 2 years ago

jm-clius commented 2 years ago

The Waku v2 integration effort into Status requires deployment of a bridge between Waku v1 and Waku v2. nim-waku includes a beta version of such a bridge that can be deployed. It helps to see the bridge as a stripped-down version of both a Waku v1 and a Waku v2 client. It supports many of the same configuration options as both, usually with a -v1 or -v2 suffix to differentiate version-specific config. A tutorial for bridge installation is included in the nim-waku docs.
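
For orientation, the build boils down to something like this (a minimal sketch, assuming a standard build environment with git, make and a C compiler; the tutorial in the docs has the full steps, e.g. fetching submodules):

git clone https://github.com/status-im/nim-waku
cd nim-waku
make bridge   # produces ./build/wakubridge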

What should be deployed?

A bridge that connects to the Waku v1 prod fleet and the (soon to be deployed?) status.prod fleet. This will automatically start bridging messages between the two networks. The bridge should be built (make bridge) off the latest nim-waku master. To connect to a v1 network, the bridge can use either the --fleet-v1 option, e.g.

./build/wakubridge --fleet-v1:test

or a direct staticnode config, e.g.

./build/wakubridge --staticnode-v1:<enode-url>

To connect to a v2 network the bridge uses --staticnode-v2 with the multiaddr of a peer inside the desired v2 network:

./build/wakubridge --staticnode-v2:<peer-multiaddr>
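
Putting both halves together, a complete invocation might look like this (a sketch only; whether prod is a valid --fleet-v1 value here is an assumption, and the multiaddr is a placeholder):

./build/wakubridge --fleet-v1:prod --staticnode-v2:<peer-multiaddr>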

NOTE: The bridge does not (yet) support the Waku v2 Admin API (e.g. post_waku_v2_admin_v1_peers to connect the bridge to v2 nodes via an API call), nor will it persist its own peers across a restart. I therefore recommend connecting the "normal" nodes in the v2 fleet to the bridge rather than vice versa. Those nodes can be pointed at the bridge's multiaddr using post_waku_v2_admin_v1_peers; they will keep the connection alive and reconnect to the bridge after a restart.
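
As a sketch, such a call from a v2 fleet node might look like this (assuming the node's JSON-RPC endpoint listens on 127.0.0.1:8545, as in the deployments below; the exact params shape should be checked against the nim-waku Admin API docs):

curl -s http://127.0.0.1:8545 -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"post_waku_v2_admin_v1_peers","params":[["<bridge-multiaddr>"]]}'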

What other config does the bridge support?

The bridge's v1 and v2 keys can be set using --nodekey-v1:<v1-private-key-as-hex> and --nodekey-v2:<v2-private-key-as-hex>. The bridge supports v1 RPC calls and the v2 debug, relay, filter and store RPC calls. Basic health checks over RPC should therefore be possible on the bridge.
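
For example, a minimal health check over JSON-RPC could look like this (get_waku_v2_debug_v1_info and port 8545 both appear later in this thread; the JSON-RPC envelope is the standard one):

curl -s http://127.0.0.1:8545 -H 'Content-Type: application/json' \
  -d '{"jsonrpc":"2.0","id":1,"method":"get_waku_v2_debug_v1_info","params":[]}' | jq .result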

jakubgs commented 2 years ago

I've built and pushed an image for the bridge, called statusteam/nim-waku:deploy-bridge-test, using the Dockerfile from the nim-waku repo:

 > d run --rm -it statusteam/nim-waku:deploy-bridge-test --help
Usage: 

wakubridge [OPTIONS]...

The following options are available:

 --log-level               Sets the log level [=LogLevel.INFO].
...

https://hub.docker.com/r/statusteam/nim-waku/tags
https://ci.status.im/job/nim-waku/job/deploy-bridge-test/

And made a PR to rename the bridge target to wakubridge to work with the Dockerfile: https://github.com/status-im/nim-waku/pull/886
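
For reference, running the published image with the flags discussed above would look roughly like this (a sketch; the port mappings are an assumption based on the compose output further down):

docker run --rm -it -p 30303:30303 -p 9000:9000 \
  statusteam/nim-waku:deploy-bridge-test \
  --fleet-v1:test --staticnode-v2:<peer-multiaddr>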

jakubgs commented 2 years ago

One question: does the bridge support any protocol flags other than --relay?

jm-clius commented 2 years ago

One question: does the bridge support any protocol flags other than --relay?

No, only relay is necessary for now (it should be enabled by default in any case, but good to be explicit :) )

jakubgs commented 2 years ago

I see no --dns4-domain-name flag, but there is a boolean --dns-addrs, so I'm not sure how that's supposed to work:

 --dns-addrs               Enable resolution of `dnsaddr`, `dns4` or `dns6`
                               multiaddrs [=true].
 --dns-addrs-name-server   DNS name server IPs to query for DNS multiaddrs
                               resolution. Argument may be repeated.
                               [=@[ValidIpAddress.init("1.1.1.1"),
                               ValidIpAddress.init("1.0.0.1")]].
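
So --dns-addrs seems to cover only resolving dns multiaddrs when dialing, not advertising one. Based on the repeated-argument note in the help text, overriding the resolvers would presumably look like this (the `:` syntax is an assumption carried over from the other flags in this thread):

./build/wakubridge --dns-addrs-name-server:8.8.8.8 --dns-addrs-name-server:8.8.4.4
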
jakubgs commented 2 years ago

I've deployed the hosts: https://github.com/status-im/infra-status/commit/ac96e859
And configured the nodes: https://github.com/status-im/infra-status/commit/a95bc519

admin@bridge-01.do-ams3.status.test:/docker/nim-waku-bridge % dc ps
     Name                    Command               State                                            Ports                                          
---------------------------------------------------------------------------------------------------------------------------------------------------
nim-waku-bridge   /usr/bin/wakunode --log-le ...   Up      0.0.0.0:30303->30303/tcp, 60000/tcp, 0.0.0.0:8008->8008/tcp, 127.0.0.1:8545->8545/tcp,  
                                                           0.0.0.0:9000->9000/tcp 

But it sure is spamming the logs a lot with:

NOT 2022-03-10 12:42:50.888+00:00 No peers for topic, skipping publish       topics="libp2p gossipsub" tid=1 peersOnTopic=0 connectedPeers=0 topic=/waku/2/default-waku/proto
admin@bridge-01.do-ams3.status.test:~ % grep 'No peers' /var/log/docker/nim-waku-bridge/docker.log | wc -l
1441

That's quite a lot for just 4 minutes of the node running, especially for a NOTICE-level message.

Next step is to connect the peers.

jm-clius commented 2 years ago

Thanks, Jakub. Yes, that log should disappear once we have v2 peers connected.

Unfortunately it's logged at notice level in an underlying library, so there's no straightforward way for us to suppress it (other than getting the v2 peers connected).
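
A crude workaround in the meantime is to filter the message out when following the logs, e.g.:

tail -f /var/log/docker/nim-waku-bridge/docker.log | grep -v 'No peers for topic'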

jakubgs commented 2 years ago

Some more changes, had to rename Ansible groups a bit to simplify things:

But now we have fewer config files.

jakubgs commented 2 years ago

Since I needed some way to connect both the fleet peers and the bridge to the fleet, I've extracted the peer-connection logic into a separate Ansible role and implemented most of it in Python, since that's faster and easier to modify:
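
Roughly: query Consul for the registered service, pull the peer address out of the service meta, and feed it to the node's Admin API. A hedged shell equivalent (the Consul catalog endpoints are standard, but the service name here is a guess and the meta key is taken from later in this thread):

# Look up the bridge's multiaddr in the Consul catalog.
BRIDGE_ADDR=$(curl -s localhost:8500/v1/catalog/service/nim-waku-bridge | jq -r '.[0].ServiceMeta.node_enode')
# Tell the local nim-waku node to connect to it.
curl -s localhost:8545 -H 'Content-Type: application/json' \
  -d "{\"jsonrpc\":\"2.0\",\"id\":1,\"method\":\"post_waku_v2_admin_v1_peers\",\"params\":[[\"$BRIDGE_ADDR\"]]}"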

jakubgs commented 2 years ago

And it appears to work as expected:

2022-03-10 16:24:30,669 [INFO] Connecting to Consul: localhost:8500
2022-03-10 16:24:30,675 [INFO] Found 5 data centers.
2022-03-10 16:24:30,678 [DEBUG] Service: bridge-01.do-ams3.status.test (env:status,stage:test,nim,waku,bridge)
2022-03-10 16:24:31,060 [INFO] Found 1 services.
2022-03-10 16:24:31,060 [INFO] Calling JSON RPC: localhost:8545
2022-03-10 16:24:31,066 [INFO] SUCCESS

Change: https://github.com/status-im/infra-status/commit/b2142fd8

jakubgs commented 2 years ago

Except I don't see the bridge peer in the list of connected peers for the nodes:

admin@bridge-01.do-ams3.status.test:~ % /docker/nim-waku-bridge/rpc.sh get_waku_v2_debug_v1_info | jq -c .result.listenAddresses  
["/ip4/0.0.0.0/tcp/9000/p2p/16Uiu2HAmLwrpAgicPqsNtGprzGVudfFAPFT9iQ972MtVDcpn4Ucx"]
admin@node-01.gc-us-central1-a.status.test:~ % /docker/nim-waku/rpc.sh get_waku_v2_admin_v1_peers | jq '.result[].multiaddr'
"/ip4/47.242.233.36/tcp/30303/p2p/16Uiu2HAm2BjXxCp1sYFJQKpLLbPbwd5juxbsYofu3TsS3auvT9Yi"
"/ip4/64.225.81.237/tcp/30303/p2p/16Uiu2HAkukebeXjTQ9QDBeNDWuGfbaSg79wkkhK4vPocLgR6QFDf"
jakubgs commented 2 years ago

Oh, I see what's happening, I didn't extract the enode into the Consul service definition correctly:

admin@bridge-01.do-ams3.status.test:~ % sudo jq '.services[0].meta' /etc/consul/service_nim_waku_bridge.json
{
  "node_enode": "unknown"
}

Fixed in: https://github.com/status-im/infra-status/commit/74f5ff8b

jakubgs commented 2 years ago

But wait a second, the get_waku_v2_debug_v1_info call on the bridge returns a multiaddress with 0.0.0.0 as the IP.

admin@bridge-01.do-ams3.status.test:/docker/nim-waku-bridge % ./rpc.sh get_waku_v2_debug_v1_info | jq -c .result.listenAddresses
["/ip4/0.0.0.0/tcp/9000/p2p/16Uiu2HAmLwrpAgicPqsNtGprzGVudfFAPFT9iQ972MtVDcpn4Ucx"]
admin@bridge-01.do-ams3.status.test:/docker/nim-waku-bridge % grep extip docker-compose.yml                              
      --nat=extip:134.209.133.76

And I'm clearly setting extip in the --nat flag. @jm-clius any ideas?

jakubgs commented 2 years ago

Fixed by just replacing the 0.0.0.0 string with the proper IP for now:
https://github.com/status-im/infra-status/commit/6e200169
https://github.com/status-im/infra-status/blob/6e200169fd5005cadce3b9c8432fe3ffdc274a4e/ansible/roles/nim-waku-bridge/tasks/query.yml#L28-L32
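
Conceptually the workaround amounts to something like this in the query step (a sketch; the actual fix lives in the Ansible task linked above):

./rpc.sh get_waku_v2_debug_v1_info \
  | jq -r '.result.listenAddresses[0]' \
  | sed 's|/ip4/0.0.0.0/|/ip4/134.209.133.76/|'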

jakubgs commented 2 years ago

Ok, now it looks like they are connecting:

admin@node-01.do-ams3.status.test:/docker/nim-waku % /docker/nim-waku/rpc.sh get_waku_v2_admin_v1_peers | jq '.result[].multiaddr'
"/dns4/node-01.gc-us-central1-a.status.test.statusim.net/tcp/30303/p2p/16Uiu2HAmGDX3iAFox93PupVYaHa88kULGqMpJ7AEHGwj3jbMtt76"
"/ip4/134.209.133.76/tcp/9000/p2p/16Uiu2HAmLwrpAgicPqsNtGprzGVudfFAPFT9iQ972MtVDcpn4Ucx"
"/dns4/node-01.ac-cn-hongkong-c.status.test.statusim.net/tcp/30303/p2p/16Uiu2HAm2BjXxCp1sYFJQKpLLbPbwd5juxbsYofu3TsS3auvT9Yi"

Also improved the connection script a bit:

jakubgs commented 2 years ago

And the logs look healthier now too:

DBG 2022-03-10 17:44:05.858+00:00 Incoming WakuRelay connection              topics="wakurelay" tid=1
DBG 2022-03-10 17:44:05.858+00:00 starting pubsub read loop                  topics="libp2p pubsubpeer" tid=1 conn=16U*bMtt76:622a38e55379b808262b98d1 peer=16U*bMtt76 closed=false
DBG 2022-03-10 17:44:12.028+00:00 Incoming WakuRelay connection              topics="wakurelay" tid=1
DBG 2022-03-10 17:44:12.029+00:00 starting pubsub read loop                  topics="libp2p pubsubpeer" tid=1 conn=16U*uvT9Yi:622a38eb5379b808262b98d2 peer=16U*uvT9Yi closed=false

@jm-clius Though I do wonder why I'm seeing debug messages when my log level is info:

admin@bridge-01.do-ams3.status.test:/docker/nim-waku-bridge % grep log-level docker-compose.yml
      --log-level=info
jakubgs commented 2 years ago

Also moved bridge setup to before node setup, since otherwise it makes no sense: https://github.com/status-im/infra-status/commit/5c37c818

jakubgs commented 2 years ago

Ok, prod is connected too:

admin@bridge-01.do-ams3.status.prod:~ % sudo jq '.services[0].meta' /etc/consul/service_nim_waku_bridge.json
{
  "node_enode": "/ip4/161.35.244.35/tcp/9000/p2p/16Uiu2HAm1JGyYjjraM95y9wK4WFjg7k79H1xAGWGU8FTXXczjcbW"
}
admin@node-01.do-ams3.status.prod:~ % /docker/nim-waku/rpc.sh get_waku_v2_admin_v1_peers | jq '.result[].multiaddr'
"/dns4/node-02.gc-us-central1-a.status.prod.statusim.net/tcp/30303/p2p/16Uiu2HAmDQugwDHM3YeUp86iGjrUvbdw3JPRgikC7YoGBsT2ymMg"
"/dns4/node-01.ac-cn-hongkong-c.status.prod.statusim.net/tcp/30303/p2p/16Uiu2HAkvEZgh3KLwhLwXg95e5ojM8XykJ4Kxi2T7hk22rnA7pJC"
"/dns4/node-02.ac-cn-hongkong-c.status.prod.statusim.net/tcp/30303/p2p/16Uiu2HAmFy8BrJhCEmCYrUfBdSNkrPw6VHExtv4rRp1DSBnCPgx8"
"/ip4/161.35.244.35/tcp/9000/p2p/16Uiu2HAm1JGyYjjraM95y9wK4WFjg7k79H1xAGWGU8FTXXczjcbW"
"/dns4/node-02.do-ams3.status.prod.statusim.net/tcp/30303/p2p/16Uiu2HAmSve7tR5YZugpskMv2dmJAsMUKmfWYEKRXNUxRaTCnsXV"
"/dns4/node-01.gc-us-central1-a.status.prod.statusim.net/tcp/30303/p2p/16Uiu2HAkwBp8T6G77kQXSNMnxgaMky1JeyML5yqoTHRM8dbeCBNb"

I guess it's neat that the bridge is the only one without a DNS name in its multiaddress, so it's easy to spot.

Now, whether this works or not is an entirely separate question.

jakubgs commented 2 years ago

Based on prod metrics I think it works:

admin@bridge-01.do-ams3.status.prod:/docker/nim-waku-bridge % c 0:8008/metrics | grep bridge_transfers
# HELP waku_bridge_transfers Number of messages transferred between Waku v1 and v2 networks
# TYPE waku_bridge_transfers counter
waku_bridge_transfers_total{type="v1_to_v2"} 83473.0
waku_bridge_transfers_created{type="v1_to_v2"} 1646920686.0

Probably.

jakubgs commented 2 years ago

Also added nim-waku-bridge to Prometheus scrape jobs: https://github.com/status-im/infra-hq/commit/a1f07555

And we have some metrics:

[metrics dashboard screenshot]
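
The kind of query behind such a graph can also be fetched from Prometheus directly, e.g. (assuming the instance is reachable at localhost:9090):

curl -s 'http://localhost:9090/api/v1/query' --data-urlencode 'query=rate(waku_bridge_transfers_total[5m])'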

jm-clius commented 2 years ago

Great! Thanks, Jakub.

jakubgs commented 2 years ago

I do find it weird how the message rate for the test and prod fleets is about the same:

[message rate graph for the test and prod fleets]

But that might just mean that eth.test and eth.prod are connected and share messages.

jm-clius commented 2 years ago

But that might just mean that eth.test and eth.prod are connected and share messages.

Yeah, I noticed this too and also assumed they carry the same traffic.

jakubgs commented 2 years ago

I think this is done.