waku-org / nwaku

Waku node and protocol.

bug: lightpush fails with `Failed to request a message push: dial_failure` after the peer node restart #2567

Closed fbarbu15 closed 5 months ago

fbarbu15 commented 5 months ago

To reproduce

  1. Start relay node and subscribe to a topic
  2. Start lightpush node and connect it to the node above
  3. Check that lightpush works
  4. Restart relay node and re-subscribe to a topic
  5. Check that lightpush works

Expected behavior

Lightpush should still work after the relay node restarts.

Actual behavior

Failed to request a message push: dial_failure

Script to reproduce it:

printf "\nAssuming you already have a docker network called waku\n"
# If not, something like this should create it: docker network create --driver bridge --subnet --gateway waku


printf "\nStarting containers\n"

container_id1=$(docker run -d -i -t -p 37343:37343 -p $tcp_port:$tcp_port -p 37345:37345 -p 37346:37346 -p 37347:37347 $node_1 --listen-address= --rest=true --rest-admin=true --websocket-support=true --log-level=TRACE --rest-relay-cache-capacity=100 --websocket-port=37345 --rest-port=37343 --tcp-port=$tcp_port --discv5-udp-port=37346 --rest-address= --nat=extip:$ext_ip --peer-exchange=true --discv5-discovery=true --cluster-id=$cluster_id --metrics-server=true --metrics-server-address= --metrics-server-port=37347 --metrics-logging=true --pubsub-topic=/waku/2/rs/2/0 --lightpush=true --relay=true)
docker network connect --ip $ext_ip waku $container_id1

printf "\nSleeping 2 seconds\n"
sleep 2

response=$(curl -X GET "" -H "accept: application/json")
enrUri=$(echo $response | jq -r '.enrUri')

# Extract the first non-WebSocket address
ws_address=$(echo $response | jq -r '.listenAddresses[] | select(contains("/ws") | not)')

# Check if we got an address, and construct the new address with it
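if [[ $ws_address != "" ]]; then
    identifier=$(echo $ws_address | awk -F'/p2p/' '{print $2}')
    if [[ $identifier != "" ]]; then
        # Assumed reconstruction: the original assignment was lost from this copy of the script.
        multiaddr_with_id="/ip4/${ext_ip}/tcp/${tcp_port}/p2p/${identifier}"
        echo $multiaddr_with_id
    else
        echo "No identifier found in the address."
        exit 1
    fi
else
    echo "No non-WebSocket address found."
    exit 1
fi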

container_id2=$(docker run -d -i -t -p 25908:25908 -p 25909:25909 -p 25910:25910 -p 25911:25911 -p 25912:25912 $node_2 --listen-address= --rest=true --rest-admin=true --websocket-support=true --log-level=TRACE --rest-relay-cache-capacity=100 --websocket-port=25910 --rest-port=25908 --tcp-port=25909 --discv5-udp-port=25911 --rest-address= --nat=extip: --peer-exchange=true --discv5-discovery=true --cluster-id=$cluster_id --pubsub-topic=/waku/2/rs/2/0 --lightpush=true --relay=false --discv5-bootstrap-node=$enrUri --lightpushnode=$multiaddr_with_id)

docker network connect --ip waku $container_id2

printf "\nSleeping 10 seconds\n"
sleep 10

printf "\nSubscribe\n"
curl -v -X POST "" -H "Content-Type: application/json" -d '["/waku/2/rs/2/0"]'

printf "\nSleeping 2 seconds\n"
sleep 2

printf "\nLightpush message on subscribed pubsub topic\n"
curl -v -X POST "" -H "Content-Type: application/json" -d '{"pubsubTopic": "/waku/2/rs/2/0", "message": {"payload": "TGlnaHQgcHVzaCB3b3JrcyEh", "contentTopic": "/myapp/1/latest/proto", "timestamp": 1712149720320589312}}'

printf "\nRestarting NODE 1\n"  
docker restart $container_id1

printf "\nSleeping 10 seconds\n"
sleep 10

printf "\nSubscribe\n"
curl -v -X POST "" -H "Content-Type: application/json" -d '["/waku/2/rs/2/0"]'

printf "\nSleeping 2 seconds\n"
sleep 2

printf "\nLightpush message on subscribed pubsub topic\n"
curl -v -X POST "" -H "Content-Type: application/json" -d '{"pubsubTopic": "/waku/2/rs/2/0", "message": {"payload": "TGlnaHQgcHVzaCB3b3JrcyEh", "contentTopic": "/myapp/1/latest/proto", "timestamp": 1712149720320589312}}'

Logs lightpush_node.log relay_node.log

gabrielmer commented 5 months ago

This error happens because restarting the lightpush service node container (container_id1) generates a new multiaddress: without a fixed node key, the node creates a new key, and therefore a new peer ID, on every start. The lightpush client node keeps dialing the old multiaddress and gets no response.
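
One way to confirm this diagnosis is to compare the peer ID (the part after `/p2p/`) in the service node's listen address before and after the restart. The following is only a minimal sketch that reuses the `awk` split from the script above; `peer_id_of` and `same_peer` are hypothetical helper names, and the sample multiaddresses are made up for illustration:

```shell
# Hypothetical helpers, not part of nwaku: extract and compare peer IDs.

# Print the peer ID of a multiaddress, i.e. everything after "/p2p/".
peer_id_of() {
    printf '%s\n' "$1" | awk -F'/p2p/' '{print $2}'
}

# Succeed only if both multiaddresses carry the same, non-empty peer ID.
same_peer() {
    before=$(peer_id_of "$1")
    after=$(peer_id_of "$2")
    [ -n "$before" ] && [ "$before" = "$after" ]
}

# Made-up example addresses: same IP and port, different peer IDs.
addr_before="/ip4/172.18.0.2/tcp/60000/p2p/16Uiu2HAmExampleBefore"
addr_after="/ip4/172.18.0.2/tcp/60000/p2p/16Uiu2HAmExampleAfter"

if same_peer "$addr_before" "$addr_after"; then
    echo "peer ID unchanged"
else
    echo "peer ID changed: the old multiaddress can no longer be dialed"
fi
```

In the reproduction script, the two arguments would be the non-WebSocket listen address captured before `docker restart` and the one captured after it.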

To fix that, we have to start the service node with the --nodekey parameter. For example, using '--nodekey=6a29e767c96a2a380bb66b9a6ffcd6eb54049e14d796a1d866307b8beb7aee58'

That is, replace line 15 of the script (the `container_id1=$(docker run ...)` line) with:

container_id1=$(docker run -d -i -t -p 37343:37343 -p $tcp_port:$tcp_port -p 37345:37345 -p 37346:37346 -p 37347:37347 $node_1 --listen-address= --rest=true --rest-admin=true --websocket-support=true --log-level=TRACE --rest-relay-cache-capacity=100 --websocket-port=37345 --rest-port=37343 --tcp-port=$tcp_port --discv5-udp-port=37346 --rest-address= --nat=extip:$ext_ip --peer-exchange=true --discv5-discovery=true --cluster-id=$cluster_id --metrics-server=true --metrics-server-address= --metrics-server-port=37347 --metrics-logging=true --pubsub-topic=/waku/2/rs/2/0 --lightpush=true --relay=true --nodekey=6a29e767c96a2a380bb66b9a6ffcd6eb54049e14d796a1d866307b8beb7aee58)

With a fixed node key, no new multiaddress is generated after the container restarts, and the dial failures no longer occur.
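
The key above is just a fixed example value of 64 hexadecimal characters (32 bytes). If a test run should use its own key, something like the following could generate one. This is a minimal sketch, assuming `/dev/urandom`, `head -c`, `od`, and `tr` are available; `generate_nodekey` is a hypothetical helper name, not an nwaku tool:

```shell
# Hypothetical helper: print 32 random bytes as 64 lowercase hex characters.
generate_nodekey() {
    head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
}

nodekey=$(generate_nodekey)

# Sanity check: only hex digits, exactly 64 of them.
case "$nodekey" in
    *[!0-9a-f]*) echo "unexpected characters in node key" >&2; exit 1 ;;
esac
[ "${#nodekey}" -eq 64 ] || { echo "unexpected node key length" >&2; exit 1; }

# Pass the same value on every start so the peer ID survives restarts.
echo "--nodekey=$nodekey"
```

The key has to be generated once and reused for every start of the same container; generating a fresh key on each start would bring the dial failure back.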

@fbarbu15 please confirm if it makes sense and works for you too

fbarbu15 commented 5 months ago

thanks, this fixes the test!