waku-org / go-waku

Go implementation of Waku v2 protocol
https://waku.org/
Other
113 stars 44 forks source link

bug/regression: DELETE /relay/v1/subscriptions freezes in certain conditions #1034

Open fbarbu15 opened 6 months ago

fbarbu15 commented 6 months ago

Describe the bug This seems as a regression as it started to happen on 2/9/2024. It also happens only in certain conditions as described bellow

To Reproduce

  1. Have 2 nodes connected via relay protocol
  2. Subscribe both nodes to 5 topics. ex: POST /relay/v1/subscriptions with payload: ["/waku/2/rs/0/0", "/waku/2/rs/0/1", "/waku/2/rs/0/9", "/waku/2/rs/0/25", "/waku/2/rs/0/1000"]
  3. (optional) publish messages and check that the message relay works
  4. Delete all 5 subscriptions from the 1st node ex: DELETE /relay/v1/subscriptions with payload: ["/waku/2/rs/0/0", "/waku/2/rs/0/1", "/waku/2/rs/0/9", "/waku/2/rs/0/25", "/waku/2/rs/0/1000"]
  5. Try to delete the subscriptions from the 2nd node

Actual behavior Request freezes and is only stops if the client times out

The issue only reproduces if node1 unsubscribes first. If we reverse the order it works fine The issue only reproduces if node2 tries to unsubscribes from 4 or more topics. It work is it's bellow 4/

go-waku version/commit hash

harbor.status.im/wakuorg/go-waku:latest

node1.log node2.log

node1 args ['--listen-address=0.0.0.0', '--rest=true', '--rest-admin=true', '--websocket-support=true', '--log-level=DEBUG', '--rest-relay-cache-capacity=100', '--websocket-port=53777', '--rest-port=53775', '--tcp-port=53776', '--discv5-udp-port=53778', '--rest-address=0.0.0.0', '--nat=extip:172.18.230.234', '--peer-exchange=true', '--discv5-discovery=true', '--cluster-id=0', '--min-relay-peers-to-publish=1', '--legacy-filter=false', '--rest-filter-cache-capacity=50', '--relay=true', '--nodekey=30348dd51465150e04a5d9d932c72864c8967f806cce60b5d26afeca1e77eb68']

node2 args ['--listen-address=0.0.0.0', '--rest=true', '--rest-admin=true', '--websocket-support=true', '--log-level=DEBUG', '--rest-relay-cache-capacity=100', '--websocket-port=5557', '--rest-port=5555', '--tcp-port=5556', '--discv5-udp-port=5558', '--rest-address=0.0.0.0', '--nat=extip:172.18.188.172', '--peer-exchange=true', '--discv5-discovery=true', '--cluster-id=0', '--min-relay-peers-to-publish=1', '--legacy-filter=false', '--rest-filter-cache-capacity=50', '--relay=true', '--discv5-bootstrap-node=enr:-Ly4QOJ7KFVvRp5EttMNtjmNbGtF-hYIG8l6RSTgj7Jb8K1HZUgHCA234U-gSw2TGgx3cYaRCRcTH2DUzW2O7bhtuv6GAY2r1df1gmlkgnY0gmlwhKwRAAKKbXVsdGlhZGRyc5gACgR_AAABBtIR3QMACgSsEQACBtIR3QOJc2VjcDI1NmsxoQM3Tqpf5eFn4Jztm4gB0Y0JVSJyxyZsW8QR-QU5DZb-PYN0Y3CC0hCDdWRwgtIShXdha3UyAQ']

chaitanyaprem commented 5 months ago

You are using cluster 0 which means the node will be subscribed to default pubsubTopic which doesn't follow sharding format. Later you are subscribing the node to sharded topics. This is not recommended usage. Can you rerun this test with cluster-ID other than 0 and confirm the behaviour.

fbarbu15 commented 5 months ago

I tried to do this but something is wrong, when I use a different cluster ID, go-waku doesn't receive the messages published from nwaku, not sure what am I doing wrong. If I use nwaku -> nwaku everything is fine with any cluster ID (0 or >0) So the problem is with go-waku or I'm missing something when I'm starting the nodes

I've created a raw script that reproduces this issue, can you please have a look? Thanks gowaku_fails_to_fetch_messages.zip

fbarbu15 commented 5 months ago

I found why I had the issue from my last comment on clusters different than 0. Seems for other clusters I need to pass the --pubsub-topic as a docker flag for this to work, it's not enough to do a relay/v1/subscriptions. On nwaku side it works both ways to subscribe to a topic, via config(docker flag) or via API call.

@chaitanyaprem do you think this behavior is ok?

Also I can confirm that this bug with DELETE /relay/v1/subscriptions freeze disappears on other cluster IDs

chaitanyaprem commented 5 months ago

I found why I had the issue from my last comment on clusters different than 0. Seems for other clusters I need to pass the --pubsub-topic as a docker flag for this to work, it's not enough to do a relay/v1/subscriptions. On nwaku side it works both ways to subscribe to a topic, via config(docker flag) or via API call.

@chaitanyaprem do you think this behavior is ok?

Also I can confirm that this bug with DELETE /relay/v1/subscriptions freeze disappears on other cluster IDs

That is odd, subscribing to a pubsubTopic via REST API should also work. Can you share debug logs when you invoke this command, would like to see what could be the issue?

fbarbu15 commented 5 months ago

Logs: logs_when_pubsub_topic_is_passed.zip logs_when_pubusub_topic_is_not_passed.zip

Node start:

docker run -i -t -p 9503:9503 -p 9504:9504 -p 9505:9505 -p 9506:9506 -p 9507:9507 harbor.status.im/wakuorg/nwaku:latest --listen-address=0.0.0.0 --rest=true --rest-admin=true --websocket-support=true --log-level=TRACE --rest-relay-cache-capacity=100 --websocket-port=9505 --rest-port=9503 --tcp-port=9504 --discv5-udp-port=9506 --rest-address=0.0.0.0 --nat=extip:172.18.239.110 --peer-exchange=true --discv5-discovery=true --cluster-id=2 --metrics-server=true --metrics-server-address=0.0.0.0 --metrics-server-port=9507 --metrics-logging=true --relay=true --nodekey=30348dd51465150e04a5d9d932c72864c8967f806cce60b5d26afeca1e77eb68 
docker run -i -t -p 27921:27921 -p 27922:27922 -p 27923:27923 -p 27924:27924 -p 27925:27925 harbor.status.im/wakuorg/go-waku:latest --listen-address=0.0.0.0 --rest=true --rest-admin=true --websocket-support=true --log-level=DEBUG --rest-relay-cache-capacity=100 --websocket-port=27923 --rest-port=27921 --tcp-port=27922 --discv5-udp-port=27924 --rest-address=0.0.0.0 --nat=extip:172.18.133.187 --peer-exchange=true --discv5-discovery=true --cluster-id=2 --min-relay-peers-to-publish=1 --rest-filter-cache-capacity=50 --relay=true --discv5-bootstrap-node=enr:-LO4QJVGXwyXhEyIRPgpgp8x_0-jI95hMl6U5HYM-QQbA1H9Acaov_FI3Y-KQ33tTaST4Tsr29Au2K1_bH7bkQQT8H0BgmlkgnY0gmlwhKwS726KbXVsdGlhZGRyc4wACgSsEu9uBiUh3QOCcnOFAAIBAACJc2VjcDI1NmsxoQM3Tqpf5eFn4Jztm4gB0Y0JVSJyxyZsW8QR-QU5DZb-PYN0Y3CCJSCDdWRwgiUihXdha3UyAQ

docker network connect --ip 172.18.239.110 waku 770e6f73565eb80511dab2cc8c7c6e43abf03a8ee79245e5bf39901f2217dba6
docker network connect --ip 172.18.133.187 waku 45e51f2a4ecadec78cc8033d5528ac90d612fbbe6fc9098124083d4c4965e029

Subscribe requests:

curl -v -X POST "http://127.0.0.1:9503/relay/v1/subscriptions" -H "Content-Type: application/json" -d '["/waku/2/rs/2/0"]'
curl -v -X POST "http://127.0.0.1:27921/relay/v1/subscriptions" -H "Content-Type: application/json" -d '["/waku/2/rs/2/0"]'

Nwaku publish:

curl -v -X POST "http://127.0.0.1:9503/relay/v1/messages/%2Fwaku%2F2%2Frs%2F2%2F0" -H "Content-Type: application/json" -d '{"payload": "UmVsYXkgd29ya3MhIQ==", "contentTopic": "/test/1/waku-relay/proto", "timestamp": '$(date +%s%N)'}'

Go-waku get message:

curl -v -X GET "http://127.0.0.1:27921/relay/v1/messages/%2Fwaku%2F2%2Frs%2F2%2F0" -H "Content-Type: application/json"
chaitanyaprem commented 5 months ago

I found why I had the issue from my last comment on clusters different than 0. Seems for other clusters I need to pass the --pubsub-topic as a docker flag for this to work, it's not enough to do a relay/v1/subscriptions. On nwaku side it works both ways to subscribe to a topic, via config(docker flag) or via API call.

Went through the logs and in all cases i see 200 OK returned for POST subscription. So, i am not able to understand where the issue is . Can you elaborate more on how you think subscription is failing when pubsub topic is not passed in the arguments and only REST API is used?

fbarbu15 commented 5 months ago

Sure, here's the steps:

Issue happens only if NODE1 is nwaku and NODE2 is go-waku Issue doesn't happen if:

Here's a shell script that you can run directly to reproduce the issue:

#!/bin/bash
printf "\nAssuming you already have a docker network called waku\n"
# if not something like this should create it: docker network create --driver bridge --subnet 172.18.0.0/16 --gateway 172.18.0.1 waku

cluster_id=2
pubsub_topic="/waku/2/rs/$cluster_id/0"
encoded_pubsub_topic=$(echo "$pubsub_topic" | sed 's:/:%2F:g')
node_1=harbor.status.im/wakuorg/nwaku:latest
node_2=harbor.status.im/wakuorg/go-waku:latest

printf "\nStarting containers\n"

container_id1=$(docker run -d -i -t -p 12297:12297 -p 12298:12298 -p 12299:12299 -p 12300:12300 -p 12301:12301 $node_1 --listen-address=0.0.0.0 --rest=true --rest-admin=true --websocket-support=true --log-level=DEBUG --rest-relay-cache-capacity=100 --websocket-port=12299 --rest-port=12297 --tcp-port=12298 --discv5-udp-port=12300 --rest-address=0.0.0.0 --nat=extip:172.18.45.95 --peer-exchange=true --discv5-discovery=true --cluster-id=$cluster_id --relay=true --nodekey=30348dd51465150e04a5d9d932c72864c8967f806cce60b5d26afeca1e77eb68)
docker network connect --ip 172.18.45.95 waku $container_id1

printf "\nSleeping 2 seconds\n"
sleep 2

response=$(curl -X GET "http://127.0.0.1:12297/debug/v1/info" -H "accept: application/json")
enrUri=$(echo $response | jq -r '.enrUri')

container_id2=$(docker run -d -i -t -p 8158:8158 -p 8159:8159 -p 8160:8160 -p 8161:8161 -p 8162:8162 $node_2 --listen-address=0.0.0.0 --rest=true --rest-admin=true --websocket-support=true --log-level=DEBUG --rest-relay-cache-capacity=100 --websocket-port=8160 --rest-port=8158 --tcp-port=8159 --discv5-udp-port=8161 --rest-address=0.0.0.0 --nat=extip:172.18.207.159 --peer-exchange=true --discv5-discovery=true --cluster-id=$cluster_id --relay=true --discv5-bootstrap-node=$enrUri)

docker network connect --ip 172.18.207.159 waku $container_id2

printf "\nSleeping 1 seconds\n"
sleep 1

printf "\nSubscribe\n"
curl -X POST "http://127.0.0.1:12297/relay/v1/subscriptions" -H "Content-Type: application/json" -d "[\"$pubsub_topic\"]"
curl -X POST "http://127.0.0.1:8158/relay/v1/subscriptions" -H "Content-Type: application/json" -d "[\"$pubsub_topic\"]"

printf "\nSleeping 70 seconds to be sure that the nodes connect via discv5\n"
sleep 70

printf "\nRelay from NODE 1\n"                            
curl -X POST "http://127.0.0.1:12297/relay/v1/messages/$encoded_pubsub_topic" -H "Content-Type: application/json" -d '{"payload": "UmVsYXkgd29ya3MhIQ==", "contentTopic": "/test/1/waku-relay/proto", "timestamp": '$(date +%s%N)'}'

printf "\nSleeping 1 seconds\n"
sleep 1

printf "\nCheck message in NODE 2\n"
response=$(curl -X GET "http://127.0.0.1:8158/relay/v1/messages/$encoded_pubsub_topic" -H "Content-Type: application/json")

printf "\nResponse: $response\n"

if [ "$response" == "no subscription found for pubsubTopic" ] || [ "$response" == "[]" ] || [ -z "$response" ] || [ "$response" == "null" ]; then
    printf "\nError: NODE 2 didn't find the message, the response is null, or the response contains the text 'null'\n"
    exit 1
else
    printf "\nSuccess: NODE 2 received the message\n"
fi
fbarbu15 commented 5 months ago

Update, I see nwaku was also changed to work as go-waku :) So with the latest version of nwaku if we don't pass --pubsub-topic the peer cannot retrieve the published messages

fbarbu15 commented 4 months ago

Found that the bug (DELETE /relay/v1/subscriptions freezes in certain conditions) actually still reproduces on cluster id 2 but not always