waku-org / nwaku

Waku node and protocol.
Other
198 stars 50 forks source link

bug: failed to retrieve peer info via peer exchange protocol #2875

Open romanzac opened 1 month ago

romanzac commented 1 month ago

Problem

During interop test_get_peers execution, docker container reported an error:

TRC 2024-07-04 07:44:52.943+00:00 Dialing peer failed                        topics="waku node peer_manager" tid=1 file=peer_manager.nim:250 peerId=16U*92ceLW reason="timed out" proto=/vac/waku/peer-exchange/2.0.0-alpha1
WRN 2024-07-04 07:44:52.943+00:00 failed to retrieve peer info via peer exchange protocol topics="waku node" tid=1 file=waku_node.nim:1111 error=dial_failure
ERR 2024-07-04 07:44:52.943+00:00 error while fetching peers from peer exchange tid=1 file=node_factory.nim:360 error="Peer exchange failure: dial_failure"

This issues is only happening for nwaku. Peer exchange works fine for go-waku, going through the same steps.

Impact

High occurrence, low impact.

To reproduce

  1. Please checkout https://github.com/waku-org/waku-interop-tests/pull/51/commits/ae2757d83b273b1bf4f7dce7d560e3a1058c9f73
  2. cd waku-interop-tests
  3. python -m venv .venv
  4. source .venv/bin/activate
  5. pip install -r requirements.txt
  6. pre-commit install
  7. pytest tests/peer_exchange/test_peer_exchange.py -k 'test_get_peers'

Expected behavior

Peer exchange requestor (node3) would be able to retrieve peers from peer exchange responder (node2).

Screenshots/logs

node1_2024-07-04_09-44-1719dc257d-3891-43f6-935a-6735eb6d5a7dharbor.status.im_wakuorg_nwaku:latest.log node2_2024-07-04_09-44-1719dc257d-3891-43f6-935a-6735eb6d5a7dharbor.status.im_wakuorg_nwaku:latest.log node3_2024-07-04_09-44-1719dc257d-3891-43f6-935a-6735eb6d5a7dharbor.status.im_wakuorg_nwaku:latest.log

nwaku version/commit hash

Custom Docker build from wakunode2-v0.30.1 on ARM/MacOS

Ivansete-status commented 1 month ago

Morning @jm-clius ! I considered this issue as not super urgent, given the low relevance that we give to the PX protocol. Let me know if I am right :) Cheers

jm-clius commented 1 month ago

Hey @Ivansete-status, I think we do prioritise peer exchange (nwaku as a server) for Status mobile clients. If this is an occasional failure (i.e. if simply retrying the peer exchange lookup works) or if this is not reported as an issue by js-waku or go-waku in light mode, I would say it's indeed simply a testing issue and has relatively low priority. @chaitanyaprem can perhaps confirm if there are any noticeable issues with peer exchange to the fleets from Status mobile.

romanzac commented 1 month ago

Hi, I went again to check if the multiaddr to dial is actually the correct one:

This is happening on Node3:

TRC 2024-07-09 07:56:50.312+00:00 Selecting peer from peerstore              topics="waku node peer_manager" tid=1 file=peer_manager.nim:898 protocol=/vac/waku/peer-exchange/2.0.0-alpha1
TRC 2024-07-09 07:56:50.312+00:00 Got peer from service slots                topics="waku node peer_manager" tid=1 file=peer_manager.nim:918 peerId=16U*nSEExx multi=/ip4/172.18.34.74/tcp/31556 protocol=/vac/waku/peer-exchange/2.0.0-alpha1
TRC 2024-07-09 07:56:50.312+00:00 Adding newly dialed peer to manager        topics="waku node peer_manager" tid=1 file=peer_manager.nim:610 peerId=16Uiu2HAmAb95MCBXjnuLYUbZAf2kteyq3nEYr7wJvZWmR2nSEExx address=/ip4/172.18.34.74/tcp/31556 proto=/vac/waku/peer-exchange/2.0.0-alpha1
TRC 2024-07-09 07:56:50.312+00:00 Adding peer to manager                     topics="waku node peer_manager" tid=1 file=peer_manager.nim:135 peerId=16U*nSEExx addresses=@[/ip4/172.18.34.74/tcp/31556]
TRC 2024-07-09 07:56:50.312+00:00 Dialing peer                               topics="waku node peer_manager" tid=1 file=peer_manager.nim:237 wireAddr=@[/ip4/172.18.34.74/tcp/31556] peerId=16U*nSEExx proto=/vac/waku/peer-exchange/2.0.0-alpha1
TRC 2024-07-09 07:56:50.312+00:00 Dialing (new)                              topics="libp2p dialer" tid=1 file=dialer.nim:318 peerId=16U*nSEExx protos="@[\"/vac/waku/peer-exchange/2.0.0-alpha1\"]"
TRC 2024-07-09 07:56:50.312+00:00 connection not found                       topics="libp2p connmanager" tid=1 file=connmanager.nim:280 peerId=16U*nSEExx
TRC 2024-07-09 07:57:00.347+00:00 Releasing slot                             topics="libp2p semaphore" tid=1 file=semaphore.nim:82 available=47 queue=0
TRC 2024-07-09 07:57:00.347+00:00 Released slot                              topics="libp2p semaphore" tid=1 file=semaphore.nim:94 available=48 queue=0
TRC 2024-07-09 07:57:00.347+00:00 Dial canceled                              topics="libp2p dialer" tid=1 file=dialer.nim:329 conn=nil
TRC 2024-07-09 07:57:00.347+00:00 Dialing peer failed                        topics="waku node peer_manager" tid=1 file=peer_manager.nim:250 peerId=16U*nSEExx reason="timed out" proto=/vac/waku/peer-exchange/2.0.0-alpha1
WRN 2024-07-09 07:57:00.347+00:00 failed to retrieve peer info via peer exchange protocol topics="waku node" tid=1 file=waku_node.nim:1111 error=dial_failure
ERR 2024-07-09 07:57:00.347+00:00 error while fetching peers from peer exchange tid=1 file=node_factory.nim:360 error="Peer exchange failure: dial_failure"

Corresponding REST API request to Node1 to get the peer store. It contains Node2's info which is then provided to Node3 during startup.

09:56:50 src.node.api_clients.base_client INFO curl -v -X GET "http://127.0.0.1:7055/admin/v1/peers" -H "Content-Type: application/json" -d 'None'
09:56:50 src.node.api_clients.base_client INFO Response status code: 200. Response content: b'[{"multiaddr":"/ip4/172.18.34.74/tcp/31556/p2p/16Uiu2HAmAb95MCBXjnuLYUbZAf2kteyq3nEYr7wJvZWmR2nSEExx","protocols":[{"protocol":"/vac/waku/relay/2.0.0","connected":true}]}]'
09:56:50 src.libs.common DEBUG Node1 peer ID 16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip
09:56:50 src.libs.common DEBUG Node2 peer ID 16Uiu2HAmAb95MCBXjnuLYUbZAf2kteyq3nEYr7wJvZWmR2nSEExx

And finally

 addresses=@[/ip4/172.18.34.74/tcp/31556] <=> {"multiaddr":"/ip4/172.18.34.74/tcp/31556/p2p/16Uiu2HAmAb95MCBXjnuLYUbZAf2kteyq3nEYr7wJvZWmR2nSEExx

node1_2024-07-09_09-56-46095ff8e5-6bc5-4156-81bc-8cc2153afd91harbor.status.im_wakuorg_nwaku:latest.log node2_2024-07-09_09-56-46095ff8e5-6bc5-4156-81bc-8cc2153afd91harbor.status.im_wakuorg_nwaku:latest.log node3_2024-07-09_09-56-46095ff8e5-6bc5-4156-81bc-8cc2153afd91harbor.status.im_wakuorg_nwaku:latest.log test.log

romanzac commented 1 month ago

Nwaku as Node1 and Gowaku as Node2 (responder) combination works. Gowaku as Node1 and Nwaku as Node2 (responder) combination doesn't work.

romanzac commented 1 month ago

Hey @Ivansete-status, I think we do prioritise peer exchange (nwaku as a server) for Status mobile clients. If this is an occasional failure (i.e. if simply retrying the peer exchange lookup works) or if this is not reported as an issue by js-waku or go-waku in light mode, I would say it's indeed simply a testing issue and has relatively low priority. @chaitanyaprem can perhaps confirm if there are any noticeable issues with peer exchange to the fleets from Status mobile.

I am kind of happy to not having peer exchange working. Peer exchange and peer & connection management through REST API are aid features for Discv5 not being good enough. Could we evolve to "discv6" being able to deal with lightweight discovery and bootstrap scenarios well ? Especially the part when outside intervention is needed is my concern. If I am attacker I would start from these two features to destabilize running network.

I'm ready to help with discv6 simulations.

jm-clius commented 1 month ago

Especially the part when outside intervention is needed is my concern.

While I understand your concerns about the REST API for peer management, I'm not sure I see the relation here with the peer exchange protocol. Peer exchange is simply a req-resp protocol where a service node is requested to return a random sampling from the discv5 dht. There are of course the usual security concerns as with all req-resp protocols, but I don't see this as necessarily tied to similar concerns about gaining access to a node's local Admin API?

I'm ready to help with discv6 simulations.

:D Such a protocol will be great. The next development in discv5 would likely be introducing topic advertisements to improve efficiency, although this seems to be a particularly tough research problem.

romanzac commented 1 month ago

Especially the part when outside intervention is needed is my concern.

While I understand your concerns about the REST API for peer management, I'm not sure I see the relation here with the peer exchange protocol. Peer exchange is simply a req-resp protocol where a service node is requested to return a random sampling from the discv5 dht. There are of course the usual security concerns as with all req-resp protocols, but I don't see this as necessarily tied to similar concerns about gaining access to a node's local Admin API?

I'm ready to help with discv6 simulations.

:D Such a protocol will be great. The next development in discv5 would likely be introducing topic advertisements to improve efficiency, although this seems to be a particularly tough research problem.

If I remember correctly we need "REST API for peer management" to have possibility to add node to peer store to speedup network bootstrap process. And the reason is Discv5 takes very long to do it for us when we need tests to run quickly. Am I missing something ?

romanzac commented 1 month ago

Especially the part when outside intervention is needed is my concern.

While I understand your concerns about the REST API for peer management, I'm not sure I see the relation here with the peer exchange protocol. Peer exchange is simply a req-resp protocol where a service node is requested to return a random sampling from the discv5 dht. There are of course the usual security concerns as with all req-resp protocols, but I don't see this as necessarily tied to similar concerns about gaining access to a node's local Admin API?

I'm ready to help with discv6 simulations.

:D Such a protocol will be great. The next development in discv5 would likely be introducing topic advertisements to improve efficiency, although this seems to be a particularly tough research problem.

When I personally see non linear reward for my time, I am hooked :) Better DiscvX might reduce possible attack surface, and steal work for QAs and integrators. It might not be necessary to change discv5 engine, maybe just move those "aid features" closer to it, and remove external access to them ? Converting security recommendations into more opinionated code. Not sure I am clear here...

I recall combustion engine needed someone to actually start it physically with handle. Later electronic starter was used to everyone's satisfaction. Engine design was kept intact.

jm-clius commented 1 month ago

"REST API for peer management" to have possibility to add node to peer store to speedup network bootstrap process. And the reason is Discv5 takes very long to do it for us when we need tests to run quickly. Am I missing something

Not quite. I'm sure it can be used in some test scenarios for this purpose, but its purpose is more general: as an admin of my node I want to be able to connect to a user-specified peer. This is a common use case for p2p networks - it may be because I know of a peer that is not using discv5, because I'm running a local network or simply because I want to test connectivity to a specific address. In other words, if I want to quickly force connection to a list of peers that's known to me for whichever reason I can use this API, but it's not meant (or used currently) to replace/augment discv5 or other discovery methods.

romanzac commented 1 month ago

"REST API for peer management" to have possibility to add node to peer store to speedup network bootstrap process. And the reason is Discv5 takes very long to do it for us when we need tests to run quickly. Am I missing something

Not quite. I'm sure it can be used in some test scenarios for this purpose, but its purpose is more general: as an admin of my node I want to be able to connect to a user-specified peer. This is a common use case for p2p networks - it may be because I know of a peer that is not using discv5, because I'm running a local network or simply because I want to test connectivity to a specific address. In other words, if I want to quickly force connection to a list of peers that's known to me for whichever reason I can use this API, but it's not meant (or used currently) to replace/augment discv5 or other discovery methods.

For the normal operation of the network and its operator this feature is low frequency of usage and high impact. While for attacker it is high frequency and high impact. This asymmetry skewed in favor of an attacker is what troubles me. Sure attacker needs to first obtain access to the REST API port, local box. After that however, is unstoppable. No authentication, no authorization. Our achilles heel is more vulnerable than it is the case for traditional server based services e.g. DB engines, K8S. They all have well protected service ports, if that is what we can compare our admin API endpoint to.

Surely REST API calls to add peer don't replace/augment discv5 or other discovery methods, but they modify content of the peer store. All discovery methods purpose is exactly the same. Except API call does that without any sensitivity or inteligence.

Don't know if that would be too daring to try, simply remove those endpoints for few days/weeks and evaluate after ? Prod fleet for Status. What is the worst case, network not operational ? What lessons we might learn ? We are still beta.