waku-org / nwaku

Waku node and protocol.

bug: failed to store peers to persistent storage #2792

Open · romanzac opened this issue 4 weeks ago

romanzac commented 4 weeks ago

Problem

During execution of the interop test test_use_persistent_storage_survive_restart, the docker container reported an error:

WRN 2024-06-07 19:05:37.208+00:00 failed to store peers topics="waku node peer_manager" tid=1 file=peer_manager.nim:117 err="peer info encoding failed: Enncoding public key failed: KeyError"

Database files to store peers were created:

-rw-r--r-- 1 root root 4096 Jun 8 07:25 peers.db
-rw-r--r-- 1 root root 32768 Jun 8 07:25 peers.db-shm
-rw-r--r-- 1 root root 8272 Jun 8 07:25 peers.db-wal

Impact

Not clear yet. It is not certain whether the peers were actually stored or not. The test relies on the assumption that peer storage is reliable.
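One way to get that confirmation: peers.db is a SQLite database (the -shm and -wal companion files indicate WAL mode), so its tables and row counts can be inspected directly. A minimal sketch, with table names discovered at runtime rather than assumed:

```python
# Inspect peers.db inside the container (or a copy of it) to see whether any
# peer records were actually written. Table names are listed from
# sqlite_master instead of being assumed.
import sqlite3

con = sqlite3.connect("peers.db")
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
for table in tables:
    count = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    print(f"{table}: {count} rows")
con.close()
```

Non-zero row counts after a restart would indicate the peers were in fact persisted despite the warning.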

To reproduce

  1. Check out https://github.com/waku-org/waku-interop-tests/pull/45/commits/909274d7b0b95899c12649d2155aac363c69286e
  2. cd waku-interop-tests
  3. python -m venv .venv
  4. source .venv/bin/activate
  5. pip install -r requirements.txt
  6. pre-commit install
  7. Edit src/env_vars.py so that NODE_1 = get_env_var("NODE_1", DEFAULT_NWAKU)
  8. pytest tests/peer_connection_management/test_peer_store.py -k 'test_use_persistent_storage_survive_restart'

Expected behavior

Confirmation whether peers were actually stored or not. It would also be good to have a non-root folder as the destination for the database files. Another interesting idea is to start a fresh node process over shared storage and kill the unresponsive node process.

Screenshots/logs

node1_2024-06-07_21-05-34.log

nwaku version

wakuorg/nwaku: wakunode2-v0.28.0

romanzac commented 4 weeks ago

Hi @SionoiS, the idea for my test was borrowed from https://github.com/waku-org/nwaku/issues/2265

SionoiS commented 3 weeks ago

The encoding is a bug for sure but I suspect that it would work if there was a wait before encoding peer info.

romanzac commented 3 weeks ago

> The encoding is a bug for sure but I suspect that it would work if there was a wait before encoding peer info.

Let me try adding a wait and check the peer store content.

romanzac commented 3 weeks ago

@SionoiS I can confirm it works for me. I've built a custom Waku docker image where:

const PeerPersistenceDbUrl = "peers.db"  ->  const PeerPersistenceDbUrl = "/shared/peers.db"

Changed my test https://github.com/waku-org/waku-interop-tests/pull/45/commits/4b19aeccd3cf4446168642fafbd49513958717e2

Got result:

20:28:30 test_peer_store DEBUG Node 1 connected peers before [{'multiaddr': '/ip4/172.18.144.219/tcp/23874/p2p/16Uiu2HAmPBw66YmKyFB35DHwhtHuRmSWxbeCci8p1qZYKoh4w9CH', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
20:28:30 test_peer_store DEBUG Node 2 connected peers before [{'multiaddr': '/ip4/172.18.217.140/tcp/2791/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
20:28:31 test_peer_store DEBUG Node 1 connected peers after [{'multiaddr': '/ip4/172.18.196.71/tcp/12597/p2p/16Uiu2HAmF6vU3TqKdqxd5pynaUTu4hMThz94h7YfYRcdSqnGK353', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}, {'multiaddr': '/ip4/172.18.144.219/tcp/23874/p2p/16Uiu2HAmPBw66YmKyFB35DHwhtHuRmSWxbeCci8p1qZYKoh4w9CH', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
20:28:31 test_peer_store DEBUG Node 2 connected peers after [{'multiaddr': '/ip4/172.18.217.140/tcp/2791/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
20:28:31 test_peer_store DEBUG Node 3 connected peers after [{'multiaddr': '/ip4/172.18.217.140/tcp/2791/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]

Node 2 and Node 3 have the same peer in the peer store after the soft takeover. The last step could be to kill the Node 2 process.

I suggest that small change to the peer store path as a fix. It could open up new possibilities.

Perhaps there are other things needed for a graceful takeover that I have missed. What do you think?
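For the shared-storage part, a minimal docker SDK sketch of how both the original node and its replacement could see the same /shared directory via a host volume mount. The image tag is the one from this issue; the host path and the omitted wakunode2 flags are placeholders, not the interop suite's actual setup:

```python
# Mount one host directory at /shared in the container so that
# /shared/peers.db (from the custom image above) outlives any single container.
import docker

client = docker.from_env()
shared_volume = {"/tmp/waku-shared": {"bind": "/shared", "mode": "rw"}}  # host path is arbitrary

node = client.containers.run(
    "wakuorg/nwaku:wakunode2-v0.28.0",  # or the custom image built with /shared/peers.db
    detach=True,
    volumes=shared_volume,
    # plus whatever wakunode2 flags the interop tests normally pass
)
```

A takeover node would be started the same way with the same shared_volume mapping, which is what the scenario above needs.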

SionoiS commented 3 weeks ago

> @SionoiS I can confirm it works for me. I've built a custom Waku docker image where:
>
> const PeerPersistenceDbUrl = "peers.db"  ->  const PeerPersistenceDbUrl = "/shared/peers.db"

Nwaku should have a configurable PeerPersistence DB path. I'll add an issue for this.

> Perhaps there are other things needed for a graceful takeover that I have missed. What do you think?

I feel like the test could be simpler if only one node was started and peers were added via add_peers, no? In my mind, the steps would be: start node, add peers, stop node, start node, check DB. WDYT?

What do you mean by "graceful takeover"?
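A rough pytest-style sketch of that simplified flow (start node, add peers, stop node, start node, check). The start_node/stop_node fixtures are hypothetical stand-ins for whatever the interop suite provides, and the /admin/v1/peers REST path plus the short sleep before shutdown are assumptions:

```python
# Hypothetical sketch: single node, add a peer, restart, check it survived.
# start_node/stop_node are placeholder fixtures, not the real interop API.
import time
import requests

REST = "http://127.0.0.1:8645"  # assumed REST port of the node under test
PEER = ("/ip4/172.18.144.219/tcp/23874/p2p/"
        "16Uiu2HAmPBw66YmKyFB35DHwhtHuRmSWxbeCci8p1qZYKoh4w9CH")

def test_peer_store_survives_restart(start_node, stop_node):
    node = start_node(peer_persistence=True)          # hypothetical fixture
    requests.post(f"{REST}/admin/v1/peers", json=[PEER]).raise_for_status()
    time.sleep(2)                                     # give the peer manager time to persist
    stop_node(node)

    node = start_node(peer_persistence=True)          # restart over the same DB
    peers = requests.get(f"{REST}/admin/v1/peers").json()
    assert any(p["multiaddr"] == PEER for p in peers)
```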

romanzac commented 3 weeks ago

> > @SionoiS I can confirm it works for me. I've built a custom Waku docker image where:
> >
> > const PeerPersistenceDbUrl = "peers.db"  ->  const PeerPersistenceDbUrl = "/shared/peers.db"
>
> Nwaku should have a configurable PeerPersistence DB path. I'll add an issue for this.
>
> > Perhaps there are other things needed for a graceful takeover that I have missed. What do you think?
>
> I feel like the test could be simpler if only one node was started and peers were added via add_peers, no? In my mind, the steps would be: start node, add peers, stop node, start node, check DB. WDYT?
>
> What do you mean by "graceful takeover"?

I've rewritten my test into what I would like it to be, so it simulates a real-world use case. I realized it was not working even at the beginning. When Node 3 starts with the peer storage of Node 1, it should contain the peers known to Node 1, but instead it contains the peers known to Node 2. Strange.

New test version at https://github.com/waku-org/waku-interop-tests/pull/45/commits/0cdeb6a50a66d870cd30559901971ac287779200

romanzac commented 3 weeks ago

> @SionoiS I can confirm it works for me. I've built a custom Waku docker image where:
>
> const PeerPersistenceDbUrl = "peers.db"  ->  const PeerPersistenceDbUrl = "/shared/peers.db"
>
> Changed my test waku-org/waku-interop-tests@4b19aec
>
> Got result:
>
> 20:28:30 test_peer_store DEBUG Node 1 connected peers before [{'multiaddr': '/ip4/172.18.144.219/tcp/23874/p2p/16Uiu2HAmPBw66YmKyFB35DHwhtHuRmSWxbeCci8p1qZYKoh4w9CH', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
> 20:28:30 test_peer_store DEBUG Node 2 connected peers before [{'multiaddr': '/ip4/172.18.217.140/tcp/2791/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
> 20:28:31 test_peer_store DEBUG Node 1 connected peers after [{'multiaddr': '/ip4/172.18.196.71/tcp/12597/p2p/16Uiu2HAmF6vU3TqKdqxd5pynaUTu4hMThz94h7YfYRcdSqnGK353', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}, {'multiaddr': '/ip4/172.18.144.219/tcp/23874/p2p/16Uiu2HAmPBw66YmKyFB35DHwhtHuRmSWxbeCci8p1qZYKoh4w9CH', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
> 20:28:31 test_peer_store DEBUG Node 2 connected peers after [{'multiaddr': '/ip4/172.18.217.140/tcp/2791/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
> 20:28:31 test_peer_store DEBUG Node 3 connected peers after [{'multiaddr': '/ip4/172.18.217.140/tcp/2791/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
>
> Node 2 and Node 3 have the same peer in the peer store after the soft takeover. The last step could be to kill the Node 2 process.
>
> I suggest that small change to the peer store path as a fix. It could open up new possibilities.
>
> Perhaps there are other things needed for a graceful takeover that I have missed. What do you think?

Here Node 3 should contain the same info as Node 1, not Node 2.
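A small sketch of the check implied here, comparing peer stores by peer ID (the trailing /p2p/ component of the multiaddrs in the logs above); the input lists are the same dictionaries the test already logs:

```python
# Compare peer stores by peer ID: after the takeover, Node 3 should hold the
# peers Node 1 knew, not the peers Node 2 knew.
def peer_ids(peers):
    """Extract peer IDs from a list of {'multiaddr': ...} entries."""
    return {p["multiaddr"].rsplit("/p2p/", 1)[-1] for p in peers}

def assert_takeover_restored_node1(node1_peers_before, node3_peers_after):
    assert peer_ids(node3_peers_after) == peer_ids(node1_peers_before), (
        "Node 3 restored Node 2's peers instead of Node 1's"
    )
```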

romanzac commented 3 weeks ago

Results with Node 1 being killed after its backup (Node 3) has started:

Before:

DEBUG    test_peer_store:test_peer_store.py:81 Node 1 peerID 16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip
DEBUG    test_peer_store:test_peer_store.py:82 Node 2 peerID 16Uiu2HAmKR1pRT4DbSvHbuhp6cCwame9WTRuqw7wfywnTmijgt86
DEBUG    test_peer_store:test_peer_store.py:83 Node 1 connected peers before [{'multiaddr': '/ip4/172.18.233.183/tcp/21203/p2p/16Uiu2HAmKR1pRT4DbSvHbuhp6cCwame9WTRuqw7wfywnTmijgt86', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]
DEBUG    test_peer_store:test_peer_store.py:84 Node 2 connected peers before [{'multiaddr': '/ip4/172.18.75.179/tcp/60540/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': True}]}]

After:

DEBUG    test_peer_store:test_peer_store.py:93 Node 2 connected peers after [{'multiaddr': '/ip4/172.18.75.179/tcp/60540/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': False}]}]
DEBUG    test_peer_store:test_peer_store.py:94 Node 3 connected peers after [{'multiaddr': '/ip4/172.18.75.179/tcp/60540/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip', 'protocols': [{'protocol': '/vac/waku/relay/2.0.0', 'connected': False}]}]

SionoiS commented 3 weeks ago

Ah I understand now! Yes that would be a bug for sure. At least we know it does not work...

romanzac commented 3 weeks ago

> Ah I understand now! Yes that would be a bug for sure. At least we know it does not work...

Let me know if I can help test anything related to this issue in the meantime.