waku-org / nwaku

Waku node and protocol.
Other
199 stars 52 forks source link

bug: the node gets silently blocked when there is no connection with the postgres database #2272

Closed Ivansete-status closed 8 months ago

Ivansete-status commented 9 months ago

Problem

The node doesn't start and it gets stuck without giving clear evidence of the error. That happened when using the status.test nodes with postgres configured and for any reason, the firewall rule was broken and therefore, the waku node didn't have access to the database host.

Impact

Critical

To reproduce

If you can reproduce the behavior, steps to reproduce:

  1. Start postgres container in machine A (this is optional because the bug happened when no visibility happened.)
  2. Start nwaku in machine B and configure it as Store node and point the Postgres database to the host A.
  3. Disallow connectivity between A and B.
  4. Try to start nwaku. It will get stuck.

Expected behavior

The waku node should stop immediately stating that it cannot connect to the database server.

Screenshots/logs

DBG 2023-12-05 13:13:05.034+00:00 1/7 Setting up storage                     topics="wakunode main" tid=1 file=wakunode2.nim:63
DBG 2023-12-05 13:13:05.038+00:00 2/7 Retrieve dynamic bootstrap nodes       topics="wakunode main" tid=1 file=wakunode2.nim:71
DBG 2023-12-05 13:13:05.038+00:00 Discovering nodes using Waku DNS discovery topics="wakunode app" tid=1 file=app.nim:188 url=enrtree://AIO6LUM3IVWCU2KCPBBI6FEH2W42IGK3ASCZHZGG5TIXUR56OGQUO@test.status.nodes.status.im
DBG 2023-12-05 13:13:05.038+00:00 init WakuDnsDiscovery                      topics="wakunode app" tid=1 file=waku_dnsdisc.nim:92 locationUrl=enrtree://AIO6LUM3IVWCU2KCPBBI6FEH2W42IGK3ASCZHZGG5TIXUR56OGQUO@test.status.nodes.status.im
DBG 2023-12-05 13:13:05.039+00:00 init success                               topics="wakunode app" tid=1 file=waku_dnsdisc.nim:98
INF 2023-12-05 13:13:05.039+00:00 Finding peers using Waku DNS discovery     topics="waku dnsdisc" tid=1 file=waku_dnsdisc.nim:51
INF 2023-12-05 13:13:05.127+00:00 Successfully discovered ENR                topics="waku dnsdisc" tid=1 file=waku_dnsdisc.nim:65 count=3
INF 2023-12-05 13:13:05.134+00:00 Successfully discovered nodes              topics="waku dnsdisc" tid=1 file=waku_dnsdisc.nim:82 count=3
DBG 2023-12-05 13:13:05.134+00:00 3/7 Initializing node                      topics="wakunode main" tid=1 file=wakunode2.nim:78
DBG 2023-12-05 13:13:05.137+00:00 no peer storage found                      tid=1 file=peer_manager.nim:487
INF 2023-12-05 13:13:05.137+00:00 Initializing networking                    tid=1 file=waku_node.nim:145 addrs="@[/dns4/node-01.do-ams3.status.test.statusim.net/tcp/30303, /dns4/node-01.do-ams3.status.test.statusim.net/tcp/443/wss]"
DBG 2023-12-05 13:13:05.138+00:00 no relay sharding information, peer filtering disabled topics="waku discv5" tid=1 file=waku_discv5.nim:62
DBG 2023-12-05 13:13:05.138+00:00 4/7 Mounting protocols                     topics="wakunode main" tid=1 file=wakunode2.nim:85
INF 2023-12-05 13:13:05.138+00:00 Created WakuMetadata protocol              topics="waku node" tid=1 file=protocol.nim:115 clusterId=0
INF 2023-12-05 13:13:05.139+00:00 mounting relay protocol                    topics="waku node" tid=1 file=waku_node.nim:378
INF 2023-12-05 13:13:05.139+00:00 relay mounted successfully                 topics="waku node" tid=1 file=waku_node.nim:397
DBG 2023-12-05 13:13:05.140+00:00 subscribe                                  topics="waku node" tid=1 file=waku_node.nim:271 pubsubTopic=/waku/2/default-waku-test/proto
DBG 2023-12-05 13:13:05.140+00:00 subscribe                                  topics="waku relay" tid=1 file=protocol.nim:218 pubsubTopic=/waku/2/default-waku-test/proto
DBG 2023-12-05 13:13:05.140+00:00 subscribe                                  topics="waku node" tid=1 file=waku_node.nim:271 pubsubTopic=/waku/2/status-signed-test-1
DBG 2023-12-05 13:13:05.141+00:00 subscribe                                  topics="waku relay" tid=1 file=protocol.nim:218 pubsubTopic=/waku/2/status-signed-test-1
NOT 2023-12-05 13:13:05.141+00:00 routing only signed traffic                topics="wakunode app" tid=1 file=app.nim:451 protectedTopic=/waku/2/status-signed-test-1 publicKey=04070335f90e017ec1d7cb9a8e53d084dcf2c937cccb4272ed902881641613c6e0d46cf5531eee7ff8076616a2b8ac3436f71568f7a7032478fba3cac04c90e4e6
DBG 2023-12-05 13:13:05.141+00:00 adding validator to signed topic           tid=1 file=wakunode2_validator_signed.nim:53 topic=/waku/2/status-signed-test-1 publicTopicKey=04070335f90e017ec1d7cb9a8e53d084dcf2c937cccb4272ed902881641613c6e0d46cf5531eee7ff8076616a2b8ac3436f71568f7a7032478fba3cac04c90e4e6
INF 2023-12-05 13:13:05.141+00:00 mounting rendezvous discovery protocol     topics="waku node" tid=1 file=waku_node.nim:1082
INF 2023-12-05 13:13:05.142+00:00 mounting libp2p ping protocol              topics="waku node" tid=1 file=waku_node.nim:1040

nwaku version/commit hash

Commit 11e0e1cf2820345dede50d5f3da3d43e5a9587b3 , feature/status-test-dogfooding branch

Ivansete-status commented 8 months ago

The moment the node got blocked it wasn't because of the Postgres connection being stopped. It happened due to other reasons, that I didn't capture properly.

I tried to replicate the issue in the following way:

  1. Start the docker container with Postgres database.

  2. Start the nwaku node which establishes a connection with the database.

  3. Disconnect the Postgres docker container from the network (command: docker network disconnect nwaku_default 66172f2e7dac.

  4. After this, the nwaku node stops immediately with the following message:

    ERR 2024-01-10 20:24:37.053+01:00 Unrecoverable error occurred topics="wakunode app" tid=437665 file=app.nim:498 error="postgres health check error: error in asyncpool query: error in dbConnQuery calling waitQueryToFinish: error in query: FATAL: terminating connection due to administrator command\nserver closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n"

  5. Then, if I try to restart the node again, it clearly fails with the error:

    ERR 2024-01-10 20:47:39.887+01:00 4/7 Mounting protocols failed topics="wakunode main" tid=441078 file=wakunode2.nim:89 error="failed to setup archive driver: error creating table: error in createMessageTable: connRes.isErr in query: failed to stablish a new connection: exception opening new connection: connection to server at \"localhost\" (127.0.0.1), port 5432 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?\n"


I will close this issue because I couldn't replicate it. I'll reopen it again, with better detail, if this happens again.

( cc @chair28980 )