waku-org / nwaku

Waku node and protocol.

bug: node won't start with RLN in on-chain dynamic mode #2662

Open romanzac opened 2 weeks ago

romanzac commented 2 weeks ago

Problem

During execution of "test_publish_with_valid_payloads_dynamic_at_slow_rate", the first container running Node1 does not start, even within the 10-minute timeout. Static mode works fine. Please have a look.

I built the Docker image from release wakunode2-v0.27.0.

Impact

High occurrence, medium severity. RLN in dynamic mode is not functional.

To reproduce

  1. Check out https://github.com/waku-org/waku-interop-tests/pull/30/commits/73d09518c77923c29a88a2373a82ff7c5f8c3bab
  2. cd waku-interop-tests
  3. python -m venv .venv
  4. source .venv/bin/activate
  5. pip install -r requirements.txt
  6. pre-commit install
  7. pytest tests/relay/test_rln.py -k 'test_publish_with_valid_payloads_dynamic_at_slow_rate'

Expected behavior

RLN Relay works in on-chain mode.

Screenshots/logs

node1_2024-05-03_12-45-385a5531e5-c304-462b-a7f3-f58ba92d0a0bharbor.status.im_wakuorg_nwaku:latest.log

node1_2024-05-03_12-45-38c9737de4-6df4-4f4c-b308-323968164308harbor.status.im_wakuorg_nwaku:latest.log

node1_2024-05-03_12-46-0196df9f86-46da-4444-8b3a-7b6d0e7e060aharbor.status.im_wakuorg_nwaku:latest.log

node2_2024-05-03_12-45-38c9737de4-6df4-4f4c-b308-323968164308harbor.status.im_wakuorg_nwaku:latest.log

test_run.log

rymnc commented 2 weeks ago

the second log you shared seems to show the node has finished syncing - does the /health endpoint still not return true?

rymnc commented 2 weeks ago

the fourth log shows the node has not finished syncing

rymnc commented 2 weeks ago

on my own node:

[image]

the health endpoint seems to flip between "not ready" and "healthy" in a flaky manner. cc: @NagyZoltanPeter

rymnc commented 2 weeks ago

I see in your logs that we use debug/v1/info in the test - which results in

Response status code: 200. Response content: b'{"listenAddresses":["/ip4/172.18.205.233/tcp/6886/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip","/ip4/172.18.205.233/tcp/6887/ws/p2p/16Uiu2HAmGNtM2rQ8abySFNhqPDFY4cmfAEpfo9Z9fD3NekoFR2ip"],"enrUri":"enr:-LO4QGGlww8liwBmcHFHdLcXwt-Uq0c6iU6cdDJ6pWlh2avnWILMdWa9P_iCS0kiWhLuECjRTMvxoykPXyP5sKjcx88BgmlkgnY0gmlwhKwSzemKbXVsdGlhZGRyc4wACgSsEs3pBhrn3QOCcnOFAAABAACJc2VjcDI1NmsxoQM3Tqpf5eFn4Jztm4gB0Y0JVSJyxyZsW8QR-QU5DZb-PYN0Y3CCGuaDdWRwghrohXdha3UyAQ"}'
INFO     src.node.waku_node:waku_node.py:193 REST service is ready !!

does node1 not respond after this is available?

rymnc commented 2 weeks ago

testing with https://github.com/waku-org/nwaku/pull/2664

NagyZoltanPeter commented 2 weeks ago

> on my own node - [image] seems to be the health endpoint returning not ready and healthy in a flaky manner. cc: @NagyZoltanPeter

It's a known issue, already solved in https://github.com/waku-org/nwaku/pull/2612. It will be part of the next release!

rymnc commented 2 weeks ago

awesome, i might be missing something, but does the pr address the node healthy and not ready flakiness?

rymnc commented 2 weeks ago

I suggest using this image: quay.io/wakuorg/nwaku-pr:2664-rln-v1

NagyZoltanPeter commented 2 weeks ago

> awesome, i might be missing something, but does the pr address the node healthy and not ready flakiness?

Indeed! It separates the node's general readiness (initialization) from the status of rln_relay. There is an array of protocol statuses (to be improved in the future) where you can now see the rln_relay status. For now that status still changes from ready to synchronizing and back, following the actual state. For example:

{
    "nodeHealth": "Ready",
    "protocolsHealth": [
        {
            "Rln Relay": "Ready"
        }
    ]
}
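
A minimal sketch of how a test could interpret a response of this shape, assuming the body looks exactly like the sample above. The field names are taken from that sample; the function name `rln_relay_ready` is hypothetical and not part of nwaku or the interop tests.

```python
import json


def rln_relay_ready(health_body: str) -> bool:
    """Return True only if the node and its RLN relay both report "Ready".

    Assumes the response shape shown in the sample above.
    """
    health = json.loads(health_body)
    if health.get("nodeHealth") != "Ready":
        return False
    # protocolsHealth is a list of single-key objects, e.g. {"Rln Relay": "Ready"}.
    for entry in health.get("protocolsHealth", []):
        if entry.get("Rln Relay") == "Ready":
            return True
    return False


sample = '{"nodeHealth": "Ready", "protocolsHealth": [{"Rln Relay": "Ready"}]}'
print(rln_relay_ready(sample))  # prints True
```

Checking both fields matters here because the node can report ready while rln_relay is still synchronizing.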

rymnc commented 2 weeks ago

wow, nice 🔥

romanzac commented 2 weeks ago

Good progress on health info indeed! I'll wait for the next release to enable on-chain tests, and I will try to play with the /health endpoint on Monday. Thanks for now.

NagyZoltanPeter commented 1 week ago

@rymnc: Hi, about the original issue: I don't see the actual problem in the logs. The first log tells me the node started fine; it took almost 7 minutes to sync the on-chain RLN blocks. The second node's log ends while it is still syncing, but only 3 minutes after startup. Maybe the start sequence needs to be checked: it seems the nodes were started with a delay between them, while the 10-minute timeout, I think, is counted from the first node's start. Can you please check this scenario? You may need to extend the timeout to give the second node enough time to sync.
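
To illustrate the scenario described above, here is a rough sketch of a wait helper whose deadline starts when the wait for a given node starts, rather than being shared from the first node's start. The helper name `wait_until_ready` and its parameters are hypothetical; this is not how the interop tests are known to implement their timeout.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    timeout_s: float,
    poll_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    The deadline is computed when this call begins, so a node that is
    started later gets its own full time budget.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

With a helper like this, a test would call `wait_until_ready(...)` once per node right after starting that node, so a second node that needs several minutes to sync is not cut short by time already spent waiting on the first.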

romanzac commented 1 week ago

Adding logs here from yesterday's testing of PR 2664. I have also started to work on integrating a node health check into the interop tests: https://github.com/waku-org/waku-interop-tests/pull/35. Hopefully the next nwaku release and matching interop tests for RLN can mature at the same time. Say no to QA lag! :)

test.log

node1_2024-05-07_21-59-067729a733-8a72-4309-8480-58237116b75dquay.io_wakuorg_nwaku-pr:2664-rln-v1.log

node1_2024-05-07_21-59-062a690f5f-9983-4913-91a0-ddfdd21352acquay.io_wakuorg_nwaku-pr:2664-rln-v1.log

node1_2024-05-07_21-59-07d4850f44-b5e1-4704-aa2b-83486e800626quay.io_wakuorg_nwaku-pr:2664-rln-v1.log

node2_2024-05-07_21-59-067729a733-8a72-4309-8480-58237116b75dquay.io_wakuorg_nwaku-pr:2664-rln-v1.log