waku-org / nwaku

Waku node and protocol.

bug: message not delivered during interop test #2369

Closed romanzac closed 2 weeks ago

romanzac commented 5 months ago

Problem

On rare occasions, a message is not delivered when a node is shut down during the test sequence.

Impact

Low occurrence, medium impact. The node may shut down before the message is sent/published to other nodes.

To reproduce

Run the interop tests at https://github.com/waku-org/waku-interop-tests/actions/workflows/go_waku_daily.yml. A similar issue also happened with nwaku during execution of test_publish_after_node1_restarts: https://github.com/waku-org/waku-interop-tests/actions/workflows/nim_waku_daily.yml

Expected behavior

A node which is shutting down should reject new messages. If a message gets stuck, the node should retry delivery the next time it starts; I am assuming the shutdown is part of a restart. An open question is what to do when the node stays shut down for a long period of time. Are the messages lost?

Related to https://github.com/waku-org/go-waku/issues/1014

Screenshots/logs

(screenshot)

data_attachments_8d428324ec7625a3.txt

Ivansete-status commented 4 months ago

Thanks for submitting the issue @romanzac !

Given its random nature, it seems to be a timing issue. Notice that gossip-sub doesn't ensure message delivery and there is no native mechanism that allows the sender node to know whether its messages were properly sent. Maybe a naïve solution would be to increase a "wait" somewhere within the tests.

Could you kindly add some more detail about the node interaction? Maybe some kind of simple diagram where we can see the message flow, and what is expected.

On the other hand, I'm not very familiar with the interop tests, so instructions on how to run the test locally would be golden as well :) Thanks in advance!

romanzac commented 4 months ago

Related to https://github.com/waku-org/nwaku/issues/2388

romanzac commented 4 months ago

> Thanks for submitting the issue @romanzac !
>
> Given its random nature, it seems to be a timing issue. Notice that gossip-sub doesn't ensure message delivery and there is no native mechanism that allows the sender node to know whether its messages were properly sent. Maybe a naïve solution would be to increase a "wait" somewhere within the tests.
>
> Could you kindly add some more detail about the node interaction? Maybe some kind of simple diagram where we can see the message flow, and what is expected.
>
> On the other hand, I'm not very familiar with the interop tests, so instructions on how to run the test locally would be golden as well :) Thanks in advance!

I went through the test "test_publish_after_node1_restarts" and here are the steps (a rough sketch of this flow follows the list):

1) Bring up two Waku nodes (node1, node2) and warm them up
2) Move on to execute the actual test
3a) Send a message from node1 with the pubsub topic to the network
3b) Enter the loop to check whether the message can be received by all other nodes (node2)
3c) Restart node1
3d) Wait for node1's REST API to respond
3e) Check that subscriptions exist on all peers (node1, node2)
3f) Repeat 3a, 3b
3g) Repeat all steps in 3) five times
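
A rough Python sketch of that sequence, using the `requests` library. The addresses, REST paths and restart mechanics are assumptions based on the nwaku relay REST API as I recall it, not the actual waku-interop-tests helpers:

```python
import time
import requests
from urllib.parse import quote

NODE1 = "http://127.0.0.1:21161"   # assumed REST addresses of the two nodes
NODE2 = "http://127.0.0.1:21171"
TOPIC = quote("/waku/2/default-waku/proto", safe="")

def publish(node):
    # 3a / 3f: publish a message via the relay REST API (path is an assumption)
    body = {"payload": "aGVsbG8=", "contentTopic": "/test/1/interop/proto",
            "timestamp": time.time_ns()}
    requests.post(f"{node}/relay/v1/messages/{TOPIC}", json=body, timeout=10).raise_for_status()

def delivered_to(node):
    # 3b: poll the receiving node's relay message cache
    r = requests.get(f"{node}/relay/v1/messages/{TOPIC}", timeout=10)
    r.raise_for_status()
    return len(r.json()) > 0

def wait_for_rest(node, timeout=60):
    # 3d: the "known at the time" readiness check, i.e. any REST response counts
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            if requests.get(f"{node}/debug/v1/info", timeout=2).ok:
                return
        except requests.RequestException:
            pass
        time.sleep(1)
    raise TimeoutError("node1 REST API did not come back")

for _ in range(5):                  # 3g: repeat the cycle five times
    publish(NODE1)                  # 3a
    assert delivered_to(NODE2)      # 3b
    # 3c: restart node1 here (the real suite restarts its container)
    wait_for_rest(NODE1)            # 3d
    # 3e: subscription checks on node1 and node2 omitted for brevity
    publish(NODE1)                  # 3f
    assert delivered_to(NODE2)
```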

Our test looks a bit bumpy, yet I would not expect it to fail. We test REST API responsiveness as the best practice known at the time for determining that node1 is ready after a restart. I would raise two questions here:

1) If the REST API check is not sufficient, what is the proper way to determine with a high degree of certainty that a node is ready?
2) What else can an integrator do to achieve a high level of reliability (>99.9%) while interacting with a restarting Waku node?
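
As an illustration of question 1), a minimal sketch of a stricter readiness check. It assumes a /health endpoint, like the one discussed later in this thread, that gives a meaningful readiness signal rather than just "the HTTP server is up"; the path and response shape are assumptions:

```python
import time
import requests

def wait_until_ready(node: str, timeout: float = 60.0) -> None:
    """Poll the node's /health endpoint until it reports healthy, instead of
    treating any REST response as proof of readiness."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            r = requests.get(f"{node}/health", timeout=2)
            if r.ok:
                return
        except requests.RequestException:
            pass  # REST server not answering yet
        time.sleep(1)
    raise TimeoutError(f"{node} did not report healthy within {timeout}s")
```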

Please note this problem is not entirely technical. What we want in the end is stable app (frontend) behavior for the end user. It could be translated as the frontend <> Waku interaction being reliable enough to abstract away the possibly non-deterministic behavior of the Waku network. I would also include Status-go or other stakeholders in setting the reliability expectation, and after that I would move on to figuring out the technical details.

romanzac commented 4 months ago

Related to https://github.com/waku-org/docs.waku.org/issues/165

vpavlin commented 3 months ago

So if I understand correctly, the problem is that there is a potential race condition where the REST API is still available and responsive, but the node is already in the process of shutting down and potentially disconnected from peers, and hence cannot publish new messages (although the REST API still accepts them).

That sounds like the right way to fix this would be to change the order in which components are shut down: disable most REST API endpoints (apart from health?) and only then start the actual shutdown of the p2p layer?
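
A minimal sketch of that ordering (illustrative Python, not nwaku's actual Nim implementation): a shutting-down flag is set before the p2p layer is torn down, so a late REST publish gets an explicit error instead of being accepted and then silently lost.

```python
class Node:
    """Illustrative shutdown ordering only, not nwaku code."""
    def __init__(self):
        self.shutting_down = False

    def relay_publish(self, message) -> tuple[int, str]:
        # Stub standing in for the real gossipsub publish.
        return 200, "published"

    def handle_rest_publish(self, message) -> tuple[int, str]:
        if self.shutting_down:
            # Reject instead of accepting a message we can no longer route.
            return 503, "node is shutting down"
        return self.relay_publish(message)

    def shutdown(self):
        self.shutting_down = True  # 1. stop accepting new publishes first
        # 2. close the publish endpoints while keeping /health responsive
        # 3. only then tear down the p2p layer and the rest of the node

node = Node()
node.shutdown()
assert node.handle_rest_publish(b"late message")[0] == 503
```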

I don't think we can get to a 100% safe flow where no messages are lost, especially with non-graceful shutdowns like OOM kills. So my thoughts would be to:

  1. Make sure our health endpoint actually reflects the node's state (https://github.com/waku-org/nwaku/issues/2270, https://github.com/waku-org/nwaku/issues/2020)
  2. Make sure users understand the inherent unreliability of p2p and gossipsub communication (i.e. they know they might need to retry if something goes wrong); also, keep in mind that duplicate messages are (safely?) ignored, so if you resend the same message (resulting in the same hash, https://rfc.vac.dev/spec/14/#deterministic-message-hashing) it is simply deduplicated (a short sketch of that hashing follows this list)
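
A short sketch of the deterministic hashing idea in the spirit of the linked RFC 14 section; the exact field set, ordering and encoding are assumptions here and should be taken from the RFC itself:

```python
import hashlib

def message_hash(pubsub_topic: bytes, payload: bytes, content_topic: bytes,
                 meta: bytes = b"", timestamp_ns: int = 0) -> bytes:
    # Hash over the pubsub topic plus the message fields, per the RFC's approach.
    data = pubsub_topic + payload + content_topic + meta + timestamp_ns.to_bytes(8, "big")
    return hashlib.sha256(data).digest()

# A byte-identical resend (same payload, topics and timestamp) produces the same
# hash, so nodes that already saw the message can safely ignore the duplicate.
msg = (b"/waku/2/default-waku/proto", b"hello", b"/app/1/chat/proto", b"", 1700000000000000000)
assert message_hash(*msg) == message_hash(*msg)
```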

I would not make the node persist messages and automatically retry after a restart; that should be the job of some higher-level abstraction, in my opinion.
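
A minimal sketch of what such a higher-level abstraction could look like on the integrator side; `publish_fn` is a placeholder for whatever publish call the app actually uses:

```python
import time

def publish_with_retry(publish_fn, message, attempts: int = 3, backoff_s: float = 1.0) -> None:
    """Hypothetical app-side wrapper: retry the publish a few times with a
    growing backoff before surfacing the failure, instead of expecting the
    node itself to persist and redeliver the message."""
    last_error = None
    for attempt in range(attempts):
        try:
            publish_fn(message)          # e.g. a REST relay publish call
            return
        except Exception as err:         # connection refused, node restarting, ...
            last_error = err
            time.sleep(backoff_s * (attempt + 1))
    raise RuntimeError(f"publish failed after {attempts} attempts") from last_error
```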

Hope this helps :)

romanzac commented 3 months ago

> So if I understand correctly, the problem is that there is a potential race condition where the REST API is still available and responsive, but the node is already in the process of shutting down and potentially disconnected from peers, and hence cannot publish new messages (although the REST API still accepts them).
>
> That sounds like the right way to fix this would be to change the order in which components are shut down: disable most REST API endpoints (apart from health?) and only then start the actual shutdown of the p2p layer?
>
> I don't think we can get to a 100% safe flow where no messages are lost, especially with non-graceful shutdowns like OOM kills. So my thoughts would be to:
>
> 1. Make sure our health endpoint actually reflects the node's state (chore: improve node initialization sequence to make it more operator friendly #2270, chore: detailed json report on /health endpoint #2020)
> 2. Make sure users understand the inherent unreliability of p2p and gossipsub communication (i.e. they know they might need to retry if something goes wrong); also, keep in mind that duplicate messages are (safely?) ignored, so if you resend the same message (resulting in the same hash, https://rfc.vac.dev/spec/14/#deterministic-message-hashing) it is simply deduplicated
>
> I would not make the node persist messages and automatically retry after a restart; that should be the job of some higher-level abstraction, in my opinion.
>
> Hope this helps :)

Thanks for the comprehensive reply. It looks like we are on track to improve readings of the node health state. I have to admit I am not sure where exactly to pin the expectation for how reliable message delivery should be. When I think about 99.8% delivery -> 0.2% of messages lost, it could mean different things for different user bases. Different people might react differently to lost messages, perhaps also based on the importance of the messages they send over the Waku network. When we talk about 10,000 messages, 20 will get lost. When we talk about 1,000,000 messages, 2,000 will get lost.

Let's do some math with WhatsApp statistics. Based on the report at https://whatsthebigdata.com/whatsapp-statistics/, their 2.78B users exchange 140B messages per day. That is about 50 messages per day per user. With a 0.2% non-delivery rate, that is 280M messages lost per day?! Divided by 50, we would have 5,600,000 angry users every day. Is that the case? I don't think so. I believe WhatsApp delivers with a much lower, orders of magnitude lower, loss rate. Perhaps 1,000 angry users is acceptable? If my math is correct, that would mean 99.99996428% of messages are delivered.
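
The same arithmetic, spelled out (figures from the linked report; the tolerance of 1,000 affected users is the hypothetical from the paragraph above):

```python
users = 2.78e9                 # WhatsApp users
messages_per_day = 140e9       # messages exchanged per day
per_user = messages_per_day / users            # ~50 messages per user per day

loss_rate = 0.002              # 0.2% non-delivery
lost_per_day = messages_per_day * loss_rate    # ~280 million lost messages per day
affected_users = lost_per_day / per_user       # ~5.6 million users per day

acceptable_users = 1000        # hypothetical tolerance for "angry users"
required_delivery = 1 - (acceptable_users * per_user) / messages_per_day
print(f"{required_delivery:.7%}")              # ~99.99996% of messages delivered
```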

I remember from the "Waku & Vac DST/QA" meeting that the Vac DST team is looking at the reliability problem and trying various simulations.

My question is how the RLN feature fits with reliability, when I cannot send a retry message within the same epoch (let's assume 1 sec)? What would our guidance for integrators be here?

romanzac commented 3 months ago

I would be really happy if we could seal the reliability problems within Waku. Integrators would not need to worry about anything related to gossipsub communication while using Waku in lib mode. Once they want to operate their own Waku network it is, of course, a different story.

jm-clius commented 3 months ago

> I would be really happy if we could seal the reliability problems within Waku

Indeed, we want to improve what we can to make Waku as reliable as possible. There are at least three layers at which the reliability problem must be addressed:

  1. "Low layer"/routing issues: this includes reducing race conditions, API inconsistencies, useability, etc. in the node implementations that may lead to message losses. On this layer we are also working on improving connectivity and discovery, etc. 2."Middle layer"/SDKs: Compose protocols in a way that increase redundancy and reduce single points of failure; introduce syncing to cache and retrieve messages for reliability over longer time periods. We're spending a considerable amount of effort on this layer currently. See https://github.com/waku-org/research/issues/80 and https://github.com/waku-org/pm/issues/114

Strictly speaking, (1) and (2) are the only elements directly under Waku's control. Neither will ever guarantee the end-to-end message reliability that is needed within the (encrypted) application layer. Just as WhatsApp cannot rely on the underlying TCP/IP network for its app reliability, applications building on Waku would still need an end-to-end reliability overlay. Since this is a tall order for many small app teams, we are also working on:

  1. "App layer"/SDK: mechanisms to provide end-to-end reliability (e.g. seq numbering) within the encrypted app layer e.g. what we're trying to do within the Chat SDK.

I think we might be publishing a blog article on this topic soon. :) cc @vpavlin

romanzac commented 2 weeks ago

I can't find any failure related to "message not delivered" in the daily test reports for the past 2 months. I believe message delivery reliability has improved beyond the degree noticeable by the interop tests.

We can discuss the next level of tests further, outside this issue (they will probably need more scale and randomness).

I would like to close this issue; let me know if there are any objections: @fbarbu15 @jm-clius @vpavlin @Ivansete-status