status-im / status-desktop

Status Desktop client made in Nim & QML
https://status.app
Mozilla Public License 2.0
287 stars 78 forks source link

Messages that are sent from desktop when the client goes offline and before offline banner is shown will be lost (communities, "Relay mode") #12813

Open churik opened 9 months ago

churik commented 9 months ago

Bug Report

Description

Messages that were sent from an offline state when the desktop client doesn't detect that the device is offline (when the banner appears, ~1-1,5 mins) will never be delivered. The issue is not reproducible for mobile.

Steps to reproduce

Prerequisites: 2 users have joined the community

  1. Desktop: switch off Wi-Fi, send several messages to the community (send messages until "Offline" banner appears)
  2. Desktop: send several messages after "Offline" banner appears (they will be shown with "Resend" option)
  3. Desktop: switch on Wi-fi
  4. Mobile: check community

https://github.com/status-im/status-desktop/assets/4557972/af0bdd66-6fd4-4c8a-ad1e-d4bcd93b4ac7

Expected behavior

all messages are delivered

Actual behavior

messages sent on step 1 will be lost

Additional Information

Log: geth copy 2.log

cammellos commented 9 months ago

most likely due to relay not having message confirmations? if lightpush enabled, do you have the same? it should work ok in 1-to-1, since we resend until delivery confirmed

John-44 commented 9 months ago

This bug will be perceived by users as messaging reliability issue, and some solution is therefor high priority. I won't opinion on the solution here, but seems like something for 0.15.5 (or 0.16 at the latest)

cammellos commented 9 months ago

This is something discussed before in waku calls, I believe there were some hesitation on their side, but I strongly believe it is something that we should add, we can take example from wakuv1 that reduced the toll on the network by only enabling if asked by the peer.

John-44 commented 9 months ago

ping @fryorcraken @richard-ramos @jm-clius - FYI in case you thoughts regarding the issue discussed above

churik commented 9 months ago

@cammellos Light mode enabled on desktop or "Relay mode" in 1-1 chat: all messages are marked as "Resend" immediately app detects an offline state, and after coming back online messages are recent.

So yeah, the issue is specific to "Relay mode" (default on desktop) in communities

jrainville commented 9 months ago

Andrea:

one thing that is a bit odd, is that messages were not resent so I would start from investigating that basically if peer count is 0 the message should be resent in theory, but I don't know on relay, I know in lightpush on mobile it works if you at least do that, then the issue becomes smaller

cammellos commented 9 months ago

Thanks, we have seen this issue again but on a good connection

fryorcraken commented 9 months ago

We will propose strategies that can be used to mitigate the issue. https://github.com/waku-org/go-waku/issues/921 tracks this work.

chaitanyaprem commented 9 months ago

Thanks, we have seen this issue again but on a good connection

I understand why messages were not sent before we identify network is disconnected. But this seems strange, is this reproducible and do we have any further details when this has occured? Are any peers connected for the pubsubTopic? Any other useful info would help in root-causing the issue.

cammellos commented 9 months ago

Thanks, we have seen this issue again but on a good connection

I understand why messages were not sent before we identify network is disconnected. But this seems strange, is this reproducible and do we have any further details when this has occured? Are any peers connected for the pubsubTopic? Any other useful info would help in root-causing the issue.

Not sure we can replicate consistently, the connection was stable, but the peer count was 0, in this case the message should not get confirmed (I am not sure it's the case now to be honest, I can check I guess, that would alleviate the issue, but we'd still have the half open connection)

chaitanyaprem commented 9 months ago

Thanks, we have seen this issue again but on a good connection

I understand why messages were not sent before we identify network is disconnected. But this seems strange, is this reproducible and do we have any further details when this has occured? Are any peers connected for the pubsubTopic? Any other useful info would help in root-causing the issue.

Not sure we can replicate consistently, the connection was stable, but the peer count was 0, in this case the message should not get confirmed (I am not sure it's the case now to be honest, I can check I guess, that would alleviate the issue, but we'd still have the half open connection)

Relay publish should fail in such a scenario if it is initialized with minPeersToPublish value of more than 0. Is that handled to mark the message for retry? Unless it hit some kind of a corner case where during publish the peer connection is still considered active and then peer connection got dropped off/marked down (due to TCP delay in identifying the same).

cammellos commented 9 months ago

Unless it hit some kind of a corner case where during publish the peer connection is still considered active and then peer connection got dropped off/marked down (due to TCP delay in identifying the same).

This 100% happens, it's easy to replicate, cut off internet just before sending a message and you will get an half open connection, we had this problem on wakuv1 and that's why we implemented confirmations. On mobile happens very often, on desktop should happen less, but I guess depends on how stable your connection and peers are.

I have the suspicious though that we don't mark messages for resend if the peer count is 0, that I need to have a look and we can address, but the issue above will still be valid.

chaitanyaprem commented 9 months ago

This 100% happens, it's easy to replicate, cut off internet just before sending a message and you will get an half open connection, we had this problem on wakuv1 and that's why we implemented confirmations. On mobile happens very often, on desktop should happen less, but I guess depends on how stable your connection and peers are.

Yes, i have tested it myself today and agree that this is due to half open TCP or TCP not yet identifying network disconnection and messages remaining in send buffer. I have documented a draft approach to take care of such scenarios https://github.com/waku-org/go-waku/issues/921#issuecomment-1831564676. Do go through and give your feedback as well.

I have the suspicious though that we don't mark messages for resend if the peer count is 0, that I need to have a look and we can address, but the issue above will still be valid.

Ok, sure...would be awesome if you can also point me to the code where this is set.

endulab commented 9 months ago

My two cents. Status side works correctly. But the p2p layer does not inform quick enough about the lost connection. The flow looks like this:

  1. Turn off connection
  2. We have >0 peers we can send messages
  3. After 25-30 seconds (on my machine) p2p ConnectionGater emits signal, we see the "no connection" banner, the number of peers become 0.
  4. Now we can send messages but they are marked as "Resend". When connection is back they will be resent correctly.
cammellos commented 8 months ago

After a discussion, we decided to implement an e2e syncing mechanism between peers (we can extend the datasync layer that does something similar albeit for 1-to-1 chats), and later we can re-evaluate whether confirmations from the next hop are still needed. I am currently working on it.

https://github.com/status-im/status-go/issues/4438

jrainville commented 3 months ago

@cammellos I see that the PR you linked above is merged. Should we close this issue?

cammellos commented 3 months ago

I don't think so, still valid, even if you have peersyncing enabled

fryorcraken commented 2 months ago

Work to fix this kind of issue has been transferred to Waku. This is around detection of disconnection in Waku Relay and feedback that messages were not sent.

Work is tracked with https://github.com/waku-org/pm/issues/184