status-im / status-mobile

a free (libre) open source, mobile OS for Ethereum
https://status.app
Mozilla Public License 2.0
3.9k stars 987 forks source link

Messages are slow and some of them are marked as not delivered (1-1 chat between relay/light) #21497

Open churik opened 2 hours ago

churik commented 2 hours ago

Bug Report

Problem

Messages are marked as "not sent" and are sending for 3-4 mins, at least 10% of them are marked as not sent.

Expected behavior

messages are sent instantly

Actual behavior

some stuck in "not sent" and are sending for 3-4 mins

IMPORTANT NOTE: Receiver received messages in banch after getting from background in 30-40 sec, video from receiver: IOS: https://github.com/user-attachments/assets/c5276e46-ba73-4a70-9f35-889232ffa6d5 IOS logs: logs (18).zip

Also this bug is reproducible because the input is available for receiver once they received contact request, and for sender there is a delay in accepting.

And when receiver comes from background, messages are marked as "Received" on desktop side as well.

Reproduction

  1. Device A (Mobile): put app to background for 3+ mins
  2. Device B(Desktop): start sending messages to Device A

https://github.com/user-attachments/assets/6ae1e524-6907-4472-b70d-050fac0a4fc7

Log from desktop for this particular minute: geth12_37.log

Data: 30 peers / node is store01.ams.status.prod More logs/ video from previous hour. More video: https://drive.google.com/file/d/1wcd-Ya3-yAztm5B0Of216i1p_ribt5IU/view?usp=sharing Logs: https://drive.google.com/file/d/10z-cHiuDn_rt9ScL5sj_DGPv_tME9-GZ/view?usp=sharing

Additional Information

churik commented 2 hours ago

cc @richard-ramos @chaitanyaprem any clues from logs why it might happen?

chaitanyaprem commented 1 hour ago

took a quick look at the logs and looks like store queries are timing out which might be what is causing messages shown in sending state. But

ERROR[10-24|10:38:12.425|github.com/waku-org/go-waku/waku/v2/api/missing/missing_messages.go:190] could not query storenode peerID=16Uiu2HAmAUdrQ3uwzuE4Gy4D56hX6uLKEeerJAnhKEHZ3DxF1EfT contentTopics="[/waku/1/0x950464c4/rfc26 /waku/1/0x5143d1fc/rfc26 /waku/1/0x43a43bd3/rfc26 /waku/1/0x8328c702/rfc26 /waku/1/0xb0e6c5a6/rfc26 /waku/1/0xc0086e00/rfc26 /waku/1/0x0cb06bba/rfc26 /waku/1/0x6c2cab7a/rfc26 /waku/1/0x8ee11cb2/rfc26 /waku/1/0x276cf54e/rfc26]" pubsubTopic=/waku/2/rs/16/32 from=1729763103353073000 to=1729763160259783000 error="context canceled" attempt=1

Another thing i had noticed is that about 11 peers are descored and messages are ignored from them and some of them seem like fleet nodes. So either the desktop node has some bad connectivity causing it to descore the peers or something else is causing this. This coupled with store queries failing must be causing messages not getting delivered.

DEBUG[10-24|10:38:33.323|go.uber.org/zap/sugar.go:198]                                                               IWANT: ignoring peer 16Uiu2HAmNTpGnyZ8W1BK2sXEmgSCNWiyDKgRU3NBR2DXST2HzxRU with score below threshold [score = -135.692805] 

DEBUG[10-24|10:38:29.651|go.uber.org/zap/sugar.go:198]                                                               IHAVE: ignoring peer 16Uiu2HAm2M7xs7cLPc3jamawkEqbr7cUJX11uvY7LxQ6WFUdUKUT with score below threshold [score = -288.371036] 

it would help debug further if we have more logs and messageHashes of msgs in the video. As of now logs are there for 30 seconds only which is very short duration.

churik commented 53 minutes ago

@chaitanyaprem

there are more of logs here: https://drive.google.com/file/d/10z-cHiuDn_rt9ScL5sj_DGPv_tME9-GZ/view?usp=sharing

churik commented 31 minutes ago

ONE MORE IMPORTANT NOTE: It is reproducible pretty easy for me on desktop-desktop as well - you need to be offline as the receiver, I'd expect that they marked as a "Sent" but not "Sending failed".

Waku thread

richard-ramos commented 14 minutes ago

It is reproducible pretty easy for me on desktop-desktop as well - you need to be offline as the receiver, I'd expect that they marked as a "Sent" but not "Sending failed".

I'll try to reproduce it with this approach, and hopefully see anything interesting in the logs!

churik commented 9 minutes ago

well actually seems it is my setup? story node? - but 2 ppl from Poland / Ukraine are not facing this issue.