status-im / status-desktop

Status Desktop client made in Nim & QML
https://status.app
Mozilla Public License 2.0
287 stars 78 forks source link

Community is failed to be fetched #15153

Closed virginiabalducci closed 4 days ago

virginiabalducci commented 3 months ago

Bug Report

Description

User cannot join a community, clicking on the community link does nothing.

Steps to reproduce

  1. User opens a community link
  2. Nothing happens on the UI.

There are console errors:

ERR 2024-06-12 11:51:55.148-03:00 rpc response error                         topics="rpc" tid=4479695 file=core.nim:36 err="\nstatus-go error [methodName:wakuext_fetchCommunity, code:-32000, message:too many failed requests ]\n"
ERR 2024-06-12 11:51:55.148-03:00 error doing rpc request                    topics="rpc" tid=4479695 file=core.nim:40 methodName=wakuext_fetchCommunity exception="\nstatus-go error [methodName:wakuext_fetchCommunity, code:-32000, message:too many failed requests ]\n"
ERR 2024-06-12 11:51:55.148-03:00 Error requesting community info            topics="community-service" tid=4479673 file=service.nim:1602 msg="\"\\nstatus-go error [methodName:wakuext_fetchCommunity, code:-32000, message:too many failed requests ]\\n\""
INF 2024-06-12 11:51:55.157-03:00 mailserver not working                     topics="mailservers-service" tid=4479673 file=service.nim:161
INF 2024-06-12 11:51:55.158-03:00 removing active mailserver                 topics="mailservers-service" tid=4479673 file=service.nim:149
DBG 2024-06-12 11:51:55.841-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=4479673 file=core.nim:27 rpc_method=wakuext_speedupArchivesImport
INF 2024-06-12 11:52:00.956-03:00 active mailserver changed                  topics="mailservers-service" tid=4479673 file=service.nim:151 node=/dns4/store-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAm9CQhsuwPR54q27kNj9iaQVfyRzTGKrhFmr94oD8ujU6P
ACTIVE MAILSERVER CHANGED: ref 0x10370f5f0 --> [nodeAddress = 0x10371ec40"/dns4/store-02.ac-cn-hongkong-c.shards.test.statusim.net/tcp/30303/p2p/16Uiu2HAm9CQhsuwPR54q27kNj9iaQVfyRzTGKrhFmr94oD8ujU6P"]
INF 2024-06-12 11:52:00.958-03:00 mailserver available                       topics="mailservers-service" tid=4479673 file=service.nim:156

Expected behavior

Attaching screenshots and screen recordings

Today: https://github.com/status-im/status-desktop/assets/68916150/746cde6e-01d7-4d42-8240-ea3ab3e0df1d

Yesterday (11th June) Screenshot 2024-06-11 at 7 25 30 PM

https://github.com/status-im/status-desktop/assets/68916150/cd7caf28-b753-405a-970f-1ef87e6cf5fe

Additional Information

This issue has been happening since yesterday.

jakubgs commented 3 months ago

Inbound traffic to nodes is low:

image

And number of waku errors negligible:

image

jakubgs commented 3 months ago

Peers graph show some weird drops, but those are probably node restarts:

image

jakubgs commented 3 months ago

Interestingly store-01.do-ams3.shards.test node was stuck restarting due to failed DB migrations:

jakubgs@store-01.do-ams3.shards.test:~ % tail -n1 /var/log/docker/nim-waku-store/docker.log
2024-06-12T14:22:26.091446+00:00 docker/nim-waku-store[791]: ERR 2024-06-12 14:22:26.091+00:00 Waku initialization failed                 topics="wakunode main" tid=1 file=wakunode2.nim:78 error="Failed setting up node: Mounting protocols failed: failed to setup archive driver: ArchiveDriver build failed in migration: migrate error could not retrieve current version: error in getCurrentVersion-existsTable: existsTable failed in getRow: connRes.isErr in query: failed to stablish a new connection: exception opening new connection: connection to server at \"store-db-01.do-ams3.shards.test.wg\" (10.10.0.182), port 5432 failed: FATAL:  remaining connection slots are reserved for non-replication superuser connections\n"
jakubgs commented 3 months ago

The DB host in EU shows extremely high load considering it has only one core:

image

The load spike is very abrupt:

image

And coincides with 100% disk usage:

image

Of which most is reads:

Which suggests this is an issue with big queries, probably unoptimized:

image

jakubgs commented 3 months ago

I have doubled the size of DB hosts in shard.test fleet as a test:

Lets see what happens. But my bet is we are doing some big queries that lack an index or something similar.

jakubgs commented 3 months ago

Currently EU DB node has been running for an hour and the CPU load hovers around 2.0, which is much better, but still a bit too much for a machine with 2 cores:

image

image

anastasiyaig commented 3 months ago

i am still able to reproduce the issue, interesting i find in geth log is this.

So i just opened the app (master local build commit 60b1ec82ec19f976c1fb74c27fc4d10c815d2628 and try to click the community link

INFO [06-12|18:58:42.972|github.com/status-im/status-go/protocol/messenger.go:362] failed to find Waku service; trying WakuV2 error="waku service is not available"

geth.log

friofry commented 3 months ago

I've created a duplicate. There are some logs in the ticket: https://github.com/status-im/status-desktop/issues/15154

jakubgs commented 3 months ago

The CPU load situation has definitely improved:

image

And disk utilization is definitely better, even if it still hits 100%:

image

But this is still a bit too high load for a machine with 2 cores.

But despite that queries should be answered at this point. They might be slow at times but it should work for the most part.

jakubgs commented 3 months ago

The fleet appears to be fully available based on our canary checks:

image

https://canary.infra.status.im/service/142/

So the issue seems to be in the nim-waku node code, because the DB load no longer explains this behavior.

jakubgs commented 3 months ago

If we wanted to confirm the DB is not the issue we'd have to log the actual queries being made, make them by hand, and confirm the answer is what is expected.

I believe @Ivansete-status is already performing debugging of that kind on shards.staging fleet. Isn't that right?

Ivansete-status commented 3 months ago

I believe @Ivansete-status is already performing debugging of that kind on shards.staging fleet. Isn't that right?

Hi @jakubgs - Yes, I looked at it last week and didn't reach a conclusion. Nevertheless, I think it is better to only bring database optimizations within StoreV3 given that we are willing to decommission StoreV2 shortly.

jakubgs commented 3 months ago

No idea what that means.

Ivansete-status commented 3 months ago

No idea what that means.

We are willing to switch from one version of the Store protocol to other in mid-July. Therefore, I think it is better to focus on database optimizations for the new Store protocol, i.e. StoreV3. See https://github.com/waku-org/specs/blob/master/standards/core/store.md for more details. Ping me if it doesn't get clearer and I'm happy to talk about it.

jakubgs commented 3 months ago

Oh I see, so a new version of store protocol will bring about with it also a new DB schema for the stored messages. And for that new schema you want to apply optimizations? Well, not sure, what "shortly" means tho in this context. Issues are present right now, no didea what your timeline for StoreV3 is.

Ivansete-status commented 3 months ago

Hey @jakubgs! Thanks for your comment. I absolutely agree that this is super important and relevant.

When I mentioned "to only bring database optimizations within StoreV3" I meant to only perform analysis of queries' performance (e.g., EXPLAIN ANALYZE ...) in StoreV3, given that we are willing to migrate to StoreV3 next month. Also, notice that the schema will remain the same because we already prepared it for StoreV3, and only the query set changes (cc @SionoiS.)

@virginiabalducci - I just sent you a contact request on Discord so that we can schedule a session to try to replicate the issue again, or at least, I could learn how to replicate it :)

Ivansete-status commented 3 months ago

Today @virginiabalducci and I had an interesting pair-programming session where we chased to replicate the issue. From it, Virginia was using a test user who wasn't able to get into a community, i.e. after clicking the community invite (https://status.app/c/ixOACiYKBVRyZWVzEg9hbGwgYWJvdXQgdHJlZXMYAiIHIzQwRDY0QyoBDAM=#zQ3shiN42kkpzervYp8sRvzuPR9xtzEBtnW2ttSMryhnB7AjH), the app didn't do anything. We discovered though, that the nwaku node actually handled some store requests from Virginia's node.

These are the logs we got from store-02.gc-us-central1-a.shards.test (the same happened after clicking the link three times:)

2024-06-28T15:33:18.827543+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:18.827+00:00 received history query                     topics="waku legacy store" tid=1 file=protocol.nim:69 peerId=16U*4p7iY5 requestId=4f35602f26cb761f07850f92af4faed6796919d033eebffb8b00de075913b1d8 query="(pubsubTopic: some(\"/waku/2/rs/16/64\"), contentTopics: @[\"/waku/1/0x09843b58/rfc26\"], cursor: none(HistoryCursor), startTime: some(1716910398000000000), endTime: some(1719588799000000000), pageSize: 4, direction: BACKWARD)"
2024-06-28T15:33:18.831205+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:18.831+00:00 sending history response                   topics="waku legacy store" tid=1 file=protocol.nim:93 peerId=16U*4p7iY5 requestId=4f35602f26cb761f07850f92af4faed6796919d033eebffb8b00de075913b1d8 messages=0
2024-06-28T15:33:19.002692+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:19.002+00:00 received history query                     topics="waku legacy store" tid=1 file=protocol.nim:69 peerId=16U*4p7iY5 requestId=0b70aa82553fdb5b1957b3e316888d91acbb4eb8202fece5877d03a3be3df21c query="(pubsubTopic: some(\"/waku/2/rs/16/32\"), contentTopics: @[\"/waku/1/0xff0cea5d/rfc26\"], cursor: none(HistoryCursor), startTime: some(1716910398000000000), endTime: some(1719588799000000000), pageSize: 4, direction: BACKWARD)"
2024-06-28T15:33:19.036037+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:19.035+00:00 sending history response                   topics="waku legacy store" tid=1 file=protocol.nim:93 peerId=16U*4p7iY5 requestId=0b70aa82553fdb5b1957b3e316888d91acbb4eb8202fece5877d03a3be3df21c messages=1

It doesn't sound correct that the query only returns 0 messages for contenttopic /waku/1/0x09843b58/rfc26 and returns 1 message for contenttopic /waku/1/0xff0cea5d/rfc26. @cammellos , @richard-ramos - do you know what could be wrong?

On the other hand, if I try to access the community with a brand new user, the app showed the following image: image

richard-ramos commented 3 months ago

For the first query:

2024-06-28T15:33:18.827543+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:18.827+00:00 received history query                     topics="waku legacy store" tid=1 file=protocol.nim:69 peerId=16U*4p7iY5 requestId=4f35602f26cb761f07850f92af4faed6796919d033eebffb8b00de075913b1d8 query="(pubsubTopic: some(\"/waku/2/rs/16/64\"), contentTopics: @[\"/waku/1/0x09843b58/rfc26\"], cursor: none(HistoryCursor), startTime: some(1716910398000000000), endTime: some(1719588799000000000), pageSize: 4, direction: BACKWARD)"
2024-06-28T15:33:18.831205+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:18.831+00:00 sending history response                   topics="waku legacy store" tid=1 file=protocol.nim:93 peerId=16U*4p7iY5 requestId=4f35602f26cb761f07850f92af4faed6796919d033eebffb8b00de075913b1d8 messages=0

Note that the pubsub topic being used is /waku/2/rs/16/64. This pubsub topic is used to send/receive community join requests. Using the message finder tool i could see indeed that nwaku did not return any message for these parameters. I think that this is a valid result.


For the second query:

2024-06-28T15:33:19.002692+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:19.002+00:00 received history query                     topics="waku legacy store" tid=1 file=protocol.nim:69 peerId=16U*4p7iY5 requestId=0b70aa82553fdb5b1957b3e316888d91acbb4eb8202fece5877d03a3be3df21c query="(pubsubTopic: some(\"/waku/2/rs/16/32\"), contentTopics: @[\"/waku/1/0xff0cea5d/rfc26\"], cursor: none(HistoryCursor), startTime: some(1716910398000000000), endTime: some(1719588799000000000), pageSize: 4, direction: BACKWARD)"
2024-06-28T15:33:19.036037+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:19.035+00:00 sending history response                   topics="waku legacy store" tid=1 file=protocol.nim:93 peerId=16U*4p7iY5 requestId=0b70aa82553fdb5b1957b3e316888d91acbb4eb8202fece5877d03a3be3df21c messages=1

Using the message finder I could see that indeed there's only a single message available for the time range. Looking at that message payload, i can see it has 770588 bytes of length. This seems to be a community description message, so seems to me that the store message retrieval is working fine. Perhaps the problem being then status-go or desktop not reacting to the message?

virginiabalducci commented 2 months ago

Hi, I'm experiencing issues again when trying to access a community by clicking on its link

I'm using this nightly version StatusIm-Desktop-240717-125009-9f03c2-nightly-aarch64.dmg

The profile I'm using is brand new.

https://github.com/user-attachments/assets/ce4974c8-41ab-42e7-a305-7b3f6fa35268

Console log

DBG 2024-07-18 23:14:12.555-03:00 [threadpool task thread] initiating task   topics="task-threadpool" tid=1917625 file=threadpool.nim:56 messageType=AsyncGetProfileShowcaseForContactTaskArg:ObjectType threadid=1917625 task="{\"$type\":\"AsyncGetProfileShowcaseForContactTaskArg:ObjectType\",\"pubkey\":\"0x0439ec7ad348caf59129f10f6fab46f9c9e4bcb7b8960415754856f9d661482284106d91a3796a30428fcd4bc2f2868f83968bada041973c1ef6609b573feada13\",\"validate\":false,\"vptr\":105553119046560,\"slot\":\"asyncProfileShowcaseForContactLoaded\"}"
DBG 2024-07-18 23:14:12.555-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=1917625 file=core.nim:27 rpc_method=wakuext_getProfileShowcaseForContact
DBG 2024-07-18 23:14:12.557-03:00 [threadpool task thread] initiating task   topics="task-threadpool" tid=1917631 file=threadpool.nim:56 messageType=AsyncGetProfileShowcaseForContactTaskArg:ObjectType threadid=1917631 task="{\"$type\":\"AsyncGetProfileShowcaseForContactTaskArg:ObjectType\",\"pubkey\":\"0x0439ec7ad348caf59129f10f6fab46f9c9e4bcb7b8960415754856f9d661482284106d91a3796a30428fcd4bc2f2868f83968bada041973c1ef6609b573feada13\",\"validate\":true,\"vptr\":105553119046560,\"slot\":\"asyncProfileShowcaseForContactLoaded\"}"
DBG 2024-07-18 23:14:12.557-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=1917631 file=core.nim:27 rpc_method=wakuext_getProfileShowcaseForContact
DBG 2024-07-18 23:14:12.728-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_getLatestVerificationRequestFrom
DBG 2024-07-18 23:14:15.352-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_parseSharedURL
DBG 2024-07-18 23:14:15.353-03:00 [threadpool task thread] initiating task   topics="task-threadpool" tid=1917628 file=threadpool.nim:56 messageType=AsyncRequestCommunityInfoTaskArg:ObjectType threadid=1917628 task="{\"$type\":\"AsyncRequestCommunityInfoTaskArg:ObjectType\",\"communityId\":\"0x027fc8caa29e9dd562d4fca578836d2f861d7a9899c709b2e555e4befb784b33f2\",\"importing\":false,\"tryDatabase\":true,\"shardCluster\":-1,\"shardIndex\":-1,\"vptr\":105553119046272,\"slot\":\"asyncCommunityInfoLoaded\"}"
DBG 2024-07-18 23:14:15.354-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=1917628 file=core.nim:27 rpc_method=wakuext_fetchCommunity
DBG 2024-07-18 23:14:17.658-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_parseSharedURL
DBG 2024-07-18 23:14:17.659-03:00 [threadpool task thread] initiating task   topics="task-threadpool" tid=1917634 file=threadpool.nim:56 messageType=AsyncRequestCommunityInfoTaskArg:ObjectType threadid=1917634 task="{\"$type\":\"AsyncRequestCommunityInfoTaskArg:ObjectType\",\"communityId\":\"0x027fc8caa29e9dd562d4fca578836d2f861d7a9899c709b2e555e4befb784b33f2\",\"importing\":false,\"tryDatabase\":true,\"shardCluster\":-1,\"shardIndex\":-1,\"vptr\":105553119046272,\"slot\":\"asyncCommunityInfoLoaded\"}"
DBG 2024-07-18 23:14:17.660-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=1917634 file=core.nim:27 rpc_method=wakuext_fetchCommunity
DBG 2024-07-18 23:14:24.110-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_speedupArchivesImport
DBG 2024-07-18 23:14:38.827-03:00 NewBE_callPrivateRPC                       topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_slowdownArchivesImport

Geth log geth copy.log.zip

virginiabalducci commented 4 days ago

This issue is no longer reproducible on latest master.