Closed by virginiabalducci 2 months ago
Inbound traffic to nodes is low:
And number of waku errors negligible:
The peers graph shows some weird drops, but those are probably node restarts:
Interestingly, the store-01.do-ams3.shards.test node was stuck restarting due to failed DB migrations:
jakubgs@store-01.do-ams3.shards.test:~ % tail -n1 /var/log/docker/nim-waku-store/docker.log
2024-06-12T14:22:26.091446+00:00 docker/nim-waku-store[791]: ERR 2024-06-12 14:22:26.091+00:00 Waku initialization failed topics="wakunode main" tid=1 file=wakunode2.nim:78 error="Failed setting up node: Mounting protocols failed: failed to setup archive driver: ArchiveDriver build failed in migration: migrate error could not retrieve current version: error in getCurrentVersion-existsTable: existsTable failed in getRow: connRes.isErr in query: failed to stablish a new connection: exception opening new connection: connection to server at \"store-db-01.do-ams3.shards.test.wg\" (10.10.0.182), port 5432 failed: FATAL: remaining connection slots are reserved for non-replication superuser connections\n"
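The FATAL error above means the store node exhausted Postgres's connection limit: once all of max_connections minus superuser_reserved_connections slots are taken, ordinary clients are refused with exactly this message. The relevant knobs look roughly like this (illustrative Postgres defaults, not the fleet's actual settings):

```
# postgresql.conf -- illustrative values, not the fleet's actual config
max_connections = 100               # total client slots (Postgres default)
superuser_reserved_connections = 3  # slots held back for superuser logins
```

Raising max_connections or putting a connection pooler (e.g. PgBouncer) in front of the DB are the usual mitigations.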
The DB host in EU shows extremely high load considering it has only one core:
The load spike is very abrupt:
And coincides with 100% disk usage:
Of which most is reads:
Which suggests this is an issue with big queries, probably unoptimized:
I have doubled the size of the DB hosts in the shards.test fleet as a test:
infra-shards#a8162303
- store-db: double size of hosts to handle big queries
Let's see what happens. But my bet is that we are doing some big queries that lack an index or something similar.
Currently the EU DB node has been running for an hour and the CPU load hovers around 2.0, which is much better, but still a bit too much for a machine with 2 cores:
I am still able to reproduce the issue. The interesting thing I find in the geth log is this:
So I just opened the app (master local build, commit 60b1ec82ec19f976c1fb74c27fc4d10c815d2628) and tried to click the community link.
INFO [06-12|18:58:42.972|github.com/status-im/status-go/protocol/messenger.go:362] failed to find Waku service; trying WakuV2 error="waku service is not available"
I've created a duplicate. There are some logs in the ticket: https://github.com/status-im/status-desktop/issues/15154
The CPU load situation has definitely improved:
And disk utilization is definitely better, even if it still hits 100%:
But this is still a bit too high a load for a machine with 2 cores.
Despite that, queries should be answered at this point. They might be slow at times, but it should work for the most part.
The fleet appears to be fully available based on our canary checks:
https://canary.infra.status.im/service/142/
So the issue seems to be in the nim-waku node code, because the DB load no longer explains this behavior.
If we wanted to confirm the DB is not the issue, we'd have to log the actual queries being made, run them by hand, and confirm the responses are what we expect.
I believe @Ivansete-status is already performing debugging of that kind on shards.staging
fleet. Isn't that right?
Hi @jakubgs - Yes, I looked at it last week and didn't reach a conclusion. Nevertheless, I think it is better to only bring database optimizations within StoreV3, given that we plan to decommission StoreV2 shortly.
No idea what that means.
We are planning to switch from one version of the Store protocol to the other in mid-July. Therefore, I think it is better to focus database optimizations on the new Store protocol, i.e. StoreV3. See https://github.com/waku-org/specs/blob/master/standards/core/store.md for more details. Ping me if it doesn't get clearer and I'll be happy to talk about it.
Oh I see, so a new version of the Store protocol will also bring with it a new DB schema for the stored messages, and you want to apply optimizations to that new schema? Well, I'm not sure what "shortly" means in this context, though. The issues are present right now, and I have no idea what your timeline for StoreV3 is.
Hey @jakubgs! Thanks for your comment. I absolutely agree that this is super important and relevant.
When I mentioned "to only bring database optimizations within StoreV3" I meant to only perform analysis of query performance (e.g., EXPLAIN ANALYZE ...) in StoreV3, given that we plan to migrate to StoreV3 next month. Also, notice that the schema will remain the same because we already prepared it for StoreV3; only the query set changes (cc @SionoiS).
@virginiabalducci - I just sent you a contact request on Discord so that we can schedule a session to try to replicate the issue again, or at least, I could learn how to replicate it :)
Today @virginiabalducci and I had an interesting pair-programming session where we tried to replicate the issue.
In that session, Virginia was using a test user who wasn't able to get into a community, i.e. after clicking the community invite (https://status.app/c/ixOACiYKBVRyZWVzEg9hbGwgYWJvdXQgdHJlZXMYAiIHIzQwRDY0QyoBDAM=#zQ3shiN42kkpzervYp8sRvzuPR9xtzEBtnW2ttSMryhnB7AjH), the app didn't do anything. We discovered, though, that the nwaku node actually handled some store requests from Virginia's node.
These are the logs we got from store-02.gc-us-central1-a.shards.test (the same happened after clicking the link three times):
2024-06-28T15:33:18.827543+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:18.827+00:00 received history query topics="waku legacy store" tid=1 file=protocol.nim:69 peerId=16U*4p7iY5 requestId=4f35602f26cb761f07850f92af4faed6796919d033eebffb8b00de075913b1d8 query="(pubsubTopic: some(\"/waku/2/rs/16/64\"), contentTopics: @[\"/waku/1/0x09843b58/rfc26\"], cursor: none(HistoryCursor), startTime: some(1716910398000000000), endTime: some(1719588799000000000), pageSize: 4, direction: BACKWARD)"
2024-06-28T15:33:18.831205+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:18.831+00:00 sending history response topics="waku legacy store" tid=1 file=protocol.nim:93 peerId=16U*4p7iY5 requestId=4f35602f26cb761f07850f92af4faed6796919d033eebffb8b00de075913b1d8 messages=0
2024-06-28T15:33:19.002692+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:19.002+00:00 received history query topics="waku legacy store" tid=1 file=protocol.nim:69 peerId=16U*4p7iY5 requestId=0b70aa82553fdb5b1957b3e316888d91acbb4eb8202fece5877d03a3be3df21c query="(pubsubTopic: some(\"/waku/2/rs/16/32\"), contentTopics: @[\"/waku/1/0xff0cea5d/rfc26\"], cursor: none(HistoryCursor), startTime: some(1716910398000000000), endTime: some(1719588799000000000), pageSize: 4, direction: BACKWARD)"
2024-06-28T15:33:19.036037+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:19.035+00:00 sending history response topics="waku legacy store" tid=1 file=protocol.nim:93 peerId=16U*4p7iY5 requestId=0b70aa82553fdb5b1957b3e316888d91acbb4eb8202fece5877d03a3be3df21c messages=1
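As a side note, the startTime/endTime values in these queries are Unix timestamps in nanoseconds; a quick sketch (values copied from the logs above) shows the requested range:

```python
# Decode the nanosecond timestamps from the store queries above.
from datetime import datetime, timezone

def ns_to_utc(ns: int) -> str:
    # Integer-divide down to seconds to avoid float precision loss.
    return datetime.fromtimestamp(ns // 1_000_000_000, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S"
    )

print(ns_to_utc(1716910398000000000))  # 2024-05-28 15:33:18
print(ns_to_utc(1719588799000000000))  # 2024-06-28 15:33:19
```

The end time matches the request time in the log, i.e. the client asked for roughly a 31-day lookback window.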
It doesn't seem correct that the query returns 0 messages for content topic /waku/1/0x09843b58/rfc26 but returns 1 message for content topic /waku/1/0xff0cea5d/rfc26. @cammellos, @richard-ramos - do you know what could be wrong?
On the other hand, when I tried to access the community with a brand new user, the app showed the following image:
For the first query:
2024-06-28T15:33:18.827543+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:18.827+00:00 received history query topics="waku legacy store" tid=1 file=protocol.nim:69 peerId=16U*4p7iY5 requestId=4f35602f26cb761f07850f92af4faed6796919d033eebffb8b00de075913b1d8 query="(pubsubTopic: some(\"/waku/2/rs/16/64\"), contentTopics: @[\"/waku/1/0x09843b58/rfc26\"], cursor: none(HistoryCursor), startTime: some(1716910398000000000), endTime: some(1719588799000000000), pageSize: 4, direction: BACKWARD)"
2024-06-28T15:33:18.831205+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:18.831+00:00 sending history response topics="waku legacy store" tid=1 file=protocol.nim:93 peerId=16U*4p7iY5 requestId=4f35602f26cb761f07850f92af4faed6796919d033eebffb8b00de075913b1d8 messages=0
Note that the pubsub topic being used is /waku/2/rs/16/64. This pubsub topic is used to send/receive community join requests. Using the message finder tool I could indeed see that nwaku did not return any message for these parameters.
I think that this is a valid result.
For the second query:
2024-06-28T15:33:19.002692+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:19.002+00:00 received history query topics="waku legacy store" tid=1 file=protocol.nim:69 peerId=16U*4p7iY5 requestId=0b70aa82553fdb5b1957b3e316888d91acbb4eb8202fece5877d03a3be3df21c query="(pubsubTopic: some(\"/waku/2/rs/16/32\"), contentTopics: @[\"/waku/1/0xff0cea5d/rfc26\"], cursor: none(HistoryCursor), startTime: some(1716910398000000000), endTime: some(1719588799000000000), pageSize: 4, direction: BACKWARD)"
2024-06-28T15:33:19.036037+00:00 docker/nim-waku-store[1782]: INF 2024-06-28 15:33:19.035+00:00 sending history response topics="waku legacy store" tid=1 file=protocol.nim:93 peerId=16U*4p7iY5 requestId=0b70aa82553fdb5b1957b3e316888d91acbb4eb8202fece5877d03a3be3df21c messages=1
Using the message finder I could see that there is indeed only a single message available for the time range. Looking at that message's payload, I can see it is 770588 bytes long. This seems to be a community description message, so it looks to me like store message retrieval is working fine. Perhaps the problem is then status-go or desktop not reacting to the message?
Hi, I'm experiencing issues again when trying to access a community by clicking on its link.
I'm using this nightly version: StatusIm-Desktop-240717-125009-9f03c2-nightly-aarch64.dmg
The profile I'm using is brand new.
https://github.com/user-attachments/assets/ce4974c8-41ab-42e7-a305-7b3f6fa35268
Console log
DBG 2024-07-18 23:14:12.555-03:00 [threadpool task thread] initiating task topics="task-threadpool" tid=1917625 file=threadpool.nim:56 messageType=AsyncGetProfileShowcaseForContactTaskArg:ObjectType threadid=1917625 task="{\"$type\":\"AsyncGetProfileShowcaseForContactTaskArg:ObjectType\",\"pubkey\":\"0x0439ec7ad348caf59129f10f6fab46f9c9e4bcb7b8960415754856f9d661482284106d91a3796a30428fcd4bc2f2868f83968bada041973c1ef6609b573feada13\",\"validate\":false,\"vptr\":105553119046560,\"slot\":\"asyncProfileShowcaseForContactLoaded\"}"
DBG 2024-07-18 23:14:12.555-03:00 NewBE_callPrivateRPC topics="rpc" tid=1917625 file=core.nim:27 rpc_method=wakuext_getProfileShowcaseForContact
DBG 2024-07-18 23:14:12.557-03:00 [threadpool task thread] initiating task topics="task-threadpool" tid=1917631 file=threadpool.nim:56 messageType=AsyncGetProfileShowcaseForContactTaskArg:ObjectType threadid=1917631 task="{\"$type\":\"AsyncGetProfileShowcaseForContactTaskArg:ObjectType\",\"pubkey\":\"0x0439ec7ad348caf59129f10f6fab46f9c9e4bcb7b8960415754856f9d661482284106d91a3796a30428fcd4bc2f2868f83968bada041973c1ef6609b573feada13\",\"validate\":true,\"vptr\":105553119046560,\"slot\":\"asyncProfileShowcaseForContactLoaded\"}"
DBG 2024-07-18 23:14:12.557-03:00 NewBE_callPrivateRPC topics="rpc" tid=1917631 file=core.nim:27 rpc_method=wakuext_getProfileShowcaseForContact
DBG 2024-07-18 23:14:12.728-03:00 NewBE_callPrivateRPC topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_getLatestVerificationRequestFrom
DBG 2024-07-18 23:14:15.352-03:00 NewBE_callPrivateRPC topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_parseSharedURL
DBG 2024-07-18 23:14:15.353-03:00 [threadpool task thread] initiating task topics="task-threadpool" tid=1917628 file=threadpool.nim:56 messageType=AsyncRequestCommunityInfoTaskArg:ObjectType threadid=1917628 task="{\"$type\":\"AsyncRequestCommunityInfoTaskArg:ObjectType\",\"communityId\":\"0x027fc8caa29e9dd562d4fca578836d2f861d7a9899c709b2e555e4befb784b33f2\",\"importing\":false,\"tryDatabase\":true,\"shardCluster\":-1,\"shardIndex\":-1,\"vptr\":105553119046272,\"slot\":\"asyncCommunityInfoLoaded\"}"
DBG 2024-07-18 23:14:15.354-03:00 NewBE_callPrivateRPC topics="rpc" tid=1917628 file=core.nim:27 rpc_method=wakuext_fetchCommunity
DBG 2024-07-18 23:14:17.658-03:00 NewBE_callPrivateRPC topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_parseSharedURL
DBG 2024-07-18 23:14:17.659-03:00 [threadpool task thread] initiating task topics="task-threadpool" tid=1917634 file=threadpool.nim:56 messageType=AsyncRequestCommunityInfoTaskArg:ObjectType threadid=1917634 task="{\"$type\":\"AsyncRequestCommunityInfoTaskArg:ObjectType\",\"communityId\":\"0x027fc8caa29e9dd562d4fca578836d2f861d7a9899c709b2e555e4befb784b33f2\",\"importing\":false,\"tryDatabase\":true,\"shardCluster\":-1,\"shardIndex\":-1,\"vptr\":105553119046272,\"slot\":\"asyncCommunityInfoLoaded\"}"
DBG 2024-07-18 23:14:17.660-03:00 NewBE_callPrivateRPC topics="rpc" tid=1917634 file=core.nim:27 rpc_method=wakuext_fetchCommunity
DBG 2024-07-18 23:14:24.110-03:00 NewBE_callPrivateRPC topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_speedupArchivesImport
DBG 2024-07-18 23:14:38.827-03:00 NewBE_callPrivateRPC topics="rpc" tid=1917609 file=core.nim:27 rpc_method=wakuext_slowdownArchivesImport
Geth log: geth copy.log.zip
This issue is no longer reproducible on latest master.
Bug Report
Description
User cannot join a community; clicking on the community link does nothing.
Steps to reproduce
There are console errors:
Expected behavior
Actual behavior
Attaching screenshots and screen recordings
Today: https://github.com/status-im/status-desktop/assets/68916150/746cde6e-01d7-4d42-8240-ea3ab3e0df1d
Yesterday (11th June)
https://github.com/status-im/status-desktop/assets/68916150/cd7caf28-b753-405a-970f-1ef87e6cf5fe
Additional Information
This issue has been happening since yesterday.