pocketnetteam / pocketnet.core

Decentralized social network based on the blockchain
https://pocketnet.app
Apache License 2.0

0.22 One cpu core is at 100% and node shows not responding in the GUI and the RPC can take ages to respond #658

Open the-real-vortex-v opened 6 months ago

the-real-vortex-v commented 6 months ago

Describe the bug

Version 0.22 of the node software keeps one CPU core/thread at 100%, the GUI shows the node as not responding, and RPC calls can take ages to respond.

My node had the off-chain issue, so I stopped it. I downloaded the 19/1/2024 latest.tgz from the snapshot website, deleted everything in pocketnet from the sql and database dirs (all the directories), uncompressed the archive, and restarted the node. It took about 12 hrs to "update sql format" etc. The node finally caught up after another 12 hrs or so, but it repeatedly shows "not responding" in the GUI window, which means I can't check anything such as connected peers. I can reach it over RPC, but a single command takes a long time. We are talking minutes.

I know I asked about these kinds of issues before, but it has never been this bad. It has become really bad in the last couple of updates.

update: So the SQL file main.sqlite3 is being updated and ends up being around 60gb. The latest.tgz from the pocketnet snapshot download is only around 24gb or so. I was advised by andy that 24gb should be the right size as of the ~19th or so. Why is it expanding to around 60gb? I've talked to several other node operators and all of us have 60gb+ sql files. Something is seriously wrong if it should not be 60gb and one core/thread is pegged at 100% all the time.

To Reproduce

Steps to reproduce the behavior: Update chain. Run node. UI is unresponsive.

Expected behavior

The UI should be responsive, without minutes-long "unresponsive" periods (not an exaggeration). One to five seconds of "unresponsive" followed by a responsive UI is how it used to be. Now it's actual minutes in a row.

Screenshots

image

Desktop (please complete the following information): Windows 10

Additional context

This has been getting progressively worse over the last few updates (maybe 3 or 4?). I was told it may be due to a Qt 5 threading issue: the main thread gets locked up and, for whatever reason, Qt gets no time. Is there any work on converting to Qt 6?

This issue also happens during the "processing blocks on disk" phase. There is no reason for this to happen. Blocking IO like this is really bad. I don't think this is a GUI issue but maybe rather a SQL issue? These issues got much worse with the shift towards SQL.

image

the-real-vortex-v commented 6 months ago

I'm trying a different snapshot that has the sql files updated (they are about 60gb) so it won't take ~12hrs to update the sql database and I'll see if that helps with the freezing.

the-real-vortex-v commented 6 months ago

I can confirm there is some kind of issue with the SQL/decoding etc code in the node. I am using a different snapshot provided by one of the members of the node group and the constant lockup issue is not happening. It seems that something is malformed/broken. The sql file in the correct version is ~64gb as opposed to the ~23gb of the one provided in the latest.tgz

The freezing came back when it was syncing headers. It's not as bad as before.

image

update: The node keeps dropping connected nodes and displays "unknown, syncing headers". Even when there are no active network connections, it is unresponsive.

andyoknen commented 6 months ago

As for the snapshot size, the correct size is about 20 GB, since the database was reduced for version 0.22. It also does not contain indexes that are built when the node is first started. Perhaps there is something strange in the logs?
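One way to compare the snapshot with the grown database is to look at the SQLite file directly. The following is only a rough sketch using Python's standard `sqlite3` module: it reports the file size implied by the page count, the space held by free pages, and the names of all indexes. The function name `describe_database` is made up for illustration, and nothing here is specific to pocketnet's schema. The node should be stopped before the file is opened.

```python
import sqlite3
from pathlib import Path


def describe_database(path):
    """Return (size_bytes, free_bytes, index_names) for an SQLite file."""
    # Build a file: URI and open read-only so the inspection cannot
    # modify the database.
    uri = Path(path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        page_size = conn.execute("PRAGMA page_size").fetchone()[0]
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
        # sqlite_master lists every index, including automatic ones
        # created for PRIMARY KEY / UNIQUE constraints.
        indexes = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'index' ORDER BY name"
            )
        ]
    finally:
        conn.close()
    return page_size * page_count, page_size * free_pages, indexes
```

Running this against main.sqlite3 once on a freshly unpacked snapshot and again after the first start would show whether the index list grew along with the file size.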

the-real-vortex-v commented 6 months ago

> As for the snapshot size, the correct size is about 20 GB, since the database was reduced for version 0.22. It also does not contain indexes that are built when the node is first started. Perhaps there is something strange in the logs?

The snapshot balloons to about 34gb or 60gb depending on whether you have pruning turned on. Nothing particularly strange in the logs. It's the usual scanning blocks, converting sql and then just "Block connected to chain" etc. No error messages apart from nodes timing out (it reminds me of the v18/v19 etc consensus issues).

I'm able to torrent, watch YouTube and do other normal internet activities on this PC. It's not a direct connection but behind NAT, and this has been working fine this way for the last several versions (since v19). I think it has something to do with how the node is doing SQL stuff. On a side note:

https://github.com/bitcoin-core/secp256k1/blob/master/CHANGELOG.md#041---2023-12-21

I think some of the code may need to be updated. The above update apparently includes a speed increase. I can't find which version of the code pocketnet uses, but it looks old. There are no version numbers obviously available in the readme, headers, etc.

It looks like a normal log to me. Do you want to see the snapshot? I made a torrent of it. Let me know if you want it.

update: As a test I just deleted all of the database and decided to see if resyncing from zero is different. So far the syncing headers is not causing the node to freeze up.

However, as soon as it started connecting blocks, the node freezes up whenever it logs "Block connected to chain". It never seems to catch up, and I get waves of timeouts on the networking side.

Logs:

2024-01-24T03:55:22Z Synchronizing blockheaders, height: 2106000 (~81.28%)
2024-01-24T03:55:22Z Synchronizing blockheaders, height: 2108000 (~81.36%)
2024-01-24T03:55:23Z Synchronizing blockheaders, height: 2110000 (~81.44%)
2024-01-24T03:55:23Z +++ Block connected to chain: 1 BH: 00000d2107354549b8143ca4ebd51364c122aad142a8e910cbd73a579e48a2c0
2024-01-24T03:55:23Z +++ Block connected to chain: 2 BH: 000001bda99e6a3820da5a1eab34638f50cd49a91449fc6229c4c7469bb0179a
2024-01-24T03:55:23Z +++ Block connected to chain: 3 BH: 000008cf7b056678b3838b27105be7da21b5b5f150bd8a9da3ccfc4c32929a2d
2024-01-24T03:55:23Z +++ Block connected to chain: 4 BH: 00000e2b0853032f8747197dadac8109d319de5472433072814dffacfcc08e35
2024-01-24T03:55:23Z +++ Block connected to chain: 5 BH: 00000a9b99e247d38b1ec7e25ca466d7718e0451d2c2cac9f782cd02718f296d
2024-01-24T03:55:23Z +++ Block connected to chain: 6 BH: 00000924b3506268351334d242520f2c7770b8b72c0fc7c16b7893836090d529
2024-01-24T03:55:23Z +++ Block connected to chain: 7 BH: 0000084faf6407e81489d2c6c16dc4bddb4d22a5a4ce77de6f40bbc5a82aa476
2024-01-24T03:55:23Z +++ Block connected to chain: 8 BH: 00000446ae62975fe5330eb92883897f2b15852e3b4e36280935570084dcce7a
2024-01-24T03:55:23Z +++ Block connected to chain: 9 BH: 00000dae1fde9b191527c7c6c472313cfee959a527976f647ef61524f3634924
2024-01-24T03:55:23Z +++ Block connected to chain: 10 BH: 00000f47187faefc84c543eac29fcf0016753eb7bfd8b2e90245640692a464e1
2024-01-24T03:55:23Z +++ Block connected to chain: 11 BH: 00000b756aee799ae7648ba22472d1aa14a0d953b0ca188f401db4f231ce649a
2024-01-24T03:55:23Z +++ Block connected to chain: 12 BH: 000004161bc4ecbc3917c0998e206aaf11922e4c6f1e207399a89130357de93e
2024-01-24T03:55:23Z +++ Block connected to chain: 13 BH: 0000027396afc5b86ef53cd496141937e41c6cd6123942e4a7413d0026e09738
2024-01-24T03:55:23Z +++ Block connected to chain: 14 BH: 00000fa8eea79755647e872a7ed31e7ac782e382fa7e2d7c74e1e73e337da281
2024-01-24T03:55:23Z +++ Block connected to chain: 15 BH: 00000dd22c623baf707a2cca34d75c6ac766bf28b514ed0170111d16912f5ed8
2024-01-24T03:55:23Z +++ Block connected to chain: 16 BH: 00000f38e8c04b2421fbff9006e23a462faf639759fd81fca40a1392d0afa8b7
2024-01-24T03:55:23Z Synchronizing blockheaders, height: 2112000 (~81.51%)

Just got this error:

2024-01-24T03:58:59Z +++ Block connected to chain: 3001 BH: e56999e42679b5dde1227adfde35039a055c9e32e322664bbcdc31b2be8c82d0
2024-01-24T03:58:59Z Warning: content (3) field (1) not indexed in search db
2024-01-24T03:58:59Z Warning: content (3) field (1) not indexed in search db
2024-01-24T03:58:59Z +++ Block connected to chain: 3002 BH: 4fc8cc5ed6b71073be9e66e329e04c95b3171a07a62852ebd58a90f49b86c718
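To put a number on how fast blocks are being connected, the "Block connected to chain" lines in the two excerpts above can be turned into a rate. This is a minimal sketch that assumes the log format is exactly as quoted in this thread. The name `connect_rate` is made up for illustration and is not part of the node software.

```python
import re
from datetime import datetime

# Matches lines such as:
# 2024-01-24T03:55:23Z +++ Block connected to chain: 1 BH: 00000d21...
BLOCK_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z "
    r"\+\+\+ Block connected to chain: (\d+) BH:"
)


def connect_rate(log_text):
    """Return blocks connected per second between the first and last
    matching entries, or None if the rate cannot be computed."""
    entries = [
        (datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S"), int(height))
        for stamp, height in BLOCK_RE.findall(log_text)
    ]
    if len(entries) < 2:
        return None
    (first_time, first_height) = entries[0]
    (last_time, last_height) = entries[-1]
    seconds = (last_time - first_time).total_seconds()
    if seconds <= 0:
        return None  # all entries share a timestamp; no usable interval
    return (last_height - first_height) / seconds
```

Because the pattern is searched across the whole text, it works whether the entries are one per line or run together. Comparing the rate early in the sync with the rate near the chain tip would show where the slowdown starts.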

the-real-vortex-v commented 6 months ago

I'm getting lots of timeouts from nodes. It looks like Bitcoin has updated how it deals with peers that time out a lot.

2024-01-27T21:52:39Z +++ Block connected to chain: 2599219 BH: 9aec3c6a62b39b09dd5f53875b18df9e34926a87933ae3960193d4cae46ed052
2024-01-27T21:52:45Z Synchronizing blockheaders, height: 2601114 (~100.00%)
2024-01-27T21:52:53Z Timeout downloading block 136a18e536a613565bfe42f9a38bcda91e823ca41efde36866daa84eb255ac85 from peer=8, disconnecting
2024-01-27T21:53:12Z Timeout downloading block 02bc2ad1c336b69bd802289f023faa247430ee2cf2174b58080a0a6423f7a2f3 from peer=5, disconnecting

There is a bitcoind code update that tries to address some of these issues: https://github.com/bitcoin/bitcoin/pull/27626

This is from last year v24 of bitcoin.
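The "Timeout downloading block" entries in the log above can be tallied per peer to see whether the timeouts cluster on a few peers or are spread across all of them. This is a small sketch that assumes the message format quoted in this thread. `timeouts_by_peer` is an illustrative name, not an existing tool.

```python
import re
from collections import Counter

# Matches: Timeout downloading block <64 hex chars> from peer=<id>
TIMEOUT_RE = re.compile(
    r"Timeout downloading block ([0-9a-f]{64}) from peer=(\d+)"
)


def timeouts_by_peer(log_text):
    """Count block-download timeouts per peer id found in the log text."""
    return Counter(
        int(peer) for _block_hash, peer in TIMEOUT_RE.findall(log_text)
    )
```

If the counts are roughly even across peers, that would point at the local node being too busy to service the network rather than at particular bad peers, though that reading is an inference and not something the log alone proves.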

andyoknen commented 5 months ago

My guess is that this is most likely the process that pre-calculates account statistics. It runs every 60 blocks and can take up a lot of resources. To verify this, try disabling the public subsystem: api=0

At the moment, the data is updated throughout the database and across all accounts, even if they have not changed in the last 60 blocks. This is a very ugly design, and I am refactoring this process to speed up data updates and reduce the load.
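The general idea behind that kind of refactoring can be shown with a toy model: keep track of which accounts were touched since the last refresh and recompute only those. This is a hypothetical sketch, not pocketnet's code. The class `AccountStats`, its fields, and the use of a simple sum as the "statistic" are all invented for illustration; only the 60-block interval comes from the comment above.

```python
from collections import defaultdict

RECALC_INTERVAL = 60  # blocks between refreshes, per the comment above


class AccountStats:
    """Toy model: cached per-account statistics refreshed only for
    accounts that changed since the previous refresh."""

    def __init__(self):
        self.events = defaultdict(list)  # account -> raw activity values
        self.stats = {}                  # account -> cached statistic
        self.dirty = set()               # accounts changed since last refresh
        self.recalculated = 0            # number of per-account refreshes run

    def record(self, account, value):
        """Store new activity and mark the account as needing a refresh."""
        self.events[account].append(value)
        self.dirty.add(account)

    def on_block(self, height):
        """Refresh dirty accounts on every RECALC_INTERVAL-th block."""
        if height % RECALC_INTERVAL != 0:
            return
        for account in self.dirty:
            self.stats[account] = sum(self.events[account])
            self.recalculated += 1
        self.dirty.clear()
```

With this shape, the work per refresh scales with the number of accounts active in the last 60 blocks instead of with the total number of accounts.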