ton-blockchain / ton

Main TON monorepo

Slow node synchronization #886

Open jzethar opened 8 months ago

jzethar commented 8 months ago

Slow node sync

Here I'm using an archive node without validating, only syncing blocks.

Server characteristics

Server has:

  1. 5 TB disk with zfs pool
  2. 256GB of RAM
  3. AMD EPYC 7763 64-Core Processor
  4. Debian 5.10.149-2

Update flow

After updating from v2023.10 to v2024.01, the node started to lag. The validator-engine was self-compiled from the TON repo, following the repo's build recommendations. The update went as follows:

  1. Compile new version of node
  2. Stop validator
  3. Update validator
  4. Validator crashed (discovered 5 hours later)
  5. Recompiled validator
  6. Validator is normally syncing

The node has been running for one to two days without crashing, but it is lagging and cannot catch up to the latest blocks. It is syncing (as seen from the logs), but abnormally slowly, and the gap between the latest block and the last synced block keeps growing (8 hours ago it was 15k blocks; now it is 16k blocks).
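To put a number on how fast the node is falling behind, here is a minimal sketch of the arithmetic. It assumes the two gap figures quoted above (15k and 16k blocks, 8 hours apart) are exact samples; the function name `lag_growth_per_hour` is made up for illustration and is not part of any TON tooling.

```python
def lag_growth_per_hour(gap_before: int, gap_after: int, hours: float) -> float:
    """Return how many blocks per hour the sync gap grows (negative means catching up)."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (gap_after - gap_before) / hours


# Figures from the report: the gap went from 15k to 16k blocks over 8 hours.
rate = lag_growth_per_hour(15_000, 16_000, 8)
print(f"gap grows by {rate:.0f} blocks per hour")
```

A positive result means the node syncs more slowly than the network produces blocks, so it will never catch up at its current speed. Sampling the gap at regular intervals and watching the sign of this rate is a simple way to check whether a configuration or hardware change helped.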

Some logs

[ 3][t 7][2024-02-02 14:00:34.416560689][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 5][2024-02-02 14:00:34.416565629][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 4][2024-02-02 14:00:34.416572149][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 7][2024-02-02 14:00:34.416573309][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 6][2024-02-02 14:00:34.416573199][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 5][2024-02-02 14:00:34.416581159][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 7][2024-02-02 14:00:34.416588839][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 6][2024-02-02 14:00:34.416601799][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 5][2024-02-02 14:00:34.416602869][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 7][2024-02-02 14:00:34.416603579][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 7][2024-02-02 14:04:00.262468356][liteserver.cpp:234][!litequery]        started a getMasterchainInfo(-1) liteserver query
[ 3][t 7][2024-02-02 14:04:00.262475156][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 5][2024-02-02 14:04:00.262483716][liteserver.cpp:234][!litequery]        started a getMasterchainInfo(-1) liteserver query
[ 3][t 5][2024-02-02 14:04:00.262489546][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 3][t 5][2024-02-02 14:04:00.262500696][liteserver.cpp:234][!litequery]        started a getMasterchainInfo(-1) liteserver query
[ 3][t 5][2024-02-02 14:04:00.262506436][liteserver.cpp:741][!litequery]        started a getAccountState((-1,8000000000000000,35828298):EFC498336DA763DB2F9D5E27F59B99DA505297B826A207E8387A02C6F2FA3C52:69E123CD31F5AFB89A32EEAA6D1F7589734E1042ECB8D6A5CC7F795972E5FD9E, 0, 27D28A4C04F71995216C0E7CCA34DA08DCCB20836C3CE5119245B339DF102FDD, -2147483648) liteserver query
[ 3][t 5][2024-02-02 14:04:00.262529016][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]
[ 2][t 2][2024-02-02 14:04:00.262663247][adnl-ext-server.cpp:34][!manager]      failed ext query: [Error : 651 : node not synced]
[ 2][t 2][2024-02-02 14:04:00.262686567][adnl-ext-server.cpp:34][!manager]      failed ext query: [Error : 651 : node not synced]
[ 2][t 2][2024-02-02 14:04:00.262696887][adnl-ext-server.cpp:34][!manager]      failed ext query: [Error : 651 : node not synced]
[ 2][t 2][2024-02-02 14:04:00.262705788][adnl-ext-server.cpp:34][!manager]      failed ext query: [Error : 651 : node not synced]
[ 3][t 3][2024-02-02 14:04:02.325045744][download-archive-slice.cpp:148][!archive]    downloading archive slice #35828238 from yfnIJiL2oWKjJHHg7DzGs6IjxLnqWOxuWbUTcYSwrUw=
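For anyone triaging a longer log of this kind, a small script can count which errors dominate. The sketch below is only an illustration: the regular expressions are inferred from the lines quoted above, not taken from any official description of the validator-engine log format, and the helper name `summarize_errors` is hypothetical.

```python
import re
from collections import Counter

# Line layout inferred from the quoted logs (an assumption, not a format spec):
# [level][t thread][timestamp][source.cpp:line][!actor] message
LINE_RE = re.compile(
    r"\[\s*(?P<level>\d+)\]"
    r"\[t\s*(?P<thread>\d+)\]"
    r"\[(?P<timestamp>[^\]]+)\]"
    r"\[(?P<source>[^\]:]+):(?P<line>\d+)\]"
    r"\[!(?P<actor>[^\]]+)\]"
    r"\s*(?P<message>.*)"
)
# Error payload as it appears in the quoted lines: [Error : <code> : <text>]
ERROR_RE = re.compile(r"\[Error : (?P<code>-?\d+) : (?P<text>[^\]]+)\]")


def summarize_errors(lines):
    """Count error lines grouped by (source file, error code, error text)."""
    counts = Counter()
    for raw in lines:
        # search() rather than match() so stray leading characters are tolerated
        parsed = LINE_RE.search(raw)
        if parsed is None:
            continue
        error = ERROR_RE.search(parsed.group("message"))
        if error is None:
            continue  # informational line, e.g. "started a ... liteserver query"
        key = (parsed.group("source"), int(error.group("code")), error.group("text"))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "[ 3][t 7][2024-02-02 14:00:34.416560689][liteserver.cpp:79][!litequery] aborted liteserver query: [Error : -503 : timeout]",
        "[ 2][t 2][2024-02-02 14:04:00.262663247][adnl-ext-server.cpp:34][!manager]      failed ext query: [Error : 651 : node not synced]",
    ]
    for key, count in summarize_errors(sample).most_common():
        print(count, key)
```

In the excerpt above, this kind of tally would show only two error types: liteserver query timeouts (-503) and "node not synced" (651) rejections. Both are consistent with a node that is simply behind the network, not with a distinct failure.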

Are there any recommendations for solving this? @EmelyanenkoK @akifoq @XaBbl4 @aleksej-paschenko

airstring commented 5 months ago

How did you deal with this problem?

awesome-doge commented 3 months ago

I'm more curious about what kind of drive you are using now. A SATA SSD? An M.2 SSD? Or a regular hard drive?

Building a TON archive node places high demands on I/O speed, so it is best to use an M.2 SSD.