paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

Slow block import #13

Open purestakeoskar opened 1 year ago

purestakeoskar commented 1 year ago


Description of bug

When syncing a node, block import is very slow, to the point where block production is faster than block import. Instead of Syncing, the logs show Preparing.

2023-04-17 14:41:30.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373001 (31 peers), best: #3372537 (0x41ba…81b4), finalized #3372219 (0x602b…446f), ⬇ 35.2kiB/s ⬆ 4.8kiB/s
2023-04-17 14:41:34.478 DEBUG tokio-runtime-worker runtime::system: [🌗] [3372538] 0 extrinsics, length: 10962 (normal 0%, op: 0%, mandatory 0%) / normal weight:Weight(ref_time: 265621300000, proof_size: 0) (70%) op weight Weight(ref_time: 0, proof_size: 0) (0%) / mandatory weight Weight(ref_time: 7235415758, proof_size: 0) (0%)
2023-04-17 14:41:34.500 TRACE tokio-runtime-worker sync::import-queue: [🌗] Block imported successfully Some(3372538) (0x10fe…ec7d)
2023-04-17 14:41:34.500 TRACE tokio-runtime-worker sync::import-queue: [🌗] Header 0xb8de…8fa7 has 4 logs
2023-04-17 14:41:35.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.2 bps, target=#3373001 (31 peers), best: #3372538 (0x10fe…ec7d), finalized #3372219 (0x602b…446f), ⬇ 3.8kiB/s ⬆ 4.8kiB/s
2023-04-17 14:41:38.733 TRACE tokio-runtime-worker sync::import-queue: [🌗] Scheduling 1 blocks for import
2023-04-17 14:41:40.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373002 (31 peers), best: #3372538 (0x10fe…ec7d), finalized #3372219 (0x602b…446f), ⬇ 22.6kiB/s ⬆ 3.9kiB/s
2023-04-17 14:41:45.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373002 (31 peers), best: #3372538 (0x10fe…ec7d), finalized #3372219 (0x602b…446f), ⬇ 7.0kiB/s ⬆ 3.8kiB/s
2023-04-17 14:41:50.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373002 (31 peers), best: #3372538 (0x10fe…ec7d), finalized #3372219 (0x602b…446f), ⬇ 4.2kiB/s ⬆ 4.5kiB/s
2023-04-17 14:41:51.372 TRACE tokio-runtime-worker sync::import-queue: [🌗] Scheduling 1 blocks for import
2023-04-17 14:41:53.983 DEBUG tokio-runtime-worker runtime::system: [🌗] [3372539] 0 extrinsics, length: 45566 (normal 1%, op: 0%, mandatory 0%) / normal weight:Weight(ref_time: 357569150000, proof_size: 0) (95%) op weight Weight(ref_time: 0, proof_size: 0) (0%) / mandatory weight Weight(ref_time: 7235415758, proof_size: 0) (0%)
2023-04-17 14:41:54.008 TRACE tokio-runtime-worker sync::import-queue: [🌗] Block imported successfully Some(3372539) (0xb8de…8fa7)
2023-04-17 14:41:54.008 TRACE tokio-runtime-worker sync::import-queue: [🌗] Header 0x1835…434d has 4 logs
2023-04-17 14:41:55.389  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.2 bps, target=#3373003 (31 peers), best: #3372539 (0xb8de…8fa7), finalized #3372219 (0x602b…446f), ⬇ 11.9kiB/s ⬆ 3.2kiB/s
2023-04-17 14:42:00.390  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373003 (31 peers), best: #3372539 (0xb8de…8fa7), finalized #3372219 (0x602b…446f), ⬇ 10.7kiB/s ⬆ 4.1kiB/s
2023-04-17 14:42:00.390  WARN tokio-runtime-worker telemetry: [🌗] ❌ Error while dialing /dns/telemetry.polkadot.io/tcp/443/x-parity-wss/%2Fsubmit%2F: Custom { kind: Other, error: Timeout }
2023-04-17 14:42:05.390  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373003 (31 peers), best: #3372539 (0xb8de…8fa7), finalized #3372219 (0x602b…446f), ⬇ 11.4kiB/s ⬆ 3.0kiB/s
2023-04-17 14:42:06.384 TRACE tokio-runtime-worker sync::import-queue: [🌗] Scheduling 1 blocks for import
2023-04-17 14:42:09.607 DEBUG tokio-runtime-worker runtime::system: [🌗] [3372540] 0 extrinsics, length: 26287 (normal 0%, op: 0%, mandatory 0%) / normal weight:Weight(ref_time: 327287250000, proof_size: 0) (87%) op weight Weight(ref_time: 0, proof_size: 0) (0%) / mandatory weight Weight(ref_time: 7235415758, proof_size: 0) (0%)
2023-04-17 14:42:09.632 TRACE tokio-runtime-worker sync::import-queue: [🌗] Block imported successfully Some(3372540) (0x1835…434d)
2023-04-17 14:42:09.632 TRACE tokio-runtime-worker sync::import-queue: [🌗] Header 0x620c…caa3 has 4 logs
2023-04-17 14:42:10.390  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.2 bps, target=#3373004 (31 peers), best: #3372540 (0x1835…434d), finalized #3372219 (0x602b…446f), ⬇ 29.9kiB/s ⬆ 4.6kiB/s
2023-04-17 14:42:15.390  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373004 (31 peers), best: #3372540 (0x1835…434d), finalized #3372219 (0x602b…446f), ⬇ 20.8kiB/s ⬆ 4.2kiB/s

The node is connected to peers that have the blocks we need.

2023-04-17 14:45:38.109  INFO tokio-runtime-worker substrate: [🌗] ⚙️  Preparing  0.0 bps, target=#3373020 (18 peers), best: #3372550 (0x18a5…4ca5), finalized #3372219 (0x602b…446f), ⬇ 5.2kiB/s ⬆ 2.3kiB/s
2023-04-17 14:45:39.845 TRACE tokio-runtime-worker sync: [🌗] New peer 12D3KooWH3nhNXgsiPhVREEcrbbWwVyNfuLnwfrgwdkTPKjtHS2N BlockAnnouncesHandshake { roles: FULL, best_number: 3373019, best_hash: 0x2bb9696090704a6ca516b482a1661177894ba7a63f1dffa3d958edb2d19a980f, genesis_hash: 0xfe58ea77779b7abda7da4ec526d14db9b1e9cd40a217c34892af80a9b332b76d }
2023-04-17 14:45:39.845 DEBUG tokio-runtime-worker sync: [🌗] New peer with known best hash 0x2bb9…980f (3373019).
2023-04-17 14:45:39.845 DEBUG tokio-runtime-worker sync: [🌗] Connected 12D3KooWH3nhNXgsiPhVREEcrbbWwVyNfuLnwfrgwdkTPKjtHS2N
2023-04-17 14:45:41.464 TRACE tokio-runtime-worker sync: [🌗] New peer 12D3KooWGNoqQTFj92X8S2x42uZFexUQRQFeLDiWzarJmXJBV42F BlockAnnouncesHandshake { roles: FULL, best_number: 3373019, best_hash: 0x2bb9696090704a6ca516b482a1661177894ba7a63f1dffa3d958edb2d19a980f, genesis_hash: 0xfe58ea77779b7abda7da4ec526d14db9b1e9cd40a217c34892af80a9b332b76d }
2023-04-17 14:45:41.464 DEBUG tokio-runtime-worker sync: [🌗] New peer with known best hash 0x2bb9…980f (3373019).
2023-04-17 14:45:41.464 DEBUG tokio-runtime-worker sync: [🌗] Connected 12D3KooWGNoqQTFj92X8S2x42uZFexUQRQFeLDiWzarJmXJBV42F
2023-04-17 14:45:41.474 TRACE tokio-runtime-worker sync: [🌗] 12D3KooWGNoqQTFj92X8S2x42uZFexUQRQFeLDiWzarJmXJBV42F Ignoring transactions while major syncing

There are also queued synced blocks (sync_queued_blocks metric; see the attached screenshot).

Another interesting note is that the node does not know how far it needs to sync: its sync_target is equal to best_block (see the attached screenshot).

Steps to reproduce

Start syncing a Moonbeam or Moonriver node with an archive parachain and a pruned relay chain. We are running Moonbeam version 0.30.3 (using Polkadot 0.9.37). Main flags used:

--execution=wasm
--pruning=archive
--rpc-cors=all
--unsafe-rpc-external
--unsafe-ws-external
--rpc-methods=safe
--db-cache=1024
--trie-cache-size=1073741824
--runtime-cache-size=32
--eth-log-block-cache=1000000000
--eth-statuses-cache=1000000000
--detailed-log-output
--no-hardware-benchmarks
--ws-max-connections=800
jasl commented 1 year ago

khala-node.log.zip Here's the log (from the Azure VM) with the optimization. It is still slow, but the speed has improved compared to the previous log.

I didn't change anything on the host OS; I only added the tuning to the Docker command.

skunert commented 1 year ago

khala-node.log.zip Here's the log (from the Azure VM) with the optimization. It is still slow, but the speed has improved compared to the previous log.

I didn't change anything on the host OS; I only added the tuning to the Docker command.

Thanks for your help and investigation! Maybe this is really a config issue. ~Did you try increasing the nofile limit? In the light of your earlier comment that might make sense.~ Never mind, it seems to be already pretty high.

jasl commented 1 year ago

Did you try increasing the nofile limit

Docker default nofile is unlimited, so there is no need to tune this for Docker. For bare metal, people need to increase the value.
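For bare-metal setups, a minimal sketch of raising the nofile limit (the value and user name below are examples; note that a later comment in this thread points out the node also raises its own limit via setrlimit):

# raise the limit for the current shell only
ulimit -n 65536

# or persist it per user via /etc/security/limits.conf:
#   nodeuser  soft  nofile  65536
#   nodeuser  hard  nofile  65536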

Even with the optimization, my ops colleague still suspects IO; he still sees a lot of wasted IO (in his view).

[May 30, 2023 at 17:27:08]:
just run atop

that shows real time

if you install it, there is a background process that records that every 10 mins

you can change that to 1m or something else

you can read those sampling data files

(also via atop)

Does the Parity ops team track how a node uses IO? Could you ask them to monitor this and confirm it looks good?
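For reference, a minimal sketch of the atop workflow described in the quoted notes; the package defaults, log path, and /etc/default/atop interval setting are assumptions based on Debian/Ubuntu and may differ per distro:

# live view, sampling every 5 seconds; press 'd' for the per-process disk view
atop 5

# the atop service records samples to /var/log/atop/atop_YYYYMMDD;
# the sampling interval (default 600 s) is set via LOGINTERVAL in /etc/default/atop
sudo sed -i 's/^LOGINTERVAL=.*/LOGINTERVAL=60/' /etc/default/atop
sudo systemctl restart atop

# replay a recorded day and step through samples with 't' / 'T'
atop -r /var/log/atop/atop_20230530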

skunert commented 1 year ago

Does the Parity ops team track how a node uses IO? Could you ask them to monitor this and confirm it looks good?

@PierreBesson Do you maybe know if we monitor and what is expected during sync?

PierreBesson commented 1 year ago

I think it is expected to have heavy IO when syncing the nodes, as you are populating the db. To monitor your IO you should use the Prometheus node-exporter. @jasl if you want to see precise IO usage data over time, you can set it up on your node.
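For anyone wanting to follow this suggestion, a minimal sketch of running node-exporter next to the node; the image, port, and metric names follow the upstream node_exporter defaults and should be treated as assumptions to verify:

# run node-exporter with host access so it can read /proc and /sys disk stats
docker run -d --name node-exporter --net host --pid host \
  -v /:/host:ro,rslave \
  quay.io/prometheus/node-exporter:latest --path.rootfs=/host

# per-disk IO counters are then exposed on :9100 for Prometheus to scrape, e.g.
curl -s localhost:9100/metrics | grep -E 'node_disk_(reads|writes)_completed_total'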

jasl commented 1 year ago

I think it is expected to have heavy IO when syncing the nodes, as you are populating the db. To monitor your IO you should use the Prometheus node-exporter. @jasl if you want to see precise IO usage data over time, you can set it up on your node.

Thank you, we shall monitor the IO metrics.

jasl commented 1 year ago

@skunert With the OS tuning, even the worst case can move forward now. I'm OK with closing the issue. Thank you and all participants.

I've proposed to our ops team that they build a Grafana chart for IO so that I can open a separate issue for that.

BTW, is it possible to separate downloading and validating blocks (at least during major sync)? I think we could validate blocks lazily and load them in batches, which might reduce IO (especially random IO) and improve efficiency.

skunert commented 1 year ago

@skunert With the OS tuning, even the worst case can move forward now. I'm OK with closing the issue. Thank you and all participants.

Okay, good to know! Let's keep this open for now; I have one or two more ideas of what to look into.

I've proposed to our ops team that they build a Grafana chart for IO so that I can open a separate issue for that.

BTW, is it possible to separate downloading and validating blocks (at least during major sync)? I think we could validate blocks lazily and load them in batches, which might reduce IO (especially random IO) and improve efficiency.

What do you mean by this? We download blocks in chunks, then add them to an import queue. If there are too many blocks in the queue, we stop issuing new block requests for a while (the current limit is ~2000). But downloading is not really the limiting factor; block execution is. The node executes the downloaded blocks as fast as possible, and there is no way around that for the classic sync strategy, since we need to update the state block by block.

jasl commented 1 year ago

What do you mean by this? We download blocks in chunks, then add them to an import queue. If there are too many blocks in the queue, we stop issuing new block requests for a while (the current limit is ~2000). But downloading is not really the limiting factor; block execution is. The node executes the downloaded blocks as fast as possible, and there is no way around that for the classic sync strategy, since we need to update the state block by block.

Ok, I got it. It was just a random thought. Substrate has a --sync fast mode, so I wondered whether I could download blocks and proofs first and then lazily execute them to build the full state. Our use case requires archive mode, so we can't use that, but I'm guessing this approach could be faster than a full sync.

skunert commented 1 year ago

I ran some more tests today, with more debug logs added and a slower disk to make the issue more visible. I saw multiple occurrences of very slow imports. The culprit seems to be the commit to the database. The slow imports took multiple minutes between the two log lines below.

2023-05-31 13:38:15.029 DEBUG tokio-runtime-worker db: [Parachain] DB Commit 0xeba4f1c5d7ab981074384a441af4dc1c1f3ce46f397c562e141a1bbeecd41785 (1221748), best=true, state=true, existing=false, finalized=false    
...
2023-05-31 13:42:24.816 TRACE tokio-runtime-worker db: [Parachain] DB Commit done 0xeba4f1c5d7ab981074384a441af4dc1c1f3ce46f397c562e141a1bbeecd41785    

This took more than 200 seconds in one example and more than 400 seconds in another, which seems very excessive. Maybe a bug in ParityDB? cc: @arkpar logs

arkpar commented 1 year ago

@jasl What's your OS and filesystem? Could you collect another flamegraph for the slow import?

This is most likely throttling by Azure caused by hitting VM or disk IOPS limits. ParityDB indeed relies a lot on random disk IO. Typically, a consumer-grade SSD performs well enough that an archive node can sync with no issues. We haven't seen such problems with other cloud providers either.

Try setting the disk sector size as low as possible, i.e. 512 bytes.

Regarding ulimit -n, this setting is irrelevant. The Polkadot process sets its own limit on startup with a call to setrlimit, which overrides the ulimit settings. The page lock limit (memlock) should also have no effect, since we don't lock any pages.
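Two quick checks related to the advice above, as a sketch (the device and binary names are examples): the disk's reported sector sizes, and the open-files limit of the already-running process, which reflects the setrlimit call rather than the shell's ulimit:

# logical/physical sector size of the disk backing the database
lsblk -o NAME,LOG-SEC,PHY-SEC /dev/nvme0n1

# effective file-descriptor limit of the running node process
cat /proc/$(pidof polkadot)/limits | grep -i 'open files'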

jasl commented 1 year ago

@jasl What's your OS and filesystem? Could you collect another flamegraph for the slow import?

This is most likely throttling by Azure caused by hitting VM or disk IOPS limits. ParityDB indeed relies a lot on random disk IO. Typically, a consumer-grade SSD performs well enough that an archive node can sync with no issues. We haven't seen such problems with other cloud providers either.

Try setting the disk sector size as low as possible, i.e. 512 bytes.

Regarding ulimit -n, this setting is irrelevant. The Polkadot process sets its own limit on startup with a call to setrlimit, which overrides the ulimit settings. The page lock limit (memlock) should also have no effect, since we don't lock any pages.

I'll do that later. I can also share results from a consumer-grade SSD (Crucial MX500 SATA) with the default Ubuntu 22.04 settings, which is slow too (but at least moving).

I also got a report that an entry-level enterprise U.2 SSD (WD SN640) has the issue (I reported it here: https://github.com/paritytech/polkadot-sdk/issues/13), and that case is odd: with the memlock tuning, the user says the node runs better, but it is still very slow.

I also checked the chat history: one user claimed he is using a Kingston Data Center SEDC500M/1920G enterprise SSD.

jasl commented 1 year ago

@arkpar I'm now testing a Crucial MX500 SATA on my Minisforum UM773 Lite (Ryzen 7735HS, 64 GB DDR5-4800) mini PC; here's the flame graph.

Ubuntu 22.04.1, ext4 with LVM enabled, all default settings (only ulimit -n changed).

sync sync.svg.zip

A node log sample:

[Relaychain] ⚙️  Syncing 14.0 bps, target=#18164707 (40 peers), best: #12548322 (0x6734…933b), finalized #12548096 (0x4fd4…d06c), ⬇ 930.4kiB/s ⬆ 156.1kiB/s
[Parachain] ⚙️  Syncing  2.2 bps, target=#3995002 (40 peers), best: #3231285 (0xaed7…6a41), finalized #1561916 (0xb7a7…a85f), ⬇ 29.8kiB/s ⬆ 7.7kiB/s
[Relaychain] ⚙️  Syncing 17.2 bps, target=#18164710 (38 peers), best: #12548408 (0x6c9b…a290), finalized #12548096 (0x4fd4…d06c), ⬇ 1.8MiB/s ⬆ 127.9kiB/s
[Parachain] ⚙️  Syncing  0.8 bps, target=#3995002 (40 peers), best: #3231289 (0xf68b…e1aa), finalized #1561916 (0xb7a7…a85f), ⬇ 29.4kiB/s ⬆ 12.5kiB/s
[Relaychain] ⚙️  Syncing 51.2 bps, target=#18164710 (40 peers), best: #12548664 (0x7ef4…3f03), finalized #12548608 (0x1f91…435f), ⬇ 659.5kiB/s ⬆ 152.8kiB/s
[Parachain] ⚙️  Syncing  2.4 bps, target=#3995002 (40 peers), best: #3231301 (0xb6e6…544d), finalized #1562070 (0x20e3…6bd7), ⬇ 72.6kiB/s ⬆ 10.0kiB/s
[Relaychain] ⚙️  Syncing 18.8 bps, target=#18164710 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 450.6kiB/s ⬆ 136.7kiB/s
[Parachain] ⚙️  Syncing  1.4 bps, target=#3995002 (40 peers), best: #3231308 (0x51b8…1e1e), finalized #1562070 (0x20e3…6bd7), ⬇ 37.3kiB/s ⬆ 9.7kiB/s
[Relaychain] ⚙️  Syncing  0.0 bps, target=#18164710 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 311.3kiB/s ⬆ 155.5kiB/s
[Parachain] ⚙️  Syncing  0.0 bps, target=#3995002 (40 peers), best: #3231308 (0x51b8…1e1e), finalized #1562070 (0x20e3…6bd7), ⬇ 64.6kiB/s ⬆ 11.0kiB/s
[Relaychain] ⚙️  Syncing  0.0 bps, target=#18164711 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 200.1kiB/s ⬆ 163.3kiB/s
[Parachain] ⚙️  Syncing  0.0 bps, target=#3995002 (40 peers), best: #3231308 (0x51b8…1e1e), finalized #1562070 (0x20e3…6bd7), ⬇ 36.3kiB/s ⬆ 9.8kiB/s
[Relaychain] ⚙️  Syncing  0.0 bps, target=#18164712 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 180.2kiB/s ⬆ 148.3kiB/s
[Parachain] ⚙️  Syncing  0.0 bps, target=#3995002 (40 peers), best: #3231308 (0x51b8…1e1e), finalized #1562070 (0x20e3…6bd7), ⬇ 37.6kiB/s ⬆ 10.0kiB/s
[Relaychain] ⚙️  Syncing  0.0 bps, target=#18164715 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 207.2kiB/s ⬆ 189.8kiB/s
[Parachain] ⚙️  Syncing  0.4 bps, target=#3995002 (40 peers), best: #3231310 (0xd21c…f763), finalized #1562070 (0x20e3…6bd7), ⬇ 50.7kiB/s ⬆ 11.6kiB/s
[Relaychain] ⚙️  Syncing  0.0 bps, target=#18164715 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 174.6kiB/s ⬆ 325.9kiB/s
[Parachain] ⚙️  Syncing  2.8 bps, target=#3995002 (40 peers), best: #3231324 (0xb653…460a), finalized #1562070 (0x20e3…6bd7), ⬇ 17.5kiB/s ⬆ 11.7kiB/s
[Relaychain] ⚙️  Syncing  0.0 bps, target=#18164715 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 229.0kiB/s ⬆ 203.2kiB/s
[Parachain] ⚙️  Syncing  3.0 bps, target=#3995002 (40 peers), best: #3231339 (0x39e6…b670), finalized #1562070 (0x20e3…6bd7), ⬇ 81.1kiB/s ⬆ 10.8kiB/s
[Relaychain] ⚙️  Syncing  2.6 bps, target=#18164715 (40 peers), best: #12548771 (0x77bb…21d1), finalized #12548608 (0x1f91…435f), ⬇ 257.9kiB/s ⬆ 299.7kiB/s
[Parachain] ⚙️  Syncing  3.4 bps, target=#3995002 (40 peers), best: #3231356 (0x45ce…c321), finalized #1562070 (0x20e3…6bd7), ⬇ 24.1kiB/s ⬆ 12.0kiB/s

For Azure, I'll report later.

skunert commented 1 year ago

@jasl What's your OS and filesystem? Could you collect another flamegraph for the slow import?

This is most likely throttling by Azure caused by hitting VM or disk IOPS limits. ParityDB indeed relies a lot on random disk IO. Typically, a consumer-grade SSD performs well enough that an archive node can sync with no issues. We haven't seen such problems with other cloud providers either.

Try setting the disk sector size as low as possible, i.e. 512 bytes.

Regarding ulimit -n, this setting is irrelevant. The Polkadot process sets its own limit on startup with a call to setrlimit, which overrides the ulimit settings. The page lock limit (memlock) should also have no effect, since we don't lock any pages.

The logs in my previous post were gathered by me on a gcloud VM (e2-standard-16) with a standard disk (pd-standard). The standard disk is indeed super slow, and I was trying to make this issue appear more often. I also have logs from the same machine with a faster (pd-standard) disk where this issue occurred and some block imports take ~60s.

Limits for reference:

(screenshot of the configured limits)
arkpar commented 1 year ago

Looking at the flamegraph above, there seem to be a lot of header queries made by the transaction pool. I think this was fixed recently, @bkchr?

bkchr commented 1 year ago

I don't get why the transaction pool is querying the headers. The transaction pool only operates on block import or block finalized notifications. There should be no block import notifications when we sync. There could only be a finality notification from time to time. The tx pool will skip the maintenance if the distance between the current and last operated block is too high (more than 20 blocks).

jasl commented 1 year ago

I don't get why the transaction pool is querying the headers. The transaction pool only operates on block import or block finalized notifications. There should be no block import notifications when we sync. There could only be a finality notification from time to time. The tx pool will skip the maintenance if the distance between the current and last operated block is too high (more than 20 blocks).

Is this Substrate-internal logic, or is it exposed to the node's service.rs?

bkchr commented 1 year ago

Just checked your code. It is internal Substrate logic, and I don't see you calling this function yourself.

bkchr commented 1 year ago

@jasl can you run with txpool=trace for some minutes? How long did you run the node to capture the flamegraph above?

jasl commented 1 year ago

@jasl can you run with txpool=trace for some minutes? How long did you run the node to capture the flamegraph above?

That's roughly 30 seconds. I also recorded one for about 5-10 minutes (I started the recording and went AFK for a while).

sync sync.svg.zip

I'll try txpool=trace later
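A sketch of how such a capture can be done, assuming the log-target flag syntax used elsewhere in this thread (-ltxpool=trace) and the common perf + FlameGraph tooling; the binary name and durations are examples:

# run the node with transaction-pool trace logging and keep the output (other args omitted)
./khala-node -ltxpool=trace ... 2>&1 | tee txpool-trace.log

# in parallel, sample the running process for ~30 s and render a flame graph
# (stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph scripts)
perf record -F 99 -g -p $(pidof khala-node) -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > sync.svg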

jasl commented 1 year ago

@bkchr I think the txpool is in trouble; here is the log. I only ran the node for less than 5 minutes, and that generated an 8 GB log.

https://storage.googleapis.com/phala-misc/trace_txpool.log.zip

NOTE: I must confess that in the early days of Khala being online, our pallet had serious bugs and we had to clear our collators' txpools as a mitigation.

But we found that syncing Phala (on Polkadot) has the same trouble, even though those old problems have since been fixed properly.

For now I don't have a Phala node so I can't share info yet, but I'll do so next week.

arkpar commented 1 year ago

The log is huge because tree_route is called multiple times for a long route and the whole thing is just dumped into the log.

2023-06-01 11:40:55.922 DEBUG tokio-runtime-worker txpool: [Parachain] resolve hash:0xfd9bc5828f075da985eefc978d6da257b4a435c5cc8b8495b67a7d2f853b50e5 finalized:true tree_route:TreeRoute { route: [HashAndNumber { number: 3242935, .... 

There was a PR (https://github.com/paritytech/substrate/pull/13004) to prevent this, but apparently it does not help much.

skunert commented 1 year ago

In the logs we see the transaction pool act on finality notifications. Usually we should only see one finality justification for each incoming block that has a justification attached. A justification should be attached every justification_period blocks or at the end of the era.

@jasl However, I found that in Khala the justification_period is set to 1 here. This is really bad, as the node will store every justification and also give it out during sync. This generates finality notifications, which in turn trigger actions in the node (like in the transaction pool) during sync, so it will impact all nodes that request blocks from it while syncing.

jasl commented 1 year ago

In the logs we see the transaction pool act on finality notifications. Usually we should only see one finality justification for each incoming block that has a justification attached. A justification should be attached every justification_period blocks or at the end of the era.

@jasl However, I found that in Khala the justification_period is set to 1 here. This is really bad, as the node will store every justification and also give it out during sync. This generates finality notifications, which in turn trigger actions in the node (like in the transaction pool) during sync, so it will impact all nodes that request blocks from it while syncing.

Yeah, we patched the justification_period for historical reasons. Do you mean the txpool will be processed when a finality justification is generated? If we revert to the default value of 512, the txpool would only be processed every 512 blocks, which would reduce the pressure on the node.

skunert commented 1 year ago

I mean it's not just the transaction pool; basically every location that listens for finality notifications in Substrate will be notified. In addition, we verify the justification, do extra operations in the database, state-db, etc. This brings extra overhead that we should not have for every single block.

jasl commented 1 year ago

I mean it's not just the transaction pool; basically every location that listens for finality notifications in Substrate will be notified. In addition, we verify the justification, do extra operations in the database, state-db, etc. This brings extra overhead that we should not have for every single block.

I understand that patching justification_period to 1 introduces an additional performance impact; I'll ask our devs whether I can remove the patch.

But I still don't quite understand: do you mean the txpool is woken up by finality notifications and this is required for justification verification? Or do you mean that during major sync the txpool shouldn't be woken up even if it receives a finality notification?

PS: I'm quite new to this domain; are there any docs I can learn from?

bkchr commented 1 year ago

I understand that patching justification_period to 1 introduces an additional performance impact; I'll ask our devs whether I can remove the patch.

There is an RPC to prove finality that you can use.

But I still don't quite understand: do you mean the txpool is woken up by finality notifications and this is required for justification verification? Or do you mean that during major sync the txpool shouldn't be woken up even if it receives a finality notification?

https://github.com/paritytech/substrate/pull/14285 will improve the situation. However, I also think there may still be more to it.

jasl commented 1 year ago

I understand that patching justification_period to 1 introduces an additional performance impact; I'll ask our devs whether I can remove the patch.

There is an RPC to prove finality that you can use.

But I still don't quite understand: do you mean the txpool is woken up by finality notifications and this is required for justification verification? Or do you mean that during major sync the txpool shouldn't be woken up even if it receives a finality notification?

paritytech/substrate#14285 will improve the situation. However, I also think there may still be more to it.

Thank you, I can backport your PR to our node to test it.

skunert commented 1 year ago

But I still don't quite understand: do you mean the txpool is woken up by finality notifications and this is required for justification verification? Or do you mean that during major sync the txpool shouldn't be woken up even if it receives a finality notification?

The transaction pool listens for these notifications and does some maintenance work. We have logic in place to skip this based on block distance. However, when the node is finalizing every block, this maintenance is triggered too often. But as I said, there are other problems as well: finalizing each block individually also takes more resources than finalizing in batches.

Even if you disable this now, your existing nodes will still have the justifications in their db. Ideally, the node would not finalize for every incoming justification during sync, so that will probably change.

jasl commented 1 year ago

I've backported https://github.com/paritytech/substrate/pull/14285 to our Khala node. Here are the new flame graphs:

Short (run about 20 s) sync_short sync_short.svg.zip

Long (run about 1.5 min) sync_long sync_long.svg.zip

bkchr commented 1 year ago

@jasl can you please share the exact CLI args you are using to spawn your node?

This could still be a result of your relay chain node importing a justification for every block. We need to wait for @andresilva to provide a fix that ensures we don't import justifications on every block as you have configured it (which is wrong and should be reverted!).

jasl commented 1 year ago

@jasl can you please share the exact CLI args you are using to spawn your node?

This could still be a result of your relay chain node importing a justification for every block. We need to wait for @andresilva to provide a fix that ensures we don't import justifications on every block as you have configured it (which is wrong and should be reverted!).

./khala-node \
  --chain khala \
  --base-path $DATA_PATH \
  --name $NODE_NAME \
  --port 30333 \
  --prometheus-port 9615 \
  --rpc-port 9933 \
  --ws-port 9944 \
  --database paritydb \
  --no-hardware-benchmarks \
  --no-telemetry \
  --rpc-max-response-size 64 \
  --max-runtime-instances 16 \
  --runtime-cache-size 8 \
  --state-pruning archive-canonical \
  --blocks-pruning archive-canonical \
  -- \
  --chain kusama \
  --port 30334 \
  --prometheus-port 9616 \
  --rpc-port 9934 \
  --ws-port 9945 \
  --database paritydb \
  --no-hardware-benchmarks \
  --no-telemetry \
  --rpc-max-response-size 64 \
  --max-runtime-instances 16 \
  --runtime-cache-size 8 \
  --state-pruning archive-canonical \
  --blocks-pruning archive-canonical

(which is wrong and should be reverted!).

Thank you, now I understand how bad it would be to set a very short justification_period; I'll forward this warning to our team.

skunert commented 1 year ago

@jasl If you want, you could try to run your node with --relay-chain-rpc-url <external-relay-chain-rpc> and point it to one of your Polkadot nodes. This will not start an internal Polkadot node, so the justification period will have no impact there. This way you could at least see whether it syncs faster with this issue eliminated, even before we have the proper fix in place.
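A sketch of what this looks like based on the command shared earlier in the thread: the embedded relay-chain arguments after the -- separator are dropped and replaced by an external RPC endpoint (the URL is a placeholder):

./khala-node \
  --chain khala \
  --base-path $DATA_PATH \
  --database paritydb \
  --state-pruning archive-canonical \
  --blocks-pruning archive-canonical \
  --relay-chain-rpc-url wss://your-kusama-rpc.example:443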

jasl commented 1 year ago

@jasl If you want, you could try to run your node with --relay-chain-rpc-url <external-relay-chain-rpc> and point it to one of your Polkadot nodes. This will not start an internal Polkadot node, so the justification period will have no impact there. This way you could at least see whether it syncs faster with this issue eliminated, even before we have the proper fix in place.

Our collators are using --relay-chain-rpc-url <external-relay-chain-rpc> and don't seem to have any sync issues. I'll try it on our normal nodes later.

But running two apps in one Docker container is not recommended, so while I'd like to test it myself, for end users I shall wait for the proper fix.

bLd75 commented 1 year ago

We're seeing the same behaviour here on an Astar node. It has been tested with relay RPC sync too, but the issue is the same.

It's important to note that it happens only on RPC nodes under a high number of requests. On the attached screenshot we can see the same behavior: the node loses sight of the head of the chain pretty quickly after starting, and this gap increases over time, while finalization gets completely stuck. The node gets restarted after an hour (18:15 on the graph), resyncs to the right chain height, then loses it again.

skunert commented 1 year ago

@bLd75 Can you provide logs with -lsync=trace for the time period where the issue appears? The issues previously discussed in the comments should not occur when you run with an external relay chain node.

crystalin commented 1 year ago

This is still an issue on Polkadot 0.9.43: looking briefly at the sync=trace logs, it seems that blocks are imported fast initially, but as more and more time passes, fewer concurrent blocks are processed. It looks like some block imports might get stuck forever. sync.log

crystalin commented 1 year ago

Additional information: it seems that the import is limited by disk IOPS (this is especially the case with cloud services providing low default IOPS). Surprisingly, even when sync is at 0.0 bps it is still using the maximum IOPS (3000).

I think something changed in a recent version (probably around 0.9.37) which significantly increases the IOPS, eventually reaching a limit where it snowballs and blocks most of the processes.

Ex:

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1        3032.20     15671.60         0.00         0.00     156716          0          0

Increasing it (in real time) to 10k IOPS restores some bps, but it quickly goes back down to 0.2 or 0.0 bps. I think something is inefficient and snowballing.

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1        10114.10     52228.00       356.00         0.00     522280       3560          0
Jul 19 08:13:35 ip-10-0-0-228 parity-moonbeam[375045]: 2023-07-19 08:13:35 [Relaychain] ⚙️  Syncing  0.2 bps, target=#18838858 (43 peers), best: #14447387 (0x7cb5…d01f), finalized #14447104 (0x8c38…91fb), ⬇ 517.3kiB/s ⬆ 449.1kiB/s
Jul 19 08:13:35 ip-10-0-0-228 parity-moonbeam[375045]: 2023-07-19 08:13:35 [🌗] ⚙️  Syncing  0.2 bps, target=#4697322 (27 peers), best: #3451127 (0x41cd…1d03), finalized #2565171 (0xb87f…89bb), ⬇ 8.2kiB/s ⬆ 1.4kiB/s

=== Experiment 1: rocksdb from scratch ===
I tried syncing from scratch using RocksDB:

Jul 19 08:30:58 ip-10-0-0-228 parity-moonbeam[1354684]: 2023-07-19 08:30:58 [Relaychain] ⚙️  Syncing 368.8 bps, target=#16465081 (1 peers), best: #19264 (0x4555…7066), finalized #18944 (0x48b9…c40b), ⬇ 210.2kiB/s ⬆ 28.6kiB/s
Jul 19 08:30:58 ip-10-0-0-228 parity-moonbeam[1354684]: 2023-07-19 08:30:58 [🌗] ⚙️  Syncing 131.8 bps, target=#4025691 (1 peers), best: #5315 (0x7cea…36da), finalized #0 (0xfe58…b76d), ⬇ 510.3kiB/s ⬆ 0.3kiB/s

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1          74.20        26.40     17532.80         0.00        264     175328          0

(for 370 bps, the IOPS is only ~100; however, the chain/state was empty as this was the beginning of the sync)

=== Experiment 2: paritydb from scratch ===

Jul 19 08:34:57 ip-10-0-0-228 parity-moonbeam[1358078]: 2023-07-19 08:34:57 [Relaychain] ⚙️  Syncing 639.2 bps, target=#16465119 (1 peers), best: #46670 (0xf6a7…84f9), finalized #46592 (0x1456…92dd), ⬇ 261.1kiB/s ⬆ 13.7kiB/s
Jul 19 08:34:57 ip-10-0-0-228 parity-moonbeam[1358078]: 2023-07-19 08:34:57 [🌗] ⚙️  Syncing 152.2 bps, target=#4025708 (1 peers), best: #11722 (0x8186…cf93), finalized #0 (0xfe58…b76d), ⬇ 465.1kiB/s ⬆ 0.3kiB/s

Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1        3137.40        24.80     55084.00         0.00        248     550840          0

(for 370 bps, the IOPS is ~3000, which is a lot compared to RocksDB)
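For reference, the per-device figures quoted above match the output format of iostat from sysstat; a sketch of collecting them during sync (the device name is taken from the logs above):

# one report every 10 seconds for the device holding the node database
iostat -d nvme6n1 10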

jasl commented 1 year ago

This is still an issue on Polkadot 0.9.43: looking briefly at the sync=trace logs, it seems that blocks are imported fast initially, but as more and more time passes, fewer concurrent blocks are processed. It looks like some block imports might get stuck forever. sync.log

https://github.com/paritytech/substrate/pull/14285 does not seem to be bundled in 0.9.43, but is in 0.9.44. https://github.com/paritytech/substrate/pull/14423 has to wait for 0.9.45.

You have to backport it yourself. We forked 0.9.43 and cherry-picked these changes; here's our sample: https://github.com/Phala-Network/khala-parachain/blob/main/Cargo.toml#L60-L65

crystalin commented 1 year ago

Thank you, I'll try those

lexnv commented 2 months ago

@jasl @crystalin Do you still get this issue with the latest release?