Open purestakeoskar opened 1 year ago
khala-node.log.zip Here's the log (from the Azure VM) with the optimization. It is still slow, but the speed has improved compared to the previous log.
I didn't change anything on the host OS, only added the tune to the Docker command.
Thanks for your help and investigation! Maybe this is really a config issue. ~Did you try increasing the nofile limit? In light of your earlier comment that might make sense.~ Never mind, it seems to be already pretty high.
> Did you try increasing the nofile limit

Docker's default `nofile` is unlimited, so there is no need to tune this for Docker. For bare metal, people need to increase the value.
Even with the optimization, my ops colleague still suspects IO; he still sees a bunch of IO waste (in his view).
[May 30, 2023 at 17:27:08]:
just run atop
that shows real time
if you install it, there is a background process that records that every 10 mins
you can change that to 1m or something else
you can read those sampling data files
(also via atop)
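The sampling-interval change described above can be sketched roughly as follows; the exact config file and variable name (`/etc/default/atop`, `LOGINTERVAL`) are Debian/Ubuntu conventions and may differ per distro or atop version:

```shell
# /etc/default/atop -- atop's background logger configuration (assumption:
# Debian/Ubuntu packaging; other distros may use a different path/variable).
LOGINTERVAL=60        # sample every 60 s instead of the 600 s (10 min) default

# After changing it, restart the logger:
#   systemctl restart atop

# Read a recorded sample file back interactively (path is the typical default):
#   atop -r /var/log/atop/atop_20230530
```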
Does the Parity ops team have metrics on how a node uses IO? Could you ask them to monitor this and confirm it is good?
@PierreBesson Do you maybe know if we monitor and what is expected during sync?
I think it is expected to have heavy IO when syncing the nodes as you are populating the db. To monitor it your IO you should use the prometheus node-exporter. @jasl if you want to see precise IO usage data over time, you can set it up on your node.
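A minimal sketch of the node-exporter setup suggested above, based on the upstream image's documented defaults (image name, port 9100, and metric names are the prometheus/node_exporter defaults; verify against the version you deploy):

```shell
# Run node-exporter with the host filesystem mounted so disk stats are visible.
docker run -d --name node-exporter --net host --pid host \
  -v /:/host:ro,rslave \
  quay.io/prometheus/node-exporter:latest --path.rootfs=/host

# Raw disk counters are then scrapeable on port 9100:
curl -s localhost:9100/metrics | grep '^node_disk_\(reads\|writes\)_completed_total'

# Example PromQL for an IOPS panel in Grafana:
#   rate(node_disk_reads_completed_total[5m]) + rate(node_disk_writes_completed_total[5m])
```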
Thank you, we shall monitor the IO metrics.
@skunert With the OS tune, the worst case can move forward now. I'm OK to close the issue. Thank you and all participants.
I've proposed to our ops to make a Grafana chart for IO so that I can make a separate issue for that.
BTW, is it possible to separate downloading and validating blocks (at least in major sync)? I think we can lazily validate blocks and load them in batches, which may reduce IO (especially random IO) and improve efficiency.
> @skunert With the OS tune, the worst case can move forward now. I'm OK to close the issue. Thank you and all participants.

Okay, good to know! Let's keep this open for now; I have one or two more ideas of what to look into.
> I've proposed to our ops to make a Grafana chart for IO so that I can make a separate issue for that.
>
> BTW, is it possible to separate download and validate blocks (at least in major sync)? I think we can lazily validate blocks and load them in batch, which may reduce IO (especially random IO) and improve efficiency
What do you mean by this? We download blocks in chunks, then add them to an import queue. If there are too many blocks in the queue, we stop issuing new block requests for a while (the current limit is ~2000). But downloading is not really the limiting factor; block execution is. The node executes the downloaded blocks as fast as possible, and there is no way around that for the classic sync strategy, since we need to update the state block by block.
Ok, I got it. It was just a random thought of mine.
Substrate has a `--sync fast` mode, so I wonder if I can download blocks and proofs first, then lazily execute them to build the full state. Our usage requires `archive` mode, so we can't use that, but I'm guessing this could be faster than `full`.
I ran some more tests today, with more debug logs added and a slower disk to make the issue more visible. I had multiple occurrences of very slow imports. The culprit seems to be the commit to the database. The slow imports took multiple minutes from this to this log.
2023-05-31 13:38:15.029 DEBUG tokio-runtime-worker db: [Parachain] DB Commit 0xeba4f1c5d7ab981074384a441af4dc1c1f3ce46f397c562e141a1bbeecd41785 (1221748), best=true, state=true, existing=false, finalized=false
...
2023-05-31 13:42:24.816 TRACE tokio-runtime-worker db: [Parachain] DB Commit done 0xeba4f1c5d7ab981074384a441af4dc1c1f3ce46f397c562e141a1bbeecd41785
It took more than 200 seconds in one example and more than 400 seconds in another, which seems very excessive. Maybe a bug in ParityDB? cc: @arkpar logs
@jasl What's your OS and filesystem? Could you collect another flamegraph for the slow import?

This is most likely throttling by Azure caused by hitting VM or disk IOPS limits. ParityDB indeed relies a lot on random disk IO. Typically a consumer-grade SSD performs well enough that an archive node can sync with no issues. We haven't seen such problems with other cloud providers either.

Try setting the disk sector size to be as low as possible, i.e. 512 bytes.

Regarding `ulimit -n`, this setting is irrelevant. The Polkadot process sets its own limit on startup with a call to `setrlimit`, which overrides `ulimit` settings. The page lock limit (`memlock`) should also have no effect, since we don't lock any pages.
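Both points above (the effective `nofile` limit after `setrlimit`, and the disk sector size) can be inspected without touching the node; a sketch using standard Linux tools, with `/proc/self` standing in for the node's PID:

```shell
# The limit actually in effect for a process lives in /proc/<pid>/limits,
# so a node that calls setrlimit on startup shows its raised value here
# regardless of the shell's `ulimit -n`. (/proc/self is a stand-in; use the
# node's PID, e.g. /proc/$(pgrep khala-node)/limits.)
grep "Max open files" /proc/self/limits

# Logical/physical sector sizes per block device, no root needed
# (guarded in case lsblk is unavailable):
command -v lsblk >/dev/null && lsblk -o NAME,LOG-SEC,PHY-SEC || true
```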
I'll do that later. I can also share that a consumer-grade SSD (Crucial MX500 SATA) with the default Ubuntu 22.04 settings is slow too (but at least moving).
I also got a report that an entry-level enterprise U.2 SSD (WD SN640) has the issue (I reported it here: https://github.com/paritytech/polkadot-sdk/issues/13), and this case is odd: with the `memlock` tune, the user says the node runs better, but still very slowly.
I also checked the chat history: a user claimed he is using a Kingston Data Center SEDC500M/1920G enterprise SSD.
@arkpar I'm now testing the Crucial MX500 SATA on my Minisforum UM773 Lite (Ryzen 7735HS, 64G DDR5 4800MHz) mini PC; here's the flame graph.
Ubuntu 22.04.1, EXT4 with LVM enabled, all default settings (changed `ulimit -n` only).
A sample of the node log:
[Relaychain] ⚙️ Syncing 14.0 bps, target=#18164707 (40 peers), best: #12548322 (0x6734…933b), finalized #12548096 (0x4fd4…d06c), ⬇ 930.4kiB/s ⬆ 156.1kiB/s
[Parachain] ⚙️ Syncing 2.2 bps, target=#3995002 (40 peers), best: #3231285 (0xaed7…6a41), finalized #1561916 (0xb7a7…a85f), ⬇ 29.8kiB/s ⬆ 7.7kiB/s
[Relaychain] ⚙️ Syncing 17.2 bps, target=#18164710 (38 peers), best: #12548408 (0x6c9b…a290), finalized #12548096 (0x4fd4…d06c), ⬇ 1.8MiB/s ⬆ 127.9kiB/s
[Parachain] ⚙️ Syncing 0.8 bps, target=#3995002 (40 peers), best: #3231289 (0xf68b…e1aa), finalized #1561916 (0xb7a7…a85f), ⬇ 29.4kiB/s ⬆ 12.5kiB/s
[Relaychain] ⚙️ Syncing 51.2 bps, target=#18164710 (40 peers), best: #12548664 (0x7ef4…3f03), finalized #12548608 (0x1f91…435f), ⬇ 659.5kiB/s ⬆ 152.8kiB/s
[Parachain] ⚙️ Syncing 2.4 bps, target=#3995002 (40 peers), best: #3231301 (0xb6e6…544d), finalized #1562070 (0x20e3…6bd7), ⬇ 72.6kiB/s ⬆ 10.0kiB/s
[Relaychain] ⚙️ Syncing 18.8 bps, target=#18164710 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 450.6kiB/s ⬆ 136.7kiB/s
[Parachain] ⚙️ Syncing 1.4 bps, target=#3995002 (40 peers), best: #3231308 (0x51b8…1e1e), finalized #1562070 (0x20e3…6bd7), ⬇ 37.3kiB/s ⬆ 9.7kiB/s
[Relaychain] ⚙️ Syncing 0.0 bps, target=#18164710 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 311.3kiB/s ⬆ 155.5kiB/s
[Parachain] ⚙️ Syncing 0.0 bps, target=#3995002 (40 peers), best: #3231308 (0x51b8…1e1e), finalized #1562070 (0x20e3…6bd7), ⬇ 64.6kiB/s ⬆ 11.0kiB/s
[Relaychain] ⚙️ Syncing 0.0 bps, target=#18164711 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 200.1kiB/s ⬆ 163.3kiB/s
[Parachain] ⚙️ Syncing 0.0 bps, target=#3995002 (40 peers), best: #3231308 (0x51b8…1e1e), finalized #1562070 (0x20e3…6bd7), ⬇ 36.3kiB/s ⬆ 9.8kiB/s
[Relaychain] ⚙️ Syncing 0.0 bps, target=#18164712 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 180.2kiB/s ⬆ 148.3kiB/s
[Parachain] ⚙️ Syncing 0.0 bps, target=#3995002 (40 peers), best: #3231308 (0x51b8…1e1e), finalized #1562070 (0x20e3…6bd7), ⬇ 37.6kiB/s ⬆ 10.0kiB/s
[Relaychain] ⚙️ Syncing 0.0 bps, target=#18164715 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 207.2kiB/s ⬆ 189.8kiB/s
[Parachain] ⚙️ Syncing 0.4 bps, target=#3995002 (40 peers), best: #3231310 (0xd21c…f763), finalized #1562070 (0x20e3…6bd7), ⬇ 50.7kiB/s ⬆ 11.6kiB/s
[Relaychain] ⚙️ Syncing 0.0 bps, target=#18164715 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 174.6kiB/s ⬆ 325.9kiB/s
[Parachain] ⚙️ Syncing 2.8 bps, target=#3995002 (40 peers), best: #3231324 (0xb653…460a), finalized #1562070 (0x20e3…6bd7), ⬇ 17.5kiB/s ⬆ 11.7kiB/s
[Relaychain] ⚙️ Syncing 0.0 bps, target=#18164715 (40 peers), best: #12548758 (0xea7d…7971), finalized #12548608 (0x1f91…435f), ⬇ 229.0kiB/s ⬆ 203.2kiB/s
[Parachain] ⚙️ Syncing 3.0 bps, target=#3995002 (40 peers), best: #3231339 (0x39e6…b670), finalized #1562070 (0x20e3…6bd7), ⬇ 81.1kiB/s ⬆ 10.8kiB/s
[Relaychain] ⚙️ Syncing 2.6 bps, target=#18164715 (40 peers), best: #12548771 (0x77bb…21d1), finalized #12548608 (0x1f91…435f), ⬇ 257.9kiB/s ⬆ 299.7kiB/s
[Parachain] ⚙️ Syncing 3.4 bps, target=#3995002 (40 peers), best: #3231356 (0x45ce…c321), finalized #1562070 (0x20e3…6bd7), ⬇ 24.1kiB/s ⬆ 12.0kiB/s
For Azure, I'll report later.
The logs in my previous post were gathered by me on a gcloud VM (e2-standard-16) with a standard disk (pd-standard). The standard disk is indeed super slow, and I was trying to make this issue appear more often. I also have logs from the same machine with a faster (pd-standard) disk where this issue occurred and some block imports took ~60s.
Limits for reference:
Looking at the flamegraph above, there seem to be a lot of header queries made by the transaction pool. I think this was fixed recently, @bkchr?
I don't get why the transaction pool is querying the headers. The transaction pool only operates on block import or block finalized notifications. There should be no block import notifications while we sync; there could only be a finality notification from time to time. The tx pool will skip the maintenance if the distance between the current and last operated-on block is too high (more than 20 blocks).
Is this Substrate-internal logic, or exposed to the node's `service.rs`?
Just checked your code. It is internal Substrate logic, and I don't see you calling this function on your own.
@jasl can you run with `txpool=trace` for some minutes? How long did you run the node to capture the flamegraph above?
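One way to enable the requested log target, assuming Substrate's standard `-l`/`RUST_LOG` log handling (the node name and chain flag are taken from earlier in this thread; trace output is huge, so redirect stderr to a file):

```shell
# Substrate-style CLI log directive:
./khala-node --chain khala -ltxpool=trace 2> txpool-trace.log

# Equivalently via the environment:
RUST_LOG=txpool=trace ./khala-node --chain khala 2> txpool-trace.log
```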
That's roughly 30 seconds. I also recorded one for about 5-10 minutes (I started the recording and went AFK for a while).
I'll try `txpool=trace` later.
@bkchr I think the txpool has trouble. Here is the log; I only ran the node for < 5 min, which generated an 8GB log:
https://storage.googleapis.com/phala-misc/trace_txpool.log.zip
NOTE: I must confess that in the early days of Khala being online, our pallet had serious bugs and we had to clear our collators' txpools to mitigate them.
But we found that syncing Phala (on Polkadot) has the same trouble, even though the old problems have already been properly fixed.
For now I don't have a Phala node, so I can't share info yet, but I'll do so next week.
The log is huge because `tree_route` is called multiple times for a long route, and the whole thing is just dumped into the log.
2023-06-01 11:40:55.922 DEBUG tokio-runtime-worker txpool: [Parachain] resolve hash:0xfd9bc5828f075da985eefc978d6da257b4a435c5cc8b8495b67a7d2f853b50e5 finalized:true tree_route:TreeRoute { route: [HashAndNumber { number: 3242935, ....
There was a PR (https://github.com/paritytech/substrate/pull/13004) to prevent this, but apparently it does not help much.
In the logs we see the transaction pool act on finality notifications. Usually we should only see one finality justification for each incoming block with an attached justification. The justification should be attached every `justification_period` blocks or at the end of the era.

@jasl However, I found that in Khala the `justification_period` is set to 1 here. This is really bad, as the node will store every justification and also give it out during sync. This generates finality notifications, which in turn trigger actions in the node (like in the transaction pool) during sync, so it will impact all syncing nodes that request blocks from it.
Yeah, we patched the `justification_period` for historical reasons. Do you mean the txpool is processed when a finality justification is generated? If we revert to the default value of 512, would the txpool be processed every 512 blocks, which would reduce the pressure on the node?
I mean it's not just the transaction pool; basically every location that listens for finality notifications in Substrate will be notified. In addition, we verify the justification and do extra operations in the database, state-db, etc. This brings extra overhead that we should not have for every single block.
I understand that patching `justification_period` to 1 introduces an additional performance impact; I'll ask our devs whether I can remove the patch.
But I still don't quite understand: do you mean the txpool wakes up on finality notifications and this is required for justification verification? Or do you mean that during major sync, the txpool shouldn't wake up even if it gets a finality notification?
PS: I'm quite new to this domain; are there any docs I can learn from?
> I can understand we patched `justification_period` to 1 introducing additional performance impact, I'll ask our dev for whether I can remove the patch.

There is an RPC to prove finality that you can use.

> Do you mean txpool awakens by finality notifications and this is required for justification verification? or do you mean in major sync, txpool shouldn't awaken even if it got a finality notification?
https://github.com/paritytech/substrate/pull/14285 will improve the situation. However, I also think there is maybe still more to it.
Thank you, I can backport your PR to our node to test it.
> Do you mean txpool awakens by finality notifications and this is required for justification verification? or do you mean in major sync, txpool shouldn't awaken even if it got a finality notification?

The transaction pool listens for these notifications and does some maintenance work. We have logic in place to skip this based on block distance. However, when the node finalizes every block, this maintenance is triggered too often. But as I said, there are other problems as well: finalizing each block individually also takes more resources than finalizing in batches.
Even if you disable this now, your existing nodes will still have the justifications in their db. Ideally, the node would not finalize for every incoming justification during sync, so that will probably change.
I've backported https://github.com/paritytech/substrate/pull/14285 to our Khala node. Here are the new flame graphs:
Short (ran about 20 s): sync_short.svg.zip
Long (ran about 1.5 min): sync_long.svg.zip
@jasl can you please share the exact CLI args you are using to spawn your node?
This could still be a result of your relay chain node importing a justification every block. We need to wait for @andresilva to provide a fix that ensures we don't import a justification for every block, as you have configured it (which is wrong and should be reverted!).
```shell
./khala-node \
  --chain khala \
  --base-path $DATA_PATH \
  --name $NODE_NAME \
  --port 30333 \
  --prometheus-port 9615 \
  --rpc-port 9933 \
  --ws-port 9944 \
  --database paritydb \
  --no-hardware-benchmarks \
  --no-telemetry \
  --rpc-max-response-size 64 \
  --max-runtime-instances 16 \
  --runtime-cache-size 8 \
  --state-pruning archive-canonical \
  --blocks-pruning archive-canonical \
  -- \
  --chain kusama \
  --port 30334 \
  --prometheus-port 9616 \
  --rpc-port 9934 \
  --ws-port 9945 \
  --database paritydb \
  --no-hardware-benchmarks \
  --no-telemetry \
  --rpc-max-response-size 64 \
  --max-runtime-instances 16 \
  --runtime-cache-size 8 \
  --state-pruning archive-canonical \
  --blocks-pruning archive-canonical
```
> (which is wrong and should be reverted!)

Thank you, now I understand how bad it would be if we set a very short `justification_period`. I'll forward this warning to our team.
@jasl If you want, you could try to run your node with `--relay-chain-rpc-url <external-relay-chain-rpc>` and point it to one of your Polkadot nodes. This will not start an internal Polkadot node, and thus the justification period will have no impact there. This way you could at least see whether it syncs faster with this issue eliminated, even before we have the proper fix in place.
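A sketch of what that invocation might look like, dropping the embedded relay chain args after `--` from the full command earlier in the thread; the websocket host and port are placeholders, not real endpoints:

```shell
# Parachain node using an external relay chain RPC instead of an embedded
# Polkadot node (hypothetical endpoint; substitute your own node's ws URL):
./khala-node \
  --chain khala \
  --relay-chain-rpc-url ws://my-kusama-node:9944
```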
Our collators are using `--relay-chain-rpc-url <external-relay-chain-rpc>` and don't seem to have any sync issue; I'll try it for our normal nodes later.
But running two apps in one Docker container is not recommended, so while I would like to test it, for end users I shall wait for the proper fix.
We're seeing the same behaviour here on an Astar node. It's been tested with relay RPC sync too, but the issue is the same.
It's important to note that it happens only on RPC nodes under a high number of requests. On the screenshot below we can see the same behavior: the node loses sight of the head of the chain pretty quickly after start, and the gap increases over time, while finalization gets completely stuck. The node was restarted after an hour (18:15 on the graph), resynced to the right chain height, then lost it again.
@bLd75 Can you provide logs with `-lsync=trace` for the time period where the issue appears? The issues we previously discussed in the comments should not occur when you run with an external relay chain node.
This is still an issue on Polkadot 0.9.43. Looking briefly at the sync=trace logs, it seems that blocks are imported fast initially, but as more and more time passes, fewer blocks are processed concurrently. It looks like some block imports might get stuck forever. sync.log
Additional information: it seems that the import is limited by disk IOPS (this is especially the case with cloud services providing low default IOPS). Surprisingly, even when sync is at 0.0 bps, it is still using the max IOPS (3000).
I think something changed in a recent version (probably around 0.9.37) that significantly increases the IOPS, reaching a limit where it snowballs and blocks most of the processes.
Ex:

```
Device            tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1       3032.20     15671.60         0.00         0.00     156716          0          0
```
Increasing it (in real time) to 10k IOPS restores some bps, but it quickly drops back to 0.2 or 0.0 bps. I think something is inefficient and snowballing.
```
Device            tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1      10114.10     52228.00       356.00         0.00     522280       3560          0
```
Jul 19 08:13:35 ip-10-0-0-228 parity-moonbeam[375045]: 2023-07-19 08:13:35 [Relaychain] ⚙️ Syncing 0.2 bps, target=#18838858 (43 peers), best: #14447387 (0x7cb5…d01f), finalized #14447104 (0x8c38…91fb), ⬇ 517.3kiB/s 449.1kiB/s
Jul 19 08:13:35 ip-10-0-0-228 parity-moonbeam[375045]: 2023-07-19 08:13:35 [🌗] ⚙️ Syncing 0.2 bps, target=#4697322 (27 peers), best: #3451127 (0x41cd…1d03), finalized #2565171 (0xb87f…89bb), ⬇ 8.2kiB/s 1.4kiB/s
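The per-device numbers quoted above come from sysstat's iostat; a sketch of how to watch them live alongside the node process (device name and process name are examples from this thread, not fixed values):

```shell
# Extended per-device stats (-x) for one device, refreshed every second;
# the `tps` column is the IOPS figure discussed above.
iostat -dx nvme6n1 1

# Per-process disk IO for the node, assuming sysstat's pidstat is installed
# and the binary is named parity-moonbeam:
pidstat -d -p "$(pgrep -f parity-moonbeam)" 1
```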
=== Experiment 1: rocksdb from scratch ===

I tried syncing from scratch using RocksDB:
Jul 19 08:30:58 ip-10-0-0-228 parity-moonbeam[1354684]: 2023-07-19 08:30:58 [Relaychain] ⚙️ Syncing 368.8 bps, target=#16465081 (1 peers), best: #19264 (0x4555…7066), finalized #18944 (0x48b9…c40b), ⬇ 210.2kiB/s ⬆ 28.6kiB/s
Jul 19 08:30:58 ip-10-0-0-228 parity-moonbeam[1354684]: 2023-07-19 08:30:58 [🌗] ⚙️ Syncing 131.8 bps, target=#4025691 (1 peers), best: #5315 (0x7cea…36da), finalized #0 (0xfe58…b76d), ⬇ 510.3kiB/s ⬆ 0.3kiB/s
```
Device            tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1         74.20        26.40     17532.80         0.00        264     175328          0
```
(for ~370 bps, IOPS is only ~100; however, the chain/state was empty, as it was the beginning of the sync)
=== Experiment 2: paritydb from scratch ===
Jul 19 08:34:57 ip-10-0-0-228 parity-moonbeam[1358078]: 2023-07-19 08:34:57 [Relaychain] ⚙️ Syncing 639.2 bps, target=#16465119 (1 peers), best: #46670 (0xf6a7…84f9), finalized #46592 (0x1456…92dd), ⬇ 261.1kiB/s ⬆ 13.7kiB/s
Jul 19 08:34:57 ip-10-0-0-228 parity-moonbeam[1358078]: 2023-07-19 08:34:57 [🌗] ⚙️ Syncing 152.2 bps, target=#4025708 (1 peers), best: #11722 (0x8186…cf93), finalized #0 (0xfe58…b76d), ⬇ 465.1kiB/s ⬆ 0.3kiB/s
```
Device            tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
nvme6n1       3137.40        24.80     55084.00         0.00        248     550840          0
```
(for a comparable sync speed, IOPS is ~3000, which is a lot compared to RocksDB)
https://github.com/paritytech/substrate/pull/14285 doesn't seem to be bundled in 0.9.43, only in 0.9.44; for https://github.com/paritytech/substrate/pull/14423 we have to wait for .45.
You have to backport them yourself. We forked 0.9.43 and cherry-picked these changes; here's our sample: https://github.com/Phala-Network/khala-parachain/blob/main/Cargo.toml#L60-L65
Thank you, I'll try those
@jasl @crystalin Do you still get this issue with the latest release?
Is there an existing issue?
Experiencing problems? Have you tried our Stack Exchange first?
Description of bug
When syncing a node, block import is very slow, to the point where block production is faster than block import. Instead of `sync`, the logs show `Preparing`. The node is connected to peers that have the blocks we need.
There are also queued synced blocks (the `sync_queued_blocks` metric).
Another interesting note is that the node does not know how far it needs to sync: its sync_target is equal to its best_block.
Steps to reproduce

Start syncing a moonbeam or moonriver node with an archive parachain and a pruned relay chain. We are running moonbeam version `0.30.3` (using polkadot `0.9.37`). Main flags used