paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.network/

pinning: Notification block pinning limit reached for warp-sync #4389

Open lexnv opened 1 month ago

lexnv commented 1 month ago

The following warning is continuously printed by a warp-syncing Kusama node (running the litep2p backend) at roughly 20-second intervals.

However, this warning did not reproduce for a full-syncing node.

This was discovered during a long-running node triage (around 2 weeks of running the litep2p backend).

It seems that during warp-sync the pinning limit is reached, after which the cache constantly holds around 1k pinned blocks. When a new block is discovered, the LRU block is unpinned, causing a warning. It might be the case that references to pinned blocks are kept around for too long (maybe because we have too many blocks).

2024-05-06 13:06:12.494  WARN tokio-runtime-worker db::notification_pinning: Notification block pinning limit reached. Unpinning block with hash = 0x354f531933d7a20b92e652ad0ec2034a74f645923f13ce2651056b2a951a7692

The warning comes from:

https://github.com/paritytech/polkadot-sdk/blob/0fedb4dd008ee2adecdfa21b9362d568a9c64a48/substrate/client/service/src/client/notification_pinning.rs#L96-L103
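For illustration, here is a minimal, self-contained sketch of the behaviour described above: a bounded pin cache that, once full, unpins the least recently used block and emits a warning. The names, the limit constant, and the log text are assumptions for this toy model, not the actual `notification_pinning` implementation.

```rust
use std::collections::VecDeque;

// Assumed limit, matching the ~1k figure mentioned above.
const PINNING_LIMIT: usize = 1024;

struct PinCache {
    // Front = least recently pinned, back = most recently pinned.
    pinned: VecDeque<[u8; 32]>,
}

impl PinCache {
    fn new() -> Self {
        Self { pinned: VecDeque::new() }
    }

    fn pin(&mut self, hash: [u8; 32]) {
        if self.pinned.len() >= PINNING_LIMIT {
            // Cache is full: the LRU block gets unpinned, which is the point
            // where the "pinning limit reached" warning is emitted.
            if let Some(evicted) = self.pinned.pop_front() {
                eprintln!("WARN: pinning limit reached, unpinning block 0x{}", hex(&evicted));
            }
        }
        self.pinned.push_back(hash);
    }
}

fn hex(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{b:02x}")).collect()
}

fn main() {
    let mut cache = PinCache::new();
    // Simulate pinning more blocks than the limit allows, as during warp-sync.
    for i in 0..(PINNING_LIMIT as u64 + 5) {
        let mut hash = [0u8; 32];
        hash[..8].copy_from_slice(&i.to_be_bytes());
        cache.pin(hash);
    }
}
```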

cc @paritytech/sdk-node @paritytech/networking

skunert commented 1 month ago

I will warp sync a new kusama node on a dev machine and see if I can find anything fishy.

skunert commented 1 month ago

I was able to reproduce this. As a quick recap: we do not prune blocks that have FinalityNotifications or ImportNotifications floating around. Only once the UnpinHandle is dropped can the block be pruned in the backend.

It looks like during initialization the BEEFY worker expects a given header ancestry to be available and is basically waiting for those headers here: https://github.com/paritytech/polkadot-sdk/blob/efc2132fa2ece419d36af03c935b3c2c60440eb5/substrate/client/consensus/beefy/src/lib.rs#L640-L646

In my test it took 6 hours for this loop to continue. But in the meantime there is a finality stream that is not getting polled here: https://github.com/paritytech/polkadot-sdk/blob/efc2132fa2ece419d36af03c935b3c2c60440eb5/substrate/client/consensus/beefy/src/lib.rs#L524 The notifications pile up there and the pinning cache limit is reached.
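As a toy illustration of the pile-up (all names here are stand-ins, not the real sc-client-api types): a worker that never drains its finality channel keeps every notification, and therefore every pin handle, alive.

```rust
use std::sync::mpsc;

// Stand-in for the handle that keeps a block pinned until it is dropped.
struct UnpinHandle(u64);

impl Drop for UnpinHandle {
    fn drop(&mut self) {
        println!("block {} unpinned", self.0);
    }
}

// Stand-in for a finality notification carrying such a handle.
struct FinalityNotification {
    hash: u64,
    _unpin_handle: UnpinHandle,
}

fn main() {
    let (tx, rx) = mpsc::channel::<FinalityNotification>();

    // Finality keeps producing notifications during warp-sync...
    for hash in 0..5u64 {
        tx.send(FinalityNotification { hash, _unpin_handle: UnpinHandle(hash) }).unwrap();
    }

    // ...but the worker is stuck waiting for ancestry headers and never drains
    // `rx`, so no handle is dropped and every block stays pinned.
    let stuck_waiting_for_headers = true;
    if !stuck_waiting_for_headers {
        for n in rx.iter() {
            println!("processed finalized block {}", n.hash);
        }
    }
    println!("worker still waiting: notifications queued, blocks still pinned");
    // Only when `rx` is dropped at the end of `main` are the queued
    // notifications (and their handles) dropped, i.e. the blocks unpinned.
}
```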

I am not too familiar with the beefy logic, but maybe we could drain the stream while waiting for the headers or, if we need to act on every notification, we could map them to a different type that does not include the handle.
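A rough sketch of that second option, with made-up types: map each notification to a handle-free struct as it arrives, so the pin is released immediately while the data stays buffered for later processing.

```rust
struct UnpinHandle;

// Stand-in for a finality notification: data plus the pin handle.
struct FinalityNotification {
    hash: [u8; 32],
    header: Vec<u8>, // placeholder for the full header
    _unpin_handle: UnpinHandle,
}

// Handle-free copy of the information the worker still needs later.
struct QueuedFinality {
    hash: [u8; 32],
    header: Vec<u8>,
}

fn buffer_without_pinning(
    notifications: impl Iterator<Item = FinalityNotification>,
    queue: &mut Vec<QueuedFinality>,
) {
    for n in notifications {
        queue.push(QueuedFinality { hash: n.hash, header: n.header });
        // `n` (and its UnpinHandle) is dropped here, so the block can be
        // unpinned even though its data stays queued.
    }
}

fn main() {
    let mut queue = Vec::new();
    let notif = FinalityNotification { hash: [0u8; 32], header: vec![], _unpin_handle: UnpinHandle };
    buffer_without_pinning(std::iter::once(notif), &mut queue);
    println!("buffered {} finalized blocks without holding pins", queue.len());
}
```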

@acatangiu What do you think?

bkchr commented 1 month ago

As I already proposed here: https://github.com/paritytech/polkadot-sdk/issues/3945#issuecomment-2041027723 the beefy networking should not be registered before beefy is active.

bkchr commented 1 month ago

(Basically this entire code should not do anything until BEEFY is ready.)

acatangiu commented 1 month ago

I believe this is a different case that would still be a problem even if BEEFY registers networking only after being active.

In the case of warp sync, only mandatory headers get synced, and the chain's "normal operation" starts while the rest of the headers sync in the background.

In either case, the BEEFY worker needs to register for finality notifications because it is exclusively driven by GRANDPA finality. Simplified initialization steps (a rough code sketch follows the list):

  1. For each finalized block, it checks whether the BEEFY protocol was initialized on chain (beefy_genesis set),
  2. Block N is finalized where BEEFY was initialized: beefy_genesis <= N,
    1. if beefy_genesis == N it's simple, just get the initial BEEFY validators from the header,
    2. if beefy_genesis < N, we need to get the initial validators from the beefy_genesis header - BUT when warp-syncing, this header is not yet available (it comes in later), so without https://github.com/paritytech/polkadot-sdk/issues/1118 the workaround was to "just wait" for the sync to complete and then resume processing the GRANDPA finality stream (explained in this comment). We can't continue processing without the initial validator set, and we also can't drop finality notifications without processing them.
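A condensed sketch of these steps, using hypothetical helpers (`beefy_genesis_at`, `validator_set_from_header`, `wait_for_header`); the real logic lives in substrate/client/consensus/beefy/src/lib.rs and differs in detail.

```rust
struct Header;
struct ValidatorSet;

fn initialize_beefy(
    finalized_blocks: impl Iterator<Item = (u64, Header)>,
    beefy_genesis_at: impl Fn(&Header) -> Option<u64>,
    validator_set_from_header: impl Fn(&Header) -> ValidatorSet,
    wait_for_header: impl Fn(u64) -> Header, // blocks until the sync provides it
) -> ValidatorSet {
    for (number, header) in finalized_blocks {
        // Step 1: find the first finalized block where BEEFY is initialized on chain.
        let Some(beefy_genesis) = beefy_genesis_at(&header) else { continue };

        return if beefy_genesis == number {
            // Step 2.1: genesis is this very block, read the validators directly.
            validator_set_from_header(&header)
        } else {
            // Step 2.2: genesis is older; during warp sync that header is not
            // available yet, so the current workaround is to wait for the
            // background sync to deliver it before continuing.
            let genesis_header = wait_for_header(beefy_genesis);
            validator_set_from_header(&genesis_header)
        };
    }
    unreachable!("BEEFY is expected to be initialized on some finalized block")
}
```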

Therefore, this is not helped by https://github.com/paritytech/polkadot-sdk/issues/3945#issuecomment-2041027723

acatangiu commented 1 month ago

I am not too familiar with the beefy logic, but maybe we could drain the stream while waiting for the headers or, if we need to act on every notification, we could map them to a different type that does not include the handle.

Yes, either drain the notifications and map/enqueue just the relevant data (or the full header) into some BEEFY worker queue. Once the sync has finished, the worker needs to process the queue, then switch back to processing notifications once it has caught up.
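A minimal sketch of that flow (assumed types and names, not the actual worker code): buffer handle-free data while the header sync is catching up, drain the backlog once the sync completes, then handle notifications live.

```rust
struct QueuedFinality {
    hash: [u8; 32],
}

enum Phase {
    WaitingForSync { backlog: Vec<QueuedFinality> },
    CaughtUp,
}

fn on_finality_notification(phase: &mut Phase, hash: [u8; 32]) {
    match phase {
        // While headers are still syncing, only buffer the relevant data; the
        // notification itself (and its pin handle) is dropped by the caller.
        Phase::WaitingForSync { backlog } => backlog.push(QueuedFinality { hash }),
        // After catching up, act on notifications directly.
        Phase::CaughtUp => process_finalized(hash),
    }
}

fn on_sync_complete(phase: &mut Phase) {
    if let Phase::WaitingForSync { backlog } = phase {
        // Drain the backlog in order before switching to live processing.
        for queued in backlog.drain(..) {
            process_finalized(queued.hash);
        }
        *phase = Phase::CaughtUp;
    }
}

fn process_finalized(hash: [u8; 32]) {
    println!("processing finalized block starting 0x{:02x?}", &hash[..4]);
}

fn main() {
    let mut phase = Phase::WaitingForSync { backlog: Vec::new() };
    on_finality_notification(&mut phase, [1u8; 32]); // buffered, pin released by caller
    on_finality_notification(&mut phase, [2u8; 32]);
    on_sync_complete(&mut phase);                    // drain backlog, switch mode
    on_finality_notification(&mut phase, [3u8; 32]); // processed live
}
```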