paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

network: Investigate high memory consumption for long-running node #4927

Open lexnv opened 4 months ago

lexnv commented 4 months ago

Two Kusama nodes were started on 23rd April and left running.

{__name__="substrate_build_info", chain="ksmcc3", instance="localhost:9615", job="substrate_node", name="gray-vase-1131", version="1.10.0-1a45bd88348"}

The commit was based on a2a049db2bd, from branch:

These are the same nodes as: https://github.com/paritytech/polkadot-sdk/issues/4925

Extracted metrics: metrics.txt

ps -eo size,pid,user,start,command --sort -size | grep polka                                                                                                                              

37124040 283194 ubuntu   Apr 23 ./target/release/polkadot -d /home/ubuntu/workspace/Kusama-db-full --chain kusama --port 30355 --pruning=1000 --network-backend litep2p --detailed-log-output

32125892 283378 ubuntu   Apr 23 ./target/release/polkadot -d /home/ubuntu/workspace/kusama-db --chain kusama --pruning=1000 --sync=warp --network-backend litep2p --detailed-log-output

# Comparing to a freshly started node
4649592 3017641 ubuntu 10:38:35 ./target/release/polkadot --prometheus-port 9616 --port 30344

The long-running nodes are consuming roughly 36253.95 MiB and 31372.94 MiB, compared to 4540.62 MiB for a freshly started node.
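For reference, the figures above can be reproduced from the raw `ps` output. This is a minimal sketch, assuming that `ps` reports the `size` column in KiB (which matches the numbers quoted); the helper name `parse_ps_size_line` is made up for illustration.

```python
def parse_ps_size_line(line: str) -> tuple[int, float]:
    """Return (pid, size in MiB) from one line of `ps -eo size,pid,...` output.

    Assumption: the first column is the `size` value in KiB and the
    second column is the PID, as in the command used in this issue.
    """
    fields = line.split()
    size_kib = int(fields[0])
    pid = int(fields[1])
    return pid, size_kib / 1024


# Lines copied from the output above, with the command arguments shortened.
ps_lines = [
    "37124040 283194 ubuntu   Apr 23 ./target/release/polkadot",
    "32125892 283378 ubuntu   Apr 23 ./target/release/polkadot",
    "4649592 3017641 ubuntu 10:38:35 ./target/release/polkadot",
]

for ps_line in ps_lines:
    pid, size_mib = parse_ps_size_line(ps_line)
    print(f"pid={pid} size={size_mib:.2f} MiB")
```

Note that `size` is only a rough estimate of writable memory, so it is best used for comparing nodes against each other rather than as an exact resident-memory figure.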

Metrics

According to the metrics, CPU usage is consumed almost entirely by the network-worker and "libp2p-node" (note that this node is using the litep2p backend).

The "libp2p" metric shows almost 38/40 tasks running at any given time.

Total network inbound: 2.6 TiB
Total network outbound: 806 GiB
Node oscillated between 0 and 1 syncing peer.

Mpsc_import_notification_stream is the only channel with messages queued (160, although the dashboard might be wrong). Chain_sync and Network_worker are sending messages, with occasional messages from peer-set, network-gossip and transactions-handler-sync.

lexnv commented 4 months ago

Triaging local logs

Count      | Triage report
160155     | Notification block pinning limit reached. Unpinning block with hash = .*
2843       | 🥩 Error: .*. Restarting voter.
775        | .* banned, disconnecting, reason: .*
770        | 💔 Error importing block .*: .*
273        | \(offchain call\) Error submitting a transaction to the pool: .*
171        | Detected prevote equivocation in the finality worker: .*
102        | Detected precommit equivocation in the finality worker: .*
95         | ❌ Error while dialing .*: .*
42         | 🥩 ran out of peers to request justif #.* from
20         | Re-finalized block #.* \(.*\) in the canonical chain, current best finalized is #.*
10         | 💔 Called `on_validated_block_announce` with a bad peer ID .*
2          | Block import error: .*

Unknown
140        | litep2p::ipfs::identify: inbound identify substream opened for peer who doesn't exist peer=PeerId(\"12D3KooWF3PWbXdGEuT35nBh3MgECtxnHng3s5c5QKapoDZMy38z\") protocol=/ipfs/id/1.0.0
4          | sync: 💔 Ignored block (#22873601 -- 0x649e…eab2) announcement from 12D3KooWBDbBuoE4umuzJnZcUouT4GY6n31BRWHXdAFsThjTKrug because all validation slots for this peer are occupied.

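
A triage report like the one above can be produced by matching each log line against a list of known patterns and counting the hits. The following is a minimal sketch and not the tool that generated the table: the function name `triage` and the sample log lines are hypothetical, and only a few of the patterns from the table are included.

```python
import re
from collections import Counter

# A subset of the patterns from the triage table above.
PATTERNS = [
    r"Notification block pinning limit reached. Unpinning block with hash = .*",
    r".* banned, disconnecting, reason: .*",
    r"Error importing block .*: .*",
    r"Error while dialing .*: .*",
]
COMPILED = [(pattern, re.compile(pattern)) for pattern in PATTERNS]


def triage(log_lines):
    """Count lines per known pattern; collect lines that match none of them."""
    counts = Counter()
    unknown = []
    for line in log_lines:
        for pattern, regex in COMPILED:
            if regex.search(line):
                counts[pattern] += 1
                break
        else:
            # No known pattern matched this line.
            unknown.append(line)
    return counts, unknown


# Hypothetical log lines, for illustration only.
sample = [
    "Notification block pinning limit reached. Unpinning block with hash = 0xabc",
    "Notification block pinning limit reached. Unpinning block with hash = 0xdef",
    "Error while dialing /dns/example.invalid/tcp/30333: timeout",
    "some line that matches nothing",
]

counts, unknown = triage(sample)
for pattern, count in counts.most_common():
    print(f"{count:<10} | {pattern}")
print("Unknown:", unknown)
```

Lines that end up in the unknown bucket are candidates for new patterns, which is how the two entries under "Unknown" would be handled.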
lexnv commented 4 months ago

A similar behavior can be seen with libp2p backend:

# Tue Jul  9 10:43:29 2024
ps -eo size,pid,user,start,command --sort -size | grep polka
28559276 473683 ubuntu   Jul 05 ./target/release/polkadot -d /home/ubuntu/workspace/kusama-db-libp2p --chain kusama --in-peers 50 --out-peers 50 --pruning=1000 --sync=warp --network-backend libp2p --prometheus-port 9616 --detailed-log-output
21120904 472124 ubuntu   Jul 05 ./target/release/polkadot -d /home/ubuntu/workspace/kusama-db-litep2p --chain kusama --pruning=1000 --in-peers 50 --out-peers 50 --sync=warp --network-backend litep2p --detailed-log-output

Considering that the node was not terminated by the OOM killer after ~90 days, and that the litep2p backend consumes less memory than libp2p, I would treat this issue with a lower priority for now.