paritytech / substrate


Memory consumption with large Transaction Pool #11503

Open shunsukew opened 2 years ago

shunsukew commented 2 years ago

Is there an existing issue?

Experiencing problems? Have you tried our Stack Exchange first?

Description of bug

A Substrate node configured with a large transaction pool limit (e.g. --pool-limit 65536, larger than the default pool limit) consumes the machine's entire 32 GB of memory once the pooled transaction count hits around 20k. Memory usage grows rapidly and reaches 100% of the 32 GB.

Are there any potential issues around the transaction pool, such as a memory leak?
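For anyone trying to correlate pool growth with process memory, here is a rough monitoring sketch; the binary name astar-collator and the 10-second interval are assumptions for illustration, not from the report:

    # Sample the resident set size (RSS, in KiB) of the node process every 10 seconds (Linux).
    watch -n 10 'ps -C astar-collator -o rss=,comm='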

Case 1. Transaction pool at 20k (2022-05-22 21:50:00 ~ 2022-05-22 23:00:00 UTC+8)

Transaction pool

(Screenshot 2022-05-22 23:07:17)

Mem

(Screenshot 2022-05-22 23:06:58)

CPU

(Screenshot 2022-05-22 23:06:47)

Once memory usage hits 100%, the machine becomes unreachable.

Case 2. Default transaction pool limit (2022-05-22 23:20:00 ~ 2022-05-22 23:40:00 UTC+8)

Transaction pool

(Screenshot 2022-05-22 23:40:10)

Mem

(Screenshot 2022-05-22 23:40:01)

CPU

(Screenshot 2022-05-22 23:39:51)

Machine spec: CPU-optimized machine (fast CPU), 16 vCPU, 32 GB memory, General Purpose SSD (16 KiB IOPS, 250 MiB/s throughput).

Steps to reproduce

Set the pool limit (--pool-limit) to more than 20k and accumulate 19k+ transactions in the pool. (I did this by running an Astar node and syncing blocks with peers as of 2022-05-23.)
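A minimal reproduction sketch based on these steps; only --pool-limit comes from the report, while the binary name and chain are assumptions drawn from the reporter's Astar setup:

    # Start a node with a transaction pool limit well above the default,
    # then let it sync with peers until the pool holds ~20k transactions.
    astar-collator --chain astar --base-path /var/lib/astar --pool-limit 65536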

bkchr commented 2 years ago

Did you also change --pool-kbytes?

shunsukew commented 2 years ago

@bkchr Thank you for the comment. No, I didn't. Does that mean the default value is used?

--pool-kbytes <COUNT>
            Maximum number of kilobytes of all transactions stored in the pool [default: 20480]
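
For context, the pool enforces both a transaction count limit and a byte limit, so with the default --pool-kbytes the pool's raw transaction data stays capped at roughly 20 MiB (20480 KiB) even when --pool-limit is raised. A hedged sketch of raising both together (the 8x scaling here is an arbitrary illustration, not a recommendation):

    # Raise the pool's transaction count limit and scale the byte limit with it.
    # 65536 transactions at ~2 KiB each would need on the order of 128 MiB,
    # far above the default 20480 KiB cap.
    astar-collator --pool-limit 65536 --pool-kbytes 163840
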
bkchr commented 2 years ago

@koute could you maybe look into this?

koute commented 2 years ago

@koute could you maybe look into this?

Sure; I'm on it.

koute commented 2 years ago

The issue doesn't seem to reproduce on a normal Kusama node (or maybe it just needs to be synced from scratch; I haven't checked yet). However, I think I've managed to reproduce it on the newest astar-collator (I haven't let it run until memory exhaustion, but it looks like the memory is growing). I'm profiling it to see why it is growing.
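Not necessarily the tooling used here, but one generic way to heap-profile a node binary while it reproduces the growth (assumes heaptrack is installed and the binary keeps its symbols):

    # Record allocations while the node runs, then inspect the largest live allocations.
    heaptrack ./astar-collator --chain astar --pool-limit 65536
    # After stopping the node, open the recorded profile (file name is printed by heaptrack).
    heaptrack_gui heaptrack.astar-collator.*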

koute commented 2 years ago

@shunsukew For reference, can you provide the exact command line you've used to launch your node?

koute commented 2 years ago

So I think I do see the memory usage increase, but it's nowhere near as fast as in the screenshots posted by @shunsukew. I'll leave it running overnight (and if it doesn't reproduce I'll try spamming it with fake transactions). However, it'd be nice if there were a way to reproduce the behavior from the original issue, as that would make it a lot easier to investigate.

In the meantime I've also noticed that the Astar node uses the system allocator and doesn't use jemalloc like Polkadot does; this is not good, and it might be contributing to the problem. (I could check if I knew how to reproduce it exactly.) I've put up a PR enabling jemalloc for your node: https://github.com/AstarNetwork/Astar/pull/653
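A rough way to check whether a given node binary was built with jemalloc as its global allocator; this is only a heuristic and assumes the relevant symbols or strings were not stripped from the binary:

    # Look for jemalloc symbols or strings in the binary; no output suggests the
    # system allocator is being used.
    nm -C /usr/local/bin/astar-collator 2>/dev/null | grep -i jemalloc
    strings /usr/local/bin/astar-collator | grep -i -m1 jemalloc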

bLd75 commented 2 years ago

Hi @koute, thank you very much for the PR!

Below are tests made on a collator node with this simple command (before and after the change made at ~19:15):

    /usr/local/bin/astar-collator --collator --rpc-cors all --name collator --base-path /var/lib/astar --state-cache-size 0 --prometheus-external --pool-limit 65536 --port 30333 --chain astar --parachain-id 2006 --telemetry-url 'wss://telemetry.polkadot.io/submit/ 0'

I think the node has to be fully synced to reproduce. The data reported previously was from a public node (archive mode).

Metrics on the same time frame

Transaction queue: (image)

RAM (32 GB total) increases fast but doesn't get totally full from the beginning: (image)

CPU consumption doesn't change much but gets higher: (image)

Peer count becomes unstable: (image)

Network traffic increases enormously; the node is sending an incredible amount of data: (image)
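For anyone reproducing these observations, the charts above presumably come from the node's Prometheus endpoint (enabled here via --prometheus-external). A rough way to pull transaction-pool-related metrics directly; port 9615 is Substrate's default Prometheus port, and the grep pattern is a guess rather than an exact metric name:

    # Dump transaction-pool-related metrics from the node's Prometheus endpoint.
    curl -s http://localhost:9615/metrics | grep -i txpool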

I will test your PR as the next step.

shunsukew commented 2 years ago

@koute @bLd75 Thank you for the PR and the additional information.