paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

networking/litep2p: Node running litep2p seems to be leaking memory #6149

Open alexggh opened 3 weeks ago

alexggh commented 3 weeks ago

Looking over the dashboards for our Kusama validators, memory usage on the node running litep2p seems to be constantly increasing. It is now at 12 GiB, while all other nodes are stable at around 3-4 GiB.

Screenshot 2024-10-21 at 12 01 07

https://grafana.teleport.parity.io/goto/Uoh4CQmHg?orgId=1

cc: @paritytech/networking

lexnv commented 3 weeks ago

Thanks Alex for raising this! 🙏

We also had a similar issue with addresses, although the memory consumed here is on the order of GiB (https://github.com/paritytech/polkadot-sdk/pull/5998).

A few places come to mind to look at next:

I would start by looking at litep2p and then move on to the substrate code.

lexnv commented 3 weeks ago

Initial testing with debug logs built on top of branch lexnv/holistic-litep2p-test-dhtandpeerset shows memory leaks in litep2p:

I will leave my node running overnight and follow up with patches.

Screenshot 2024-10-21 at 19 32 21

lexnv commented 3 weeks ago

We need metrics to filter out potential leaks (i.e., any tracked state whose size increases monotonically is a concern).
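As a rough sketch of what such a metric could look like: periodically sample the length of each internal collection and flag any whose size has grown on several consecutive samples. `GrowthTracker`, `record`, `suspects`, and the collection names used with them are hypothetical and are not part of litep2p or substrate. A real setup would more likely export the sizes as Prometheus gauges and spot the trend on a dashboard.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not part of litep2p): remembers the last sampled
/// size of each named collection and how many consecutive samples were
/// strictly larger than the one before.
#[derive(Default)]
pub struct GrowthTracker {
    // name -> (last sampled size, consecutive strictly-increasing samples)
    samples: HashMap<&'static str, (usize, u32)>,
}

impl GrowthTracker {
    /// Record the current length of a collection, e.g. `map.len()`.
    pub fn record(&mut self, name: &'static str, len: usize) {
        // The first sample for a name only initialises the entry.
        let (last, streak) = self.samples.entry(name).or_insert((len, 0));
        if len > *last {
            *streak += 1;
        } else {
            // Any flat or shrinking sample resets the streak.
            *streak = 0;
        }
        *last = len;
    }

    /// Names whose size grew on at least `min_streak` consecutive samples,
    /// sorted alphabetically for stable output.
    pub fn suspects(&self, min_streak: u32) -> Vec<&'static str> {
        let mut names: Vec<&'static str> = self
            .samples
            .iter()
            .filter(|(_, (_, streak))| *streak >= min_streak)
            .map(|(name, _)| *name)
            .collect();
        names.sort_unstable();
        names
    }
}
```

Calling `record` from a periodic timer with the length of each piece of tracked state, then logging `suspects(n)`, gives a cheap first filter. A collection that shrinks even occasionally drops off the list, while one that only ever grows stays on it.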

We have around 3 separate leaks:

1. Transport Manager State Leak

Screenshot 2024-10-22 at 12 39 27 Screenshot 2024-10-22 at 12 38 16

2. TCP/WebSocket Pending Dials Leak

Screenshot 2024-10-22 at 12 40 57

Screenshot 2024-10-22 at 14 01 57

3. TCP/WebSocket Cancellation Logic Leak

Screenshot 2024-10-22 at 14 02 45 Screenshot 2024-10-22 at 14 03 38

Screenshot 2024-10-22 at 12 41 49

Screenshot 2024-10-22 at 12 43 50

lexnv commented 3 weeks ago

Litep2p PRs

For more details and an explanation of the edge cases in which the leaks happen, see:

lexnv commented 2 weeks ago

Lower severity memory leaks in the ping and identify protocols: