Closed nisdas closed 3 months ago
Just to be clear, you did not see this issue on a prior version of go-libp2p? Which version was that?
@MarcoPolo Why is the 35.1 branch missing in the master branch? Can you merge it?
Are you running runtime.GC()
before capturing the profile? The dangling memory from the connection may not have been GC'd yet. When a quic connection is active it handles incoming packets by registering the conn in a map. When a conn is closed the value in the map is overwritten, but the conn is still on the heap waiting to be cleaned up by the GC.
If I try to repro this issue without doing runtime.GC()
when capturing the heap profile I see a similar flamegraph:
However, if I call runtime.GC()
to ensure I'm cleaning up any pending items I don't see anything out of the ordinary.
Here's my test code: https://gist.github.com/MarcoPolo/faca66afeef81edf84e880148d22a601
Are you sure it's a memory leak? How many QUIC connections are you handling? An easy way to check is using the Swarm Prometheus dashboard.
protocol.MaxConnUnprocessedPackets
is 256. receivedPacket
weighs 112 bytes, so this is about ~ 29 kB per QUIC connection. The 600 MB you're seeing would therefore correspond to roughly 2000 QUIC connection. Pre-allocating this much memory is not ideal, and we'll most likely refactor this logic when implementing support for GRO.
Yeap, we have updated from v0.33.1
@MarcoPolo . This wasn't an issue before on v0.33.1 , possibly something was refactored/changed with the dialer/swarm in either v0.34.x or v0.35.x ? I can try downgrading to v0.34.x to see if its still an issue there.
runtime.GC()
is called by the go runtime periodically as this is a live node rather than an isolated test. This node has been running for 3 weeks which means the runtime isn't clearing this out for some reason. Do you have nodes for other network running on go-libp2p reporting any similar issues ?
@marten-seemann Does very much appear to be on our end, my live node that was running has again had an increased heap from yesterday of about 50mb. At any one time there are only 15-20 active quic connections on my node. It has a maximum peer count of 70.
# HELP libp2p_swarm_connections_closed_total Connections Closed
# TYPE libp2p_swarm_connections_closed_total counter
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 1
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 120908
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 225525
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 362962
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 80
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32158
# HELP libp2p_swarm_connections_opened_total Connections Opened
# TYPE libp2p_swarm_connections_opened_total counter
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 1
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 120925
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 225533
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 362993
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 80
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32171
# HELP libp2p_swarm_dial_errors_total Dial Error
# TYPE libp2p_swarm_dial_errors_total counter
libp2p_swarm_dial_errors_total{error="canceled: concurrent dial successful",ip_version="ip4",transport="tcp"} 84201
libp2p_swarm_dial_errors_total{error="canceled: concurrent dial successful",ip_version="ip6",transport="tcp"} 17
libp2p_swarm_dial_errors_total{error="connection refused",ip_version="ip4",transport="tcp"} 37979
libp2p_swarm_dial_errors_total{error="deadline",ip_version="ip4",transport="quic-v1"} 29
libp2p_swarm_dial_errors_total{error="deadline",ip_version="ip4",transport="tcp"} 210743
libp2p_swarm_dial_errors_total{error="deadline",ip_version="ip6",transport="tcp"} 11
libp2p_swarm_dial_errors_total{error="other",ip_version="ip4",transport="quic-v1"} 347816
libp2p_swarm_dial_errors_total{error="other",ip_version="ip4",transport="tcp"} 216050
libp2p_swarm_dial_errors_total{error="other",ip_version="ip6",transport="quic-v1"} 65
libp2p_swarm_dial_errors_total{error="other",ip_version="ip6",transport="tcp"} 1
libp2p_swarm_dial_errors_total{error="timeout",ip_version="ip4",transport="quic-v1"} 112813
libp2p_swarm_dial_errors_total{error="timeout",ip_version="ip4",transport="tcp"} 62900
libp2p_swarm_dial_errors_total{error="timeout",ip_version="ip6",transport="quic-v1"} 1
libp2p_swarm_dial_errors_total{error="timeout",ip_version="ip6",transport="tcp"} 4
Swarm metrics from my node
I can try downgrading to v0.34.x to see if its still an issue there.
v0.34 and v0.35 are very similar. There was just a small breaking change. I wouldn't expect it to behave differently.
Could you share repro steps? Maybe I can try with Kubo. Do you have a sense of the connection churn per hour? e.g. could you share the metrics of a node an hour apart? (or even 4 measurements 15 min apart)
I wonder if you disconnected the node from the internet for a bit (1 minute?) and ran runtime.GC()
if you would still see the issue.
Could you share repro steps? Maybe I can try with Kubo. Do you have a sense of the connection churn per hour? e.g. could you share the metrics of a node an hour apart? (or even 4 measurements 15 min apart)
I can share the metrics in a bit, for Kubo just running this with both TCP and QUIC transports enabled for an extended period of time should show something. These are the options we currently use https://github.com/prysmaticlabs/prysm/blob/develop/beacon-chain/p2p/options.go#L63
Some connection metrics over 1 hour at 15 min intervals
Measurement 1:
# TYPE libp2p_swarm_connections_closed_total counter
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 2
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 121470
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 226686
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 364902
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 81
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32332
# HELP libp2p_swarm_connections_opened_total Connections Opened
# TYPE libp2p_swarm_connections_opened_total counter
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 2
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 121494
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 226694
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 364926
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 81
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32345
Measurement 2:
libp2p_swarm_connections_closed_total Connections Closed
# TYPE libp2p_swarm_connections_closed_total counter
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 2
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 121518
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 226789
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 365107
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 81
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32342
# HELP libp2p_swarm_connections_opened_total Connections Opened
# TYPE libp2p_swarm_connections_opened_total counter
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 2
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 121540
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 226797
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 365129
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 81
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32356
Measurement 3:
libp2p_swarm_connections_closed_total Connections Closed
# TYPE libp2p_swarm_connections_closed_total counter
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 2
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 121573
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 226915
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 365254
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 81
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32364
# HELP libp2p_swarm_connections_opened_total Connections Opened
# TYPE libp2p_swarm_connections_opened_total counter
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 2
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 121591
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 226925
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 365280
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 81
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32376
Measurement 4:
libp2p_swarm_connections_closed_total Connections Closed
# TYPE libp2p_swarm_connections_closed_total counter
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 2
libp2p_swarm_connections_closed_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 121626
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 227011
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 365424
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 81
libp2p_swarm_connections_closed_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32375
# HELP libp2p_swarm_connections_opened_total Connections Opened
# TYPE libp2p_swarm_connections_opened_total counter
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 13
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 2
libp2p_swarm_connections_opened_total{dir="inbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 6
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="",security="",transport="quic-v1"} 121645
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/mplex/6.7.0",security="/noise",transport="tcp"} 227021
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 365453
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="",security="",transport="quic-v1"} 33
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="false",ip_version="ip6",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 81
libp2p_swarm_connections_opened_total{dir="outbound",early_muxer="true",ip_version="ip4",muxer="/yamux/1.0.0",security="/noise",transport="tcp"} 32385
In Prysm, we have support for running Prysm nodes both with the Quic transport and the TCP transport. Recently we updated our version to be v0.35.0. This has been working for the most part, however while profiling long-running nodes running we have come across what appears to be a memory leak. Over the course of a few weeks the heap occupied by a prysm node has grown rather than stayed constant size. If you look at the heap profiles below, you can see the large amount of memory allocated by the respective quic and tcp dialers.
The above 2 screenshots capture the heatmaps occupied by QUIC connections. If you take a look, it is sizeable amount occupied by the pre-setup step of quic connections:
I am not very familiar with the internals of quic-go but it appears these streams/connections are not being appropriately garbage collected by go-libp2p ? You have old connections somehow still living on the heap.
Finally we did a profile differential for the same node of its current state and its state a few days ago:
With this heatmap it becomes obvious where the memory leak is coming from, it does appear the new release has introduced either a regression or new bug for old connections.
Version Information