libp2p / go-libp2p

Lotus sync issue: libp2p v0.34.1 #2858

Open rjan90 opened 1 week ago

rjan90 commented 1 week ago

Lotus updated to libp2p v0.34.1 in its latest release, Lotus v1.27.1, and we are getting reports from users encountering syncing issues that seem to be related to the resource manager:

{"level":"debug","ts":"2024-07-02T07:14:40.235Z","logger":"rcmgr","caller":"resource-manager/scope.go:480","msg":"blocked stream from constraining edge","scope":"stream-16657929","edge":"transient","direction":"Inbound","current":4233,"attempted":1,"limit":4233,"stat":{"NumStreamsInbound":0,"NumStreamsOutbound":4233,"NumConnsInbound":0,"NumConnsOutbound":0,"NumFD":0,"Memory":0},"error":"transient: cannot reserve stream: resource limit exceeded"}

Another report indicated that the user was unable to get peers after updating, but after a couple of node restarts they were able to get back in sync. Unfortunately, they were not able to capture a goroutine dump, but they will do so the next time they encounter the same issue.

Do you have any additional tips on what information to gather when encountering these rcmgr issues?

MarcoPolo commented 4 days ago

Setting up Grafana dashboards would really really help debug stuff like this.

The log points to an issue reserving a space for a new inbound stream in the "transient" scope. Streams are considered "transient" before we know what protocol they will be used for. Streams should only be transient until multistream finishes or the 10s timeout is hit, whichever comes first. I would be surprised if we were leaking "transient" streams, but it would be obvious in the dashboard if we are. Does this error log persist? Does the number in NumStreamsOutbound ever go down?
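To answer the two questions above without a dashboard, one option is to pull the counters out of the rcmgr debug lines and watch them over time. The following is a minimal sketch, assuming the log lines are JSON objects shaped like the one in the report; the `scopeStat`, `blockedEvent`, and `parseBlocked` names are hypothetical helpers, not part of go-libp2p.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// scopeStat mirrors the "stat" object in the rcmgr debug log line.
// (Hypothetical helper type, not a go-libp2p type.)
type scopeStat struct {
	NumStreamsInbound  int
	NumStreamsOutbound int
	NumConnsInbound    int
	NumConnsOutbound   int
	NumFD              int
	Memory             int64
}

// blockedEvent holds the fields of interest from one "blocked stream" line.
type blockedEvent struct {
	Ts        string    `json:"ts"`
	Edge      string    `json:"edge"`
	Direction string    `json:"direction"`
	Current   int       `json:"current"`
	Limit     int       `json:"limit"`
	Stat      scopeStat `json:"stat"`
}

// parseBlocked decodes a single JSON log line; unknown fields are ignored.
func parseBlocked(line string) (blockedEvent, error) {
	var ev blockedEvent
	err := json.Unmarshal([]byte(line), &ev)
	return ev, err
}

func main() {
	// The log line from the report above.
	line := `{"level":"debug","ts":"2024-07-02T07:14:40.235Z","logger":"rcmgr","caller":"resource-manager/scope.go:480","msg":"blocked stream from constraining edge","scope":"stream-16657929","edge":"transient","direction":"Inbound","current":4233,"attempted":1,"limit":4233,"stat":{"NumStreamsInbound":0,"NumStreamsOutbound":4233,"NumConnsInbound":0,"NumConnsOutbound":0,"NumFD":0,"Memory":0},"error":"transient: cannot reserve stream: resource limit exceeded"}`

	ev, err := parseBlocked(line)
	if err != nil {
		fmt.Fprintln(os.Stderr, "cannot parse log line:", err)
		return
	}
	// Print one compact record per event so successive lines can be compared.
	fmt.Printf("%s edge=%s dir=%s current=%d limit=%d streams_in=%d streams_out=%d\n",
		ev.Ts, ev.Edge, ev.Direction, ev.Current, ev.Limit,
		ev.Stat.NumStreamsInbound, ev.Stat.NumStreamsOutbound)
}
```

Running `parseBlocked` over every rcmgr line in a log file and comparing `streams_out` across timestamps would show whether the count ever drops or stays pinned at the limit.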

MarcoPolo commented 13 hours ago

We've tried to make it as easy as possible to get started with the dashboards, so there's a Docker Compose file that spins up everything you need. Refer to go-libp2p/dashboards/README.md for more detailed instructions. In case it helps, here are some step-by-step tips to get it working with Lotus.

  1. Go to go-libp2p/dashboards.
  2. Make this change to prometheus.yml. This tells Prometheus about Lotus's default metrics endpoint:
    diff --git a/dashboards/prometheus.yml b/dashboards/prometheus.yml
    index f0917188..bfa09fc5 100644
    --- a/dashboards/prometheus.yml
    +++ b/dashboards/prometheus.yml
    @@ -23,8 +23,8 @@ scrape_configs:
       honor_timestamps: true
       scrape_interval: 15s
       scrape_timeout: 10s
    -  metrics_path: /debug/metrics/prometheus
    +  metrics_path: /debug/metrics
       scheme: http
       static_configs:
       - targets:
    -    - host.docker.internal:5001
    +    - host.docker.internal:1234
  3. Run docker compose -f docker-compose.base.yml up on macOS, or docker compose -f docker-compose.base.yml -f docker-compose-linux.yml up on Linux.
  4. View metrics on Grafana at http://localhost:3000.