paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

p2p performance issues, regression? #6012

Open rvalle opened 1 month ago

rvalle commented 1 month ago

Hi!

We run full archive nodes for our analytics project, typically on constrained resources. Our nodes usually run like clockwork with a 2-peer in/out configuration.

Recently we updated our Polkadot node from Docker image v1.15.1 to v1.16.0, and p2p has started to get stuck, even after doubling the peer count. We reported this in the past, so perhaps there has been some kind of regression.

Here is what we are seeing now:

[Screenshot: Grafana panel, Polkadot Node Monitoring, last 24h, 2024-10-10 15:21:57]

Prior to this upgrade, p2p would not get stuck at all, or only very rarely and for very few blocks:

[Screenshot: Grafana panel, Polkadot Node Monitoring, v1.15.1, 2024-10-10 15:25:22]

We reported a similar issue in the past, here is some reference: https://github.com/paritytech/polkadot/issues/6696#issuecomment-1495424865

Back then it was fixed.

lexnv commented 1 month ago

The issue was fixed in the past with:

That fix has been reverted by:

cc perf issue: https://github.com/paritytech/polkadot-sdk/issues/5221

bkchr commented 1 month ago

@lexnv can we close this if another issue already tracks it?

lexnv commented 1 month ago

We still need to double-check this issue, as it might expose some regressions we introduced between 0.15 and 0.16.

The perf issue should already affect 0.15; I cc'ed it to make sure we don't forget to check protocol performance (which we might have missed with the libp2p update).

rvalle commented 1 month ago

@lexnv yes, 1.15 is also affected: I attempted a rollback and the issue is still there. Also note that bandwidth usage is on the order of 3x when using low peer numbers. This was also reported before and is most likely related.

Pay attention to the gap between the bandwidth reported and the bandwidth actually used. A 2-peer node can work with less than 1-3 Mb/s (in line with the reported figure), but it is now averaging 8 Mb/s with peaks of 13 Mb/s, despite reporting much less.
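
For reference, a rough sketch of the arithmetic behind that gap, using only the figures quoted above. The helper name `overhead_factor` is hypothetical, and taking 3 Mb/s (the upper end of the range) as the reported baseline is an assumption for illustration, not a value read from the node or from Grafana.

```python
def overhead_factor(observed_mbps: float, reported_mbps: float) -> float:
    """Return how many times larger the observed traffic is than the reported traffic."""
    if reported_mbps <= 0:
        raise ValueError("reported_mbps must be positive")
    return observed_mbps / reported_mbps


# Figures quoted in the comment above.
# Assumption: 3 Mb/s (upper end of the 1-3 Mb/s range) is used as the reported baseline.
REPORTED_MBPS = 3.0
AVERAGE_MBPS = 8.0
PEAK_MBPS = 13.0

average_factor = overhead_factor(AVERAGE_MBPS, REPORTED_MBPS)
peak_factor = overhead_factor(PEAK_MBPS, REPORTED_MBPS)

print(f"average: {average_factor:.1f}x, peak: {peak_factor:.1f}x")
```

Even against the most generous baseline, the average comes out a little under 3x, which matches the rough figure mentioned above; a lower baseline would make the gap larger.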

rvalle commented 1 month ago

@lexnv I have not tested much, but today I switched to the litep2p backend, and it seems to be free from the issue:

[Screenshot: Grafana panel, Polkadot Node Monitoring, 2024-10-15 17:13:43]

I would say that bandwidth usage has increased compared to the pre-regression version, but not as much as with the default implementation.