paritytech / polkadot-sdk

The Parity Polkadot Blockchain SDK
https://polkadot.com/

High warnings rate on Kusama after 0.9.31-rc6 update due to libp2p_yamux #758

Closed vstakhov closed 4 months ago

vstakhov commented 2 years ago

After our validators got updated to this version, there is a constant flow of warnings in the module libp2p_yamux. The logs are all similar:

WARN tokio-runtime-worker libp2p_yamux: dropping (Stream a9bc8821/65) because buffer is full

The sample warnings rate graph is here: https://grafana.parity-mgmt.parity.io/goto/V_rs4jHVk?orgId=1

CC @dmitry-markin @altonen

dmitry-markin commented 2 years ago

Looking at the place where this warning originates in libp2p, it looks like this is happening because a remote node is opening yamux substreams faster than our node is accepting them. The limit on requested but not yet accepted substreams is set to 25 here, and it seems to be part of a substream backpressure mechanism. It's likely that this is happening due to the overall high load of the node, and I don't quite understand how to optimize it right now.
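For illustration, a minimal sketch of the kind of bound being described; the type and names below are hypothetical and not the actual libp2p-yamux internals:

```rust
use std::collections::VecDeque;

/// Stand-in for the real substream type (hypothetical).
struct Substream;

/// The limit discussed in this issue.
const MAX_BUFFERED_INBOUND: usize = 25;

struct InboundBuffer {
    pending: VecDeque<Substream>,
}

impl InboundBuffer {
    /// Buffer a substream the remote has opened but we have not accepted yet.
    /// Once the buffer is full, new substreams are dropped, which is what
    /// produces the "dropping (Stream ...) because buffer is full" warning.
    fn push(&mut self, stream: Substream) {
        if self.pending.len() >= MAX_BUFFERED_INBOUND {
            eprintln!("dropping stream because buffer is full");
            drop(stream);
        } else {
            self.pending.push_back(stream);
        }
    }
}
```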

@tomaka do you have any ideas on the matter?

tomaka commented 2 years ago

This is really poor coding, but given the API constraints of libp2p there's no better way to implement it.

The proper fix would be to rewrite the entire networking stack (including libp2p), but an alternative fix is to raise this limit. 25 seems extremely low to me no matter what. The struct seems to have a size of around 40-50 bytes, so limiting it to 25 elements caps the memory usage at roughly 1.2 kiB. Even if we set the limit to 250000 it would still be reasonable.
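For scale, assuming the ~48-byte entry size estimated above (a back-of-the-envelope check, not library code):

```rust
/// Rough memory bound of the buffered-substream limit, assuming ~48 bytes per entry.
fn buffer_bytes(limit: usize) -> usize {
    limit * 48
}

fn main() {
    assert_eq!(buffer_bytes(25), 1_200);           // ≈ 1.2 kiB at the current limit
    assert_eq!(buffer_bytes(250_000), 12_000_000); // ≈ 12 MB even at a 250 000-entry limit
}
```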

They'd probably accept a PR that makes this constant configurable.

vstakhov commented 2 years ago

But why did this happen right after the upgrade? Was it a coincidence, or was it somehow caused by one of our changes (e.g. https://github.com/paritytech/substrate/pull/12434)?

tomaka commented 2 years ago

Probably a change that causes more substreams to be opened at the same time, or one that changes the CPU profile of the node such that there is now a longer delay between two consecutive polls of the networking, or something like that. If anything, https://github.com/paritytech/substrate/pull/12434 should have improved the situation.

ordian commented 1 year ago

We're observing a lot of these warnings recently on Kusama. We're also occasionally observing these errors:

ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID

Looking at the code, the latter seems to be a race condition with a peer disconnecting, possibly due to the same reason that causes the warning.

The reputation-change logs emitted by polkadot seem to be mostly about "Not interested in that parent hash" in bitfield-distribution, but those changes are not nearly enough to cause a disconnect.

ordian commented 1 year ago

(screenshot: finality lag, approval-checking lag, and disputes lag metrics)

We also see higher finality lag. As you can see, approval-checking lag is low and disputes lag is zero, which means GRANDPA votes are being propagated or accounted for more slowly. This error probably explains it:

ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID
eskimor commented 1 year ago

This is about multiplexed streams on a single TCP connection, right? Can we add some logging to see the protocol name of the stream that gets dropped?

eskimor commented 1 year ago

Should already be fixed? Maybe just an upgrade to 0.50 is necessary @dmitry-markin?

ordian commented 1 year ago

Should already be fixed? Maybe just an upgrade to 0.50 is necessary @dmitry-markin?

The PR you linked was released in libp2p-yamux 0.41.1: https://github.com/libp2p/rust-libp2p/blob/v0.50.0/muxers/yamux/CHANGELOG.md, while the recent polkadot release (https://github.com/paritytech/polkadot/blob/v0.9.36/Cargo.lock#L3727) still has 0.41.0. So cargo update -p libp2p-yamux should do the trick if the fix is indeed relevant here.

bkchr commented 1 year ago

@mxinden is it possible for libp2p to require the latest versions of its dependencies when doing a new release? This way we would get these kinds of updates automatically.

ordian commented 1 year ago

The Versi burn-in of paritytech/polkadot#6464 looks promising so far. The libp2p_yamux warnings have disappeared.

However, ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a bad peer ID still seems to be there. We also observe:

WARN tokio-runtime-worker sync: 💔 Invalid block justification provided by 12D3KooWQREC4ZFVMRzn4sZnK2emxw1zQ6b4UaYQ9XXueNkzzXq5: requested: 0x070a207bc9970053529aaa3e90a7c7ae7509abb1733448583b58901acfb2ec2e got: 0x61c5297308d8a54489bc2d3db79f8d1dd60beefe600a9033dc04f3803c4dcadd
WARN tokio-runtime-worker sync: 💔 Error importing block <snip>: block has an unknown parent
WARN tokio-runtime-worker sync: 💔 Verification failed for block 0xb70f9820ad892689d2fdbfa9e2067348de4cfb96ae37ce880c36d4a93af81f36 received from peer: 12D3KooWLNZ8yhBLSD6dGHAio2RKsnFPNaBtk6cdagv9nxRHVsd1, "UnknownBlock: State already discarded for 0xe2887aa9553bcfa47ec80016f209e1c194244c9c263ea3c5f71c296a104d2150"
bkchr commented 1 year ago

@ordian can we get a new issue for this with more logs? Maybe also from the other side that sent these invalid justifications.

mxinden commented 1 year ago

Sorry for the trouble the initial patch caused here. Great to see that you are using libp2p-yamux >=v0.41.1.

@mxinden is it possible for libp2p to require the latest versions of its dependencies when doing a new release? This way we would get these kinds of updates automatically.

Given that downstream projects like Substrate need some mechanism to upgrade to patch versions of direct dependencies, why not use the same mechanism for dependencies of dependencies? E.g. why not use dependabot to get notified of and upgrade to patch releases of all dependencies?

We could release a new libp2p patch version for every libp2p-* patch version. In cases where the patch of the libp2p-* release does not apply to all, that would unnecessarily restrict the version choice of users that don't need the patch. With the above in mind, namely that downstream users need a mechanism to update direct dependencies anyway, is your proposal worth it?

bkchr commented 1 year ago

We could release a new libp2p patch version for every libp2p-* patch version. In cases where the patch of the libp2p-* release does not apply to all, that would unnecessarily restrict the version choice of users that don't need the patch. With the above in mind, namely that downstream users need a mechanism to update direct dependencies anyway, is your proposal worth it?

I didn't mean a patch version of libp2p. I meant that on every major/minor version release of libp2p, you would pin the sub-libp2p dependencies to their latest patch release.

tdimitrov commented 1 year ago

ERROR tokio-runtime-worker sync: 💔 Called on_block_justification with a peer ID of an unknown peer occurred twice on Kusama this week. Time range of the second occurrence -> 2023-04-08 12:27:01 to 2023-04-08 12:29:34 UTC.

altonen commented 1 year ago

@tdimitrov

This is a known issue. It happens because the response queue in syncing may contain "stale" responses, i.e. responses from peers who have since disconnected. I'll try to fix it this week.
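A rough sketch of the kind of fix being described, with hypothetical types rather than the actual Substrate sync code: drop queued responses whose peer has since disconnected before they reach on_block_justification.

```rust
use std::collections::{HashSet, VecDeque};

/// Stand-in for the real libp2p PeerId (hypothetical).
type PeerId = u64;

/// A queued response awaiting processing; payload elided.
struct PendingResponse {
    peer: PeerId,
}

/// Drain the response queue, discarding "stale" responses from peers
/// that are no longer connected, so they never reach the import path.
fn drain_valid_responses(
    queue: &mut VecDeque<PendingResponse>,
    connected: &HashSet<PeerId>,
) -> Vec<PendingResponse> {
    queue
        .drain(..)
        .filter(|resp| connected.contains(&resp.peer))
        .collect()
}
```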

mxinden commented 1 year ago

We could release a new libp2p patch version for every libp2p-* patch version. In cases where the patch of the libp2p-* release does not apply to all, that would unnecessarily restrict the version choice of users that don't need the patch. With the above in mind, namely that downstream users need a mechanism to update direct dependencies anyway, is your proposal worth it?

I didn't mean a patch version of libp2p. I meant that on every major/minor version release of libp2p, you would pin the sub-libp2p dependencies to their latest patch release.

For the record, this will most likely be the new convention starting with the next breaking rust-libp2p release. https://github.com/libp2p/rust-libp2p/pull/3715

thomaseizinger commented 1 year ago

Looking at the place where this warning originates in libp2p, it looks like this is happening because a remote node is opening yamux substreams faster than our node is accepting them. The limit on requested but not yet accepted substreams is set to 25 here, and it seems to be part of a substream backpressure mechanism. It's likely that this is happening due to the overall high load of the node, and I don't quite understand how to optimize it right now.

Unfortunately, yamux doesn't have any backpressure mechanism for newly opened streams. Thus, if the current node is slower at accepting new streams than the remote is at opening them, our only choice is to drop new streams to avoid resource exhaustion. There is unfortunately no correct magic number for when to do that.

What we can do is try and have all nodes play nice and stop opening new streams when they realize that the other node is slow. In yamux, we call this the "ACK backlog" (opened but not yet acknowledged streams). With a recent refactoring in rust-yamux, this is now possible to implement. It is tracked here: https://github.com/libp2p/rust-yamux/issues/150

We can then align our magic number to the magic number for the allowed ACK backlog (256 as per spec). At that point, I think we should effectively no longer drop streams anywhere, assuming all implementations adhere to the spec (and thus to the 256 ACK backlog limit).
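As a rough sketch of that rule (hypothetical types, not the rust-yamux implementation): a node tracks how many of its outbound streams the remote has not yet acknowledged and stops opening new ones once that backlog reaches the spec limit.

```rust
/// ACK backlog limit from the yamux spec.
const MAX_ACK_BACKLOG: usize = 256;

/// Tracks outbound streams that were opened but not yet acknowledged.
struct Outbound {
    awaiting_ack: usize,
}

impl Outbound {
    /// A well-behaved opener only opens a new stream while the backlog is below the limit.
    fn can_open_stream(&self) -> bool {
        self.awaiting_ack < MAX_ACK_BACKLOG
    }

    fn on_stream_opened(&mut self) {
        self.awaiting_ack += 1;
    }

    fn on_stream_acknowledged(&mut self) {
        self.awaiting_ack = self.awaiting_ack.saturating_sub(1);
    }
}
```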

Help in getting there would be much appreciated!

lexnv commented 4 months ago

@vstakhov are we still getting this error with the latest libp2p version bump?

I did not see this while running a local node for ~4 days with the libp2p backend, and it did not reproduce during versi-net triaging.

ordian commented 4 months ago

Indeed, we haven't seen these warnings on our Kusama validators in the past 7 days.