paritytech / substrate

Substrate: The platform for blockchain innovators

Lost peers due to state-cache #10130

Closed GopherJ closed 1 year ago

GopherJ commented 3 years ago

After upgrading to polkadot-v0.9.10 & polkadot-v0.9.11, we started to observe a "no peers" issue on some of our collators. With sync=trace turned on, I found the following logs:

image

GopherJ commented 3 years ago

I'd really like to know why it's refused. Is it because of a bad reputation? How can I check this? Thanks.

GopherJ commented 3 years ago

The connection to three other bootnodes is refused:

2021-10-31 02:04:46.972 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz
2021-10-31 02:04:46.973 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz") failed: Refused.
2021-10-31 02:04:46.974 DEBUG tokio-runtime-worker sync: [Parachain] 12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz disconnected
2021-10-31 02:06:32.190 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd
2021-10-31 02:06:32.191 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd") failed: Refused.
2021-10-31 02:06:32.191 DEBUG tokio-runtime-worker sync: [Parachain] 12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd disconnected
2021-10-31 02:07:03.711 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWL63x8ZPkY2ZekUqyvyNwsakwbuy8Rq3Dt9tJcxw5NFTt
2021-10-31 02:07:03.810 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWL63x8ZPkY2ZekUqyvyNwsakwbuy8Rq3Dt9tJcxw5NFTt") failed: Refused.

but it seems no reason was given.

GopherJ commented 3 years ago

https://github.com/paritytech/substrate/blob/5e2b0f041c8a10c3ceb06721389e1d5c95380e65/client/network/src/protocol.rs#L1480-L1483

GopherJ commented 3 years ago

@tomaka sorry to ping you, but we would like to fix this issue as it's affecting a live network. Do you mind having a look?

GopherJ commented 3 years ago

Just discovered it's back now with 17 peers, so this error only lasted a few hours.

GopherJ commented 3 years ago

this error happened again on one of our collators: https://telemetry.polkadot.io/#list/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b

image
tomaka commented 3 years ago

The Refused error happens if the reputation of the source node is too low, or if the node targeted by the request is overloaded and incapable of processing more block requests (to prevent DoS attacks).

GopherJ commented 3 years ago

The Refused error happens if the reputation of the source node is too low, or if the node targeted by the request is overloaded and incapable of processing more block requests (to prevent DoS attacks).

How do I check the reputation? The latter shouldn't apply in our case, because it's our own node and we haven't found any sign of a DoS attack yet.

tomaka commented 3 years ago

You can use -l peerset=trace to see all reputation changes. If there's no error in the logs, it is likely that this doesn't relate to reputations. Instead, chances are that the CPU or disk the nodes are running on isn't powerful enough.

GopherJ commented 3 years ago

ok let me test

GopherJ commented 3 years ago
image

I saw this: from this disconnected collator's perspective, the others have a pretty bad reputation.

If I check the reputation from the other nodes' perspective, this disconnected node has a reputation of 0.

tomaka commented 3 years ago

Reputations start being "bad" below approximately -2^28 if I remember correctly. The reputations displayed here shouldn't have any consequence.

GopherJ commented 3 years ago

@tomaka so you are saying that we should increase the VPS size, right? Currently we are using 8 cores + 16 GB RAM and 10 Gbps for our collators. Do you have suggestions?

tomaka commented 3 years ago

This should be more than enough, as far as I know.

In my opinion, something (i.e. some event) happened and caused the nodes to act very slowly for a few hours. It would be nice to figure out what this "something" is, but we don't have any clue here unfortunately. Getting a more powerful machine would mitigate the problem if/when it happens again.

GopherJ commented 3 years ago

@tomaka OK, we will investigate it and let you know. Thanks a lot for the help.

GopherJ commented 3 years ago

@tomaka we didn't find any attack, so I believe that part is fine. We also don't have any error logs yet, but we constantly have one collator disconnected from the others:

https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b

this time it's heiko-collator-0

GopherJ commented 3 years ago

hi @tomaka, I think there is a bug: restarting the nodes helps for some hours, but then they start to lose peers again.

Do you have any advice we can apply? Until now we don't have any error logs; the number of peers just goes down periodically.

GopherJ commented 3 years ago
image

This really doesn't look good. What kind of information can I post to help fix this issue?

tomaka commented 3 years ago

Please check the CPU usage of the nodes and see if it correlates with the disconnections.

GopherJ commented 3 years ago

@tomaka

image

I think it looks totally fine. Actually we didn't find anything strange except Request to peer PeerId(...) failed: Refused.

GopherJ commented 3 years ago

hi @tomaka, we are still suffering from this issue; we are restarting our collators every day :) it helps, but not much

https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b

as you can see from the following, heiko-collator-4 and heiko-collator-7 got refused by the others today

image
GopherJ commented 2 years ago

I tried to use --reserved-nodes, but it keeps disconnecting no matter what I do

image
GopherJ commented 2 years ago

After adding --state-cache-size 1 it looks much better now, but I'm not sure yet if it's resolved.

GopherJ commented 2 years ago

Our collators have been working fine for the last three days after adjusting --state-cache-size to 1, so I think this issue can be closed.

nazar-pc commented 2 years ago

I see the same issue on my network with a few nodes. There was nothing special happening there; the node was working fine for a few days and suddenly this happened. Shouldn't it "just work" without the need to disable the state cache?

I think this should be reopened for further investigation; there is clearly something wrong if it happens on a simple network with no misbehavior.

GopherJ commented 2 years ago

@tomaka @bkchr @nazar-pc I'm reopening this because there are definitely some issues with the cache.

bkchr commented 2 years ago

With what cache? The state cache?

nazar-pc commented 2 years ago

We have a simple and reliable way to reproduce it (though inconvenient).

With a blockchain size (chains directory) of ~200 GB and 2 nodes available, a third node joining the network is fairly quickly banned by the other nodes. This happens every time, long before the 100 GB mark. The last attempt failed after just a few gigabytes.

Are there some parameters right now that can be tweaked to fix this behavior?

bkchr commented 1 year ago

@nazar-pc is this still relevant?

nazar-pc commented 1 year ago

Likely yes. What we did on our side is fork Substrate to decrease the amount of data that a block range response can contain; together with https://github.com/paritytech/substrate/pull/12146 it got much better, but I don't believe it is fundamentally solved.

bkchr commented 1 year ago

What limit did you decrease?

nazar-pc commented 1 year ago

MAX_BODY_BYTES: https://github.com/paritytech/substrate/commit/5f851f1a8322421c1562f00c6ec2de2332abfebd
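For context, a minimal sketch of the kind of change such a fork makes, assuming the upstream 8 MiB default quoted later in this thread; the exact value used in the linked commit isn't quoted here, so the reduced figure below is purely illustrative:

```rust
// Upstream cap on the total size of a block response body (8 MiB, matching the
// figure quoted later in this thread).
const MAX_BODY_BYTES: usize = 8 * 1024 * 1024;

// Hypothetical reduced cap, as a fork might use it: a smaller body means a
// response can finish within the request timeout on slower links. The actual
// value in the linked commit is not quoted in this thread.
const REDUCED_MAX_BODY_BYTES: usize = 2 * 1024 * 1024;
```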

bkchr commented 1 year ago

CC @altonen: the title here is probably misleading, as this should not be related to the state-cache (which doesn't even exist anymore today), but is probably some networking issue, given that it improves when the mentioned max body size is decreased.

@nazar-pc could you provide your mentioned reproduction?

nazar-pc commented 1 year ago

I'm afraid it is no longer there, as we are in between our previous and current testnet right now. I can ask the team to try and reproduce it, but it may take a while.

The gist of our observation is the following: a node was making a request, but the response was so big (initially significantly bigger than the default MAX_BODY_BYTES due to old implementation logic) that the node wasn't always able to pull it within the keep-alive timeout (I believe that is what it's called). With https://github.com/paritytech/substrate/pull/12146 and the MAX_BODY_BYTES decrease, we lowered the maximum bandwidth expected of nodes before the timeout kicks in, and that way the connection wasn't dropped. So the peers are connected to each other and they successfully transfer data, but because a single response takes so long (it shouldn't block the TCP connection, though, because messages are fragmented by the muxer anyway), it looks as if the connection has been idling for a long time, so it times out.

This was primarily reproducible under slow network conditions, like pulling something from behind the Chinese firewall from a server in the EU, as one of the easier-to-reproduce examples. Those with strong connectivity didn't have issues, but we support a diverse set of users in our protocol and need to be prepared for these less-than-optimal conditions.
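A rough illustration of that failure mode, as a sketch with assumed figures (the 8 MB body cap and 20 s timeout mentioned further down in this thread, not measurements from this network): at a given link speed, one full-size response can take longer than the request timeout, so the connection is treated as idle and dropped even though data is flowing.

```rust
// Transfer time of a single response at a given link speed, compared against the
// request-response timeout. All figures are assumptions for illustration only.
fn transfer_secs(response_bytes: u64, link_mbit_per_s: f64) -> f64 {
    (response_bytes * 8) as f64 / (link_mbit_per_s * 1_000_000.0)
}

fn main() {
    let response_bytes = 8 * 1024 * 1024_u64; // one full-size block response
    let timeout_secs = 20.0; // assumed request-response timeout

    for link in [10.0, 3.4, 0.5] {
        let t = transfer_secs(response_bytes, link);
        println!(
            "{link} Mbit/s: {t:.0} s to pull the response -> {}",
            if t > timeout_secs { "times out, peer dropped" } else { "ok" }
        );
    }
}
```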

altonen commented 1 year ago

I can request the team to try and reproduce it, but it may take a while.

This would be much appreciated. @dmitry-markin @melekes could you take a look at this?

bkchr commented 1 year ago

So the peers are connected to each other and they successfully transfer data, but because a single response takes so long (it shouldn't block the TCP connection, though, because messages are fragmented by the muxer anyway), it looks as if the connection has been idling for a long time, so it times out.

Ty! Yeah, maybe our timeout mechanism isn't implemented properly. I don't know how it works exactly, but it should detect that progress is being made and not require some kind of ping-pong (I think that is what we are doing) in these cases.

dmitry-markin commented 1 year ago

I can request the team to try and reproduce it, but it may take a while.

This would be much appreciated. @dmitry-markin @melekes could you take a look at this?

I can have a look at this and try reproducing the issue with timeouts on slow networks after merging some in-progress PRs.

dmitry-markin commented 1 year ago

May be connected issue: https://github.com/paritytech/substrate/issues/12105

dmitry-markin commented 1 year ago

So far it looks like the theoretical bandwidth requirement, based on MAX_BODY_BYTES of 8 MB and the block request protocol timeout of 20 s, is 8 MB / 20 s ≈ 3.4 Mbit/s. In my (arbitrary) tests syncing the Polkadot chain, 1 Mbit/s was enough to download blocks around block number 14357815, but a 512 kbit/s connection limit led to the block request protocol timing out after 20 seconds, and the peer we were syncing from being disconnected and backed off. Things can theoretically be improved client-side by reducing the number of blocks requested in a single request, i.e. MAX_BLOCKS_TO_REQUEST. Reducing it from 64 to 32 made sync work in my setup at 512 kbit/s.
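A back-of-envelope check of those figures, as a sketch (only the 8 MB cap, the 20 s timeout and the 64-block default come from the comment above; everything else is illustrative):

```rust
// Bandwidth a peer must sustain to deliver a full response before the block
// request timeout. Halving the number of blocks per request roughly halves the
// typical body size (assuming block size, not the byte cap, is the binding
// limit) and therefore the required bandwidth.
fn required_mbit_per_s(body_bytes: u64, timeout_secs: u64) -> f64 {
    (body_bytes * 8) as f64 / timeout_secs as f64 / 1_000_000.0
}

fn main() {
    let max_body_bytes = 8 * 1024 * 1024_u64; // MAX_BODY_BYTES
    let timeout_secs = 20_u64; // block request protocol timeout

    // 64 blocks per request (full-size body): ~3.4 Mbit/s needed.
    println!("64 blocks: {:.1} Mbit/s", required_mbit_per_s(max_body_bytes, timeout_secs));
    // 32 blocks per request (roughly half the body): ~1.7 Mbit/s needed.
    println!("32 blocks: {:.1} Mbit/s", required_mbit_per_s(max_body_bytes / 2, timeout_secs));
}
```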

It looks like there is no solution other than "turning knobs here and there" without modifying libp2p, because libp2p considers the whole request to have timed out if the response was not received in full. On the other hand, I'm not sure that keeping connections alive when they can't meet our timeout requirements but are still transmitting something really makes sense (e.g., slowloris attack?).

nazar-pc commented 1 year ago

Yes, we have decreased the requirements in our protocol to ~1 Mbps; the unfortunate thing is that those are constants, so we had to fork Substrate to make those changes.

bkchr commented 1 year ago

It looks like there is no solution other than "turning knobs here and there" without modifying libp2p, because libp2p considers the whole request to have timed out if the response was not received in full. On the other hand, I'm not sure that keeping connections alive when they can't meet our timeout requirements but are still transmitting something really makes sense (e.g., slowloris attack?).

But could we not be smarter by starting with fewer blocks, calculating the bandwidth, and then increasing the number of blocks to request based on what we measured? We would then only need some minimum bandwidth requirement.

dmitry-markin commented 1 year ago

But could we not be smarter by starting with fewer blocks, calculating the bandwidth, and then increasing the number of blocks to request based on what we measured? We would then only need some minimum bandwidth requirement.

In theory we could, but adaptive congestion control at the application layer would be more complicated than just calculating the bandwidth once and setting the block count in the request. We would also have to account for changing network conditions, the influence of data transmitted in parallel by other protocols / parallel downloads from multiple peers (not sure if we have this), and control stability issues. I.e., the congestion control algorithms in TCP are there for a reason.

We can implement some POC, but the algorithm must be extremely dumb and straightforward so as not to introduce new issues (I especially worry about the stability of the feedback loop).
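For what it's worth, a deliberately dumb sketch of what such a POC could look like (hypothetical names and thresholds, not Substrate code): grow the per-request block count additively while responses arrive with plenty of headroom, and back off multiplicatively on a timeout, AIMD-style, to keep the feedback loop simple.

```rust
use std::time::Duration;

/// Hypothetical adaptive sizer for block requests (illustration only).
struct BlockRequestSizer {
    blocks_per_request: u32,
    min: u32,
    max: u32,
}

impl BlockRequestSizer {
    fn new() -> Self {
        // Start small; 64 mirrors the current MAX_BLOCKS_TO_REQUEST upper bound.
        Self { blocks_per_request: 8, min: 1, max: 64 }
    }

    /// Feed back how long the last response took relative to the timeout in force.
    fn on_response(&mut self, elapsed: Duration, timeout: Duration) {
        if elapsed >= timeout {
            // Timed out: halve the request size (multiplicative decrease).
            self.blocks_per_request = (self.blocks_per_request / 2).max(self.min);
        } else if elapsed < timeout / 2 {
            // Plenty of headroom: grow by one block (additive increase).
            self.blocks_per_request = (self.blocks_per_request + 1).min(self.max);
        }
        // Otherwise hold steady, to avoid oscillation of the feedback loop.
    }

    fn blocks_per_request(&self) -> u32 {
        self.blocks_per_request
    }
}
```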

altonen commented 1 year ago

I agree with @dmitry-markin here

There's already so much abstraction below the request-response protocol that implementing adaptive streaming all the way up there sounds quite difficult to get working properly, except for the dumb case. The request-response protocol would have imperfect information about what is happening in the stream, because other protocols are also being multiplexed and consume some of the available bandwidth. Also, once the available bandwidth is over-estimated and too much data is sent, the connection is closed, which is quite a bit harsher than TCP dropping packets, adjusting the window size and resending those packets.

Maybe we could make those knobs more accessible so that people wouldn't have to fork Substrate.

bkchr commented 1 year ago

Good points! It was just a random idea that I wanted to write down. I hadn't put that much thought into it!

Maybe we could make those knobs more accessible so that people wouldn't have to fork Substrate.

Yes! This wasn't really possible up to now, but now that sync is almost independent it should work!
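Something like the following is what "making those knobs accessible" could look like (a hypothetical sketch, not an existing Substrate API): the hard-coded limits become fields of a configuration struct with today's constants as defaults, so downstream chains can tune them without forking.

```rust
/// Hypothetical sync limits that a node could override in its configuration.
pub struct SyncLimits {
    /// Maximum total size of a block response body, in bytes.
    pub max_body_bytes: usize,
    /// Maximum number of blocks requested in a single block request.
    pub max_blocks_to_request: u32,
}

impl Default for SyncLimits {
    fn default() -> Self {
        Self {
            max_body_bytes: 8 * 1024 * 1024, // matches the current constant
            max_blocks_to_request: 64,       // matches the current constant
        }
    }
}
```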

Polkadot-Forum commented 1 year ago

This issue has been mentioned on Polkadot Forum. There might be relevant details there:

https://forum.polkadot.network/t/april-updates-for-substrate-and-polkadot-devs/2764/1