Closed · GopherJ closed this 1 year ago
I'd really like to know why it's refused. Is it because of a bad reputation? How can I check this? Thanks.
The connection to another three bootnodes is refused:
```
2021-10-31 02:04:46.972 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz
2021-10-31 02:04:46.973 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz") failed: Refused.
2021-10-31 02:04:46.974 DEBUG tokio-runtime-worker sync: [Parachain] 12D3KooWLUTzbrJJDowUKMPfEZrDY6eH8HXvm8hrG6YrdUmdrKPz disconnected
2021-10-31 02:06:32.190 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd
2021-10-31 02:06:32.191 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd") failed: Refused.
2021-10-31 02:06:32.191 DEBUG tokio-runtime-worker sync: [Parachain] 12D3KooWA8jSwEbscptbwv1KqY7d7n2qURbd6zUaaPvzTVBMMgSd disconnected
2021-10-31 02:07:03.711 DEBUG tokio-runtime-worker sync: [Parachain] Connected 12D3KooWL63x8ZPkY2ZekUqyvyNwsakwbuy8Rq3Dt9tJcxw5NFTt
2021-10-31 02:07:03.810 DEBUG tokio-runtime-worker sync: [Parachain] Request to peer PeerId("12D3KooWL63x8ZPkY2ZekUqyvyNwsakwbuy8Rq3Dt9tJcxw5NFTt") failed: Refused.
```
but it seems no reason was given.
@tomaka sorry to ping you, but we would like to fix this issue as it's happening on a live network. Do you mind having a look?
I just discovered it's back now with 17 peers, so this error only lasted for a few hours.
This error happened again on one of our collators: https://telemetry.polkadot.io/#list/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b
The `Refused` error happens if the reputation of the source node is too low, or if the node targeted by the request is overloaded and incapable of processing more block requests (to prevent DoS attacks).
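For illustration only, here is a minimal sketch of the two refusal conditions just described. The names and limits are hypothetical, not taken from the actual Substrate implementation; the reputation threshold uses the approximate value mentioned later in this thread.

```rust
// Hypothetical sketch of the two refusal conditions described above; the
// names, the reputation threshold and the capacity limit are illustrative,
// not the actual Substrate code.
struct PeerInfo {
    reputation: i64,
    pending_block_requests: usize,
}

const LOW_REPUTATION: i64 = -(1 << 28); // "too low" (approximate, see below)
const MAX_PARALLEL_REQUESTS: usize = 20; // overload guard against DoS

fn refuse_block_request(peer: &PeerInfo) -> bool {
    // Refuse if the requesting peer's reputation is too low, or if this node
    // is already handling as many block requests as it is willing to.
    peer.reputation < LOW_REPUTATION || peer.pending_block_requests >= MAX_PARALLEL_REQUESTS
}

fn main() {
    let peer = PeerInfo { reputation: -(1 << 29), pending_block_requests: 0 };
    assert!(refuse_block_request(&peer)); // a bad reputation alone is enough to refuse
}
```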
> The `Refused` error happens if the reputation of the source node is too low, or if the node targeted by the request is overloaded and incapable of processing more block requests (to prevent DoS attacks).

How can I check the reputation? The latter shouldn't be our case, because it's our own node and we haven't found any potential DDoS attack yet.
You can use `-l peerset=trace` to see all reputation changes. If there's no error in the logs, it is likely that this isn't related to reputations. Instead, chances are that the CPU or disk the nodes are running on isn't powerful enough.
ok let me test
I saw this: from the disconnected collator's perspective, the others have a pretty bad reputation.
If I check the reputation from the other nodes' perspective, the disconnected node has a reputation of 0.
Reputations start being "bad" below approximately -2^28, if I remember correctly.
The reputations displayed here shouldn't have any consequence.
@tomaka so you are saying that we should increase the VPS size, right? Currently we are using 8 cores, 16 GB RAM, and 10 Gbps for our collators. Do you have any suggestions?
This should be more than enough, as far as I know.
In my opinion, something (i.e. some event) happened and caused the nodes to act very slowly for a few hours. It would be nice to figure out what this "something" is, but we don't have any clue here unfortunately. Getting a more powerful machine would mitigate the problem if/when it happens again.
@tomaka ok, we will investigate it and let you know. Thanks a lot for the help.
@tomaka we didn't find any attack, so I believe that part is fine. We also don't have any error logs yet, but we constantly have one collator disconnected from the others:
https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b
this time it's heiko-collator-0
hi @tomaka I think there is a bug: restarting the nodes helps for some hours, but then they start to lose peers again.
Do you have any advice we can apply? Until now we don't have any error logs; the number of peers just goes down periodically.
This really doesn't look good. What kind of information can I post to help fix this issue?
Please check the CPU usage of the nodes and see if it correlates with the disconnections.
@tomaka I think it looks totally fine. Actually, we didn't find anything strange except `Request to peer PeerId(...) failed: Refused`.
hi @tomaka we are still suffering from this issue; we are restarting our collators every day :) It helps, but not much.
https://telemetry.polkadot.io/#/0x64a1c658a48b2e70a7fb1ad4c39eea35022568c20fc44a6e2e3d0a57aee6053b
As you can see from the following, heiko-collator-4 and heiko-collator-7 got refused by the others today.
I tried to use `reserved-nodes`, but it keeps disconnecting no matter what I do.
After adding `--state-cache-size 1` it looks much better now, but I'm not sure yet if it's resolved.
Our collators have been working fine for the last three days after adjusting `--state-cache-size` to 1, so I think this issue can be closed.
I see the same issue on my network with a few nodes. There was nothing special happening there; a node was working fine for a few days and suddenly this happened. Shouldn't it "just work" without needing to disable the state cache?
I think this should be reopened for further investigation; there is clearly something wrong if it happens on a simple network with no misbehavior.
@tomaka @bkchr @nazar-pc I'm reopening this because there are definitely some issues with the cache.
With what cache? The state cache?
We have a simple and reliable way to reproduce it (though inconvenient).
With a blockchain size (`chains` directory) of ~200 GB and 2 nodes available, a third node joining the network is fairly quickly banned by the other nodes. This happens every time, long before the 100 GB mark. The last attempt failed after just a few gigabytes.
Are there some parameters right now that can be tweaked to fix this behavior?
@nazar-pc is this still relevant?
Likely yes. What we did on our side was fork Substrate to decrease the amount of data a block range response can contain; together with https://github.com/paritytech/substrate/pull/12146 it got much better, but I don't believe it is fundamentally solved.
What limit did you decrease?
CC @altonen. The title here is probably misleading, as this should not be related to the state cache (which doesn't exist anymore today), but rather to some networking issue, given that it improves when the mentioned max body size is decreased.
@nazar-pc could you provide the reproduction you mentioned?
I'm afraid it is no longer there, as we are in between the previous and current testnet right now. I can request the team to try and reproduce it, but it may take a while.
The gist of our observation is the following: a node was making a request, but the response was so big (initially significantly bigger than the default `MAX_BODY_SIZE` due to old implementation logic) that the node wasn't always able to pull it within the keepalive timeout (I believe that is what it's called). With https://github.com/paritytech/substrate/pull/12146 and the `MAX_BODY_BYTES` decrease we achieved a lower maximum expected bandwidth requirement from nodes before the timeout kicks in, and that way the connection wasn't dropped. So the peers are connected to each other and they successfully transfer data, but because a single response takes so long (it shouldn't block the TCP connection though, because messages are fragmented by the muxer anyway), it had the same effect as if the connection had been idling for a long time and timing out.
This was primarily reproducible in slow network conditions, like pulling something from behind the Chinese firewall from a server in the EU, as one of the easier-to-reproduce examples. Those with strong connectivity didn't have issues, but we support a diverse set of users in our protocol and need to be prepared for these less-than-optimal conditions.
> I can request the team to try and reproduce it, but it may take a while.

This would be much appreciated. @dmitry-markin @melekes could you take a look at this?
> So the peers are connected to each other and they successfully transfer data, but because a single response takes so long (it shouldn't block the TCP connection though, because messages are fragmented by the muxer anyway), it had the same effect as if the connection had been idling for a long time and timing out.

Ty! Yeah, maybe our timeout mechanism isn't implemented properly. I don't know exactly how it works, but it should detect that progress is being made and not require some kind of ping-pong (I think that is what we are doing) in these cases.
> > I can request the team to try and reproduce it, but it may take a while.
>
> This would be much appreciated. @dmitry-markin @melekes could you take a look at this?

I can have a look at this and try reproducing the issue with timeouts on slow networks after merging some in-progress PRs.
Possibly a related issue: https://github.com/paritytech/substrate/issues/12105
So far it looks like the theoretical bandwidth requirement, based on a `MAX_BODY_BYTES` of 8 MB and the block request protocol timeout of 20 s, is 8 MB / 20 s ≈ 3.4 Mbit/s. In my (arbitrary) tests syncing the Polkadot chain, 1 Mbit/s was enough to download blocks around block number 14357815, but a 512 kbit/s connection limit led to the block request protocol timing out after 20 seconds, and the peer we are syncing from being disconnected and backed off. Things can theoretically be improved client-side by reducing the number of blocks requested in a single request via `MAX_BLOCKS_TO_REQUEST`. Reducing it from 64 to 32 made sync work in my setup at 512 kbit/s.
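As a back-of-the-envelope check of the figures above, here is a small sketch. The 8 MB limit (interpreted as 8 MiB here) and 20 s timeout come from the comment; the ~30 KiB average block size is a pure guess, only meant to show how the requirement scales with the number of blocks per request.

```rust
// Back-of-the-envelope check of the figures above. MAX_BODY_BYTES and the
// timeout are from the comment; the average block size is a pure guess.
fn required_mbit_per_sec(response_bytes: f64, timeout_secs: f64) -> f64 {
    response_bytes * 8.0 / timeout_secs / 1_000_000.0
}

fn main() {
    let timeout_secs = 20.0;

    // Worst case: a response filled up to MAX_BODY_BYTES (8 MiB here).
    let max_body = 8.0 * 1024.0 * 1024.0;
    println!(
        "worst case: {:.1} Mbit/s",
        required_mbit_per_sec(max_body, timeout_secs)
    );

    // The response size, and hence the needed bandwidth, scales with the
    // number of blocks per request, so halving MAX_BLOCKS_TO_REQUEST roughly
    // halves the requirement.
    let avg_block_bytes = 30.0 * 1024.0; // hypothetical average block size
    for blocks in [64.0_f64, 32.0] {
        println!(
            "{} blocks/request: {:.2} Mbit/s",
            blocks,
            required_mbit_per_sec(avg_block_bytes * blocks, timeout_secs)
        );
    }
}
```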
It looks like there is no solution other than "turning knobs here and there" without modifying libp2p, because libp2p considers the whole request to be timed out if the response was not received in full. On the other hand, I'm not sure that keeping connections alive if they can't meet our timeout requirements and just transmit something really makes sense (e.g., slowloris attack?)
Yes, we have decreased the requirements in our protocol to ~1 Mbps; the unfortunate thing is that those are constants, so we had to fork Substrate to make those changes.
> It looks like there is no solution other than "turning knobs here and there" without modifying libp2p, because libp2p considers the whole request to be timed out if the response was not received in full. On the other hand, I'm not sure that keeping connections alive if they can't meet our timeout requirements and just transmit something really makes sense (e.g., slowloris attack?)

But could we not be smarter by starting with fewer blocks, calculating the bandwidth, and then increasing the number of blocks to request based on what we measured? We would then only need some minimum bandwidth requirement.
> But could we not be smarter by starting with fewer blocks, calculating the bandwidth, and then increasing the number of blocks to request based on what we measured? We would then only need some minimum bandwidth requirement.

In theory we could, but adaptive congestion control at the application layer would be more complicated than just calculating the bandwidth once and setting the block count in the request. We would also have to account for changing network conditions, the influence of parallel data transmission by other protocols / parallel downloads from multiple peers (not sure if we have this), and control stability issues. I.e., the congestion control algorithms in TCP are there for a reason.
We could implement some PoC, but the algorithm must be extremely dumb and straightforward so as not to introduce new issues (I especially worry about the stability of the feedback loop).
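For the record, a minimal sketch of what such a deliberately dumb PoC could look like (purely hypothetical; none of these names or numbers exist in Substrate): derive the next request's block count from the throughput measured on the previous response, clamped to a conservative range.

```rust
// Purely hypothetical PoC sketch of the "dumb" adaptation discussed above:
// pick the next request's block count from the throughput observed on the
// previous response, clamped to a conservative range. Not Substrate code.
use std::time::Duration;

const MIN_BLOCKS: u32 = 8;
const MAX_BLOCKS: u32 = 64;
const REQUEST_TIMEOUT: Duration = Duration::from_secs(20);
const SAFETY_FACTOR: f64 = 0.5; // aim to use only half of the timeout budget

fn next_block_count(prev_response_bytes: u64, prev_elapsed: Duration, prev_blocks: u32) -> u32 {
    // Observed throughput in bytes/s from the last completed response.
    let throughput = prev_response_bytes as f64 / prev_elapsed.as_secs_f64().max(0.001);
    // Bytes we can expect to move before the timeout, with a safety margin.
    let budget = throughput * REQUEST_TIMEOUT.as_secs_f64() * SAFETY_FACTOR;
    // Scale the block count by how far the last response was from that budget.
    let avg_block = prev_response_bytes as f64 / prev_blocks.max(1) as f64;
    ((budget / avg_block.max(1.0)) as u32).clamp(MIN_BLOCKS, MAX_BLOCKS)
}

fn main() {
    // Example: the last response carried 64 blocks (2 MiB) and took 12 s.
    let n = next_block_count(2 * 1024 * 1024, Duration::from_secs(12), 64);
    println!("next request: {n} blocks");
}
```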
I agree with @dmitry-markin here.
There's already so much abstraction below the request-response protocol that implementing adaptive streaming all the way up there sounds quite difficult to get working properly, except for the dumb case. The request-response protocol would have imperfect information about what is happening in the stream, because other protocols are also being multiplexed and consume some amount of the available bandwidth. Also, once the amount of bandwidth is over-estimated and too much data is sent, the connection is closed, which is quite a bit harsher than TCP dropping packets, adjusting the window size, and resending those packets.
Maybe we could make those knobs more accessible so people wouldn't have to fork Substrate.
Good points! It was just a random idea I wanted to write down; I hadn't put that much thought into it!
> Maybe we could make those knobs more accessible so people wouldn't have to fork Substrate.

Yes! This wasn't really possible up to now, but now with sync almost being independent it should work!
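To illustrate the idea, a hypothetical sketch (this is not an existing Substrate API) of grouping the constants discussed in this thread into a configuration struct, with the current hard-coded values as defaults:

```rust
// Hypothetical illustration of "making the knobs available": the constants
// discussed in this thread grouped into a config with today's values as
// defaults. This is not an existing Substrate API.
use std::time::Duration;

#[derive(Debug, Clone)]
pub struct SyncLimits {
    /// Maximum size of a block response body.
    pub max_body_bytes: usize,
    /// Maximum number of blocks requested in a single block request.
    pub max_blocks_to_request: u32,
    /// Timeout applied to a single block request.
    pub request_timeout: Duration,
}

impl Default for SyncLimits {
    fn default() -> Self {
        Self {
            max_body_bytes: 8 * 1024 * 1024,
            max_blocks_to_request: 64,
            request_timeout: Duration::from_secs(20),
        }
    }
}

fn main() {
    // A node operator on a slow link could then lower the limits instead of
    // forking Substrate, roughly as described earlier in the thread.
    let limits = SyncLimits {
        max_body_bytes: 2 * 1024 * 1024,
        max_blocks_to_request: 32,
        ..SyncLimits::default()
    };
    println!("{limits:?}");
}
```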
This issue has been mentioned on Polkadot Forum. There might be relevant details there:
https://forum.polkadot.network/t/april-updates-for-substrate-and-polkadot-devs/2764/1
After upgrading to polkadot-v0.9.10 & polkadot-v0.9.11, we started to observe a `no peers` issue on some of our collators. Turning on `sync=trace`, I found the following logs: