Closed: garyyu closed this issue 6 years ago
I think peers are "healthy" until we know otherwise - so the number will start off high (and stay high) if we successfully peer with a decent number of active peers.
We're only going to tag peers as "defunct" if we try and peer with them and fail for some reason (and we won't attempt that if we already have 8 active peers, for example).
Doesn't necessarily explain why you're seeing this, but if a bunch of active peers were to drop, forcing you to go find some others, you may see this behavior.
I think the fact that most peers don't have the p2p port open would explain this. They can connect to you, so they're marked healthy. But next time you try to connect to them, you won't be able to, so they're marked unhealthy.
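A minimal sketch of that sequence, to make the flip-flopping concrete. The `StoreState`, `Event` and `next_state` names are hypothetical and not taken from the grin code base, and keeping a banned peer banned is an assumption made for the sketch.

```rust
/// Hypothetical copy of the store-side states (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StoreState {
    Healthy,
    Banned,
    Defunct,
}

/// Hypothetical events that change how we classify a peer.
#[derive(Debug, Clone, Copy)]
pub enum Event {
    /// The peer connected to us (works even if its own p2p port is closed).
    InboundConnected,
    /// We tried to connect to the peer and failed.
    OutboundFailed,
}

/// Returns the state a peer ends up in after `event`.
pub fn next_state(current: StoreState, event: Event) -> StoreState {
    match (current, event) {
        // Assumption: a banned peer stays banned whatever happens.
        (StoreState::Banned, _) => StoreState::Banned,
        (_, Event::InboundConnected) => StoreState::Healthy,
        (_, Event::OutboundFailed) => StoreState::Defunct,
    }
}
```

With this model, a peer behind a closed port goes Healthy on every inbound connection and Defunct on every outbound attempt, which would make the healthy count drift down as we dial out.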
I started looking into this problem today. Things have become worse in the latest version: there are always only 7 connected peers:
Sep 10 23:34:00.648 DEBG monitor_peers: on 0.0.0.0:13414, 7 connected (5 most_work). all 75 = 7 healthy + 0 banned + 68 defunct
Sep 10 23:34:00.648 DEBG monitor_peers: 0.0.0.0:13414 ask 94.130.64.25:13414 for more peers
Sep 10 23:34:00.649 DEBG monitor_peers: 0.0.0.0:13414 ask 109.74.202.16:13414 for more peers
Sep 10 23:34:00.650 DEBG monitor_peers: 0.0.0.0:13414 ask 95.216.163.175:13414 for more peers
Sep 10 23:34:00.651 DEBG monitor_peers: 0.0.0.0:13414 ask 198.245.50.26:13414 for more peers
Sep 10 23:34:00.651 DEBG monitor_peers: 0.0.0.0:13414 ask 165.227.109.134:13414 for more peers
Sep 10 23:34:00.652 DEBG monitor_peers: 0.0.0.0:13414 ask 108.196.200.233:13414 for more peers
Sep 10 23:34:00.653 DEBG monitor_peers: 0.0.0.0:13414 ask 52.23.180.2:13414 for more peers
Sep 10 23:34:00.653 DEBG monitor_peers: no preferred peers
Definitely something is wrong.
More tests:
Sep 11 07:04:11.647 INFO Received block headers [0a11a5a6, 05eb1a8a, 04c4a8f8, 0001a85d, 0cb567cf, 00288203, 0291c29b, 07b30c30] from 108.196.200.233:13414
Sep 11 07:04:11.653 INFO Received block headers [024ce368, 08237a3e, 02dd8ef4, 05e63cca, 0605eaf0, 0aca4af4] from 108.196.200.233:13414
Sep 11 07:04:11.659 INFO Received block headers [058a107d] from 108.196.200.233:13414
...
Sep 11 07:04:52.211 DEBG Client 108.196.200.233:13414 connection lost: Connection(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })
It's obvious that 108.196.200.233 should be a healthy peer, since we can download headers from it successfully. A ConnectionReset will change this peer's state to State::Disconnected, and it should be OK to reconnect to it in the next loop (20s).
fn check_connection(&self) -> bool {
    match self.connection.as_ref().unwrap().error_channel.try_recv() {
        Ok(Error::Serialization(e)) => {
            ...
            false
        }
        Ok(e) => {
            let mut state = self.state.write().unwrap();
            *state = State::Disconnected; // <<<< here!
            debug!(LOGGER, "Client {} connection lost: {:?}", self.info.addr, e);
            false
        }
        Err(_) => true,
    }
}
The problem is: once we set a peer's state to Disconnected, we never try to connect to it again:
fn monitor_peers(...) {
    ...
    // find some peers from our db
    // and queue them up for a connection attempt
    let new_peers = peers.find_peers(
        p2p::State::Healthy, // <<<< Here! We only try to reconnect to peers in the Healthy state.
        p2p::Capabilities::UNKNOWN,
        config.peer_max_count() as usize,
    );
    for p in new_peers.iter().filter(|p| !peers.is_known(&p.addr)) {
        ...
        tx.send(p.addr).unwrap();
    }
    ...
}
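To make the selection logic easier to reason about, here is a self-contained sketch of the two filters in that loop: first by store state, then by "not already known". `PeerData`, `find_by_state` and `candidates` are hypothetical stand-ins, not the real grin API, and addresses are plain strings for simplicity.

```rust
use std::collections::HashSet;

/// Hypothetical copy of the store-side states (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StoreState {
    Healthy,
    Banned,
    Defunct,
}

/// Hypothetical, simplified peer record as kept in the peer db.
#[derive(Debug, Clone)]
pub struct PeerData {
    pub addr: String,
    pub state: StoreState,
}

/// Stand-in for a store lookup: at most `max` peers in the given state.
pub fn find_by_state(store: &[PeerData], state: StoreState, max: usize) -> Vec<PeerData> {
    store
        .iter()
        .filter(|p| p.state == state)
        .take(max)
        .cloned()
        .collect()
}

/// Addresses that would be queued for a connection attempt:
/// Healthy in the store and not already in the `known` set.
pub fn candidates(store: &[PeerData], known: &HashSet<String>, max: usize) -> Vec<String> {
    find_by_state(store, StoreState::Healthy, max)
        .into_iter()
        .filter(|p| !known.contains(&p.addr))
        .map(|p| p.addr)
        .collect()
}
```

In this model a peer is skipped either because its store state is not Healthy or because it is still in the known set, so both conditions are worth checking when a peer never gets retried.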
And here is the root cause: peer.rs has its own State definition, independent of and different from the one in p2p/src/store.rs, which is used by seed.rs.
in p2p/src/store.rs:
pub enum State {
    Healthy = 0,
    Banned = 1,
    Defunct = 2,
}
in p2p/src/peer.rs:
enum State {
    Connected,
    Disconnected,
    Banned,
}
That means Disconnected will be mapped as Banned, and Banned will be mapped as Defunct. What a mess!
Quite a mess indeed! Those should be strongly typed so they can't be mixed up. And probably reconciled in a single enum. Thanks for tracking this down!
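For illustration, a small sketch of why two look-alike enums are easy to mix up and how an explicit conversion avoids it. The enums are renamed `StoreState` and `PeerState` here so they can live in one file, and the mapping chosen in `to_store_state` is an assumption for the sketch, not what the grin code does.

```rust
/// Mirrors the store-side definition quoted above (renamed for the sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StoreState {
    Healthy = 0,
    Banned = 1,
    Defunct = 2,
}

/// Mirrors the peer-side definition quoted above (renamed for the sketch).
/// Without explicit values, Rust numbers the variants 0, 1, 2 in order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PeerState {
    Connected,
    Disconnected,
    Banned,
}

/// Explicit, type-checked conversion. The mapping itself is an assumption:
/// a dropped connection alone does not make a peer unhealthy in the store.
pub fn to_store_state(s: PeerState) -> StoreState {
    match s {
        PeerState::Connected | PeerState::Disconnected => StoreState::Healthy,
        PeerState::Banned => StoreState::Banned,
    }
}
```

If the two were ever exchanged as raw integers, `PeerState::Disconnected as u8` and `StoreState::Banned as u8` would both be 1, which is exactly the kind of mix-up an explicit `match` rules out.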
Continuing the analysis above:
Sorry, the root cause given above is not accurate :(
After further reading, I found there's no direct mapping between State in peer.rs and State in p2p/src/store.rs.
After a peer is set to the Disconnected or even the Banned state, there is currently no impact on the store state. That means when a peer is Disconnected, its store state is still Healthy.
That's correct behavior, but the side effect is that this Disconnected peer is still in the active peers list, nothing processes it any further, and it stays in the Disconnected state forever.
I will try to submit a PR to fix this.
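A rough sketch of the kind of cleanup that would un-stick such a peer, assuming the active peers are kept in a map keyed by address. `drop_inactive` and the map layout are hypothetical and this is not the actual fix: it only shows the idea of evicting non-connected peers so they are no longer "known" and can be queued again.

```rust
use std::collections::HashMap;

/// Hypothetical copy of the peer-side states (illustrative only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PeerState {
    Connected,
    Disconnected,
    Banned,
}

/// Removes every peer that is not Connected from the active map and
/// returns the removed addresses in sorted order.
pub fn drop_inactive(active: &mut HashMap<String, PeerState>) -> Vec<String> {
    // Collect first so the map is not mutated while it is being iterated.
    let mut stale: Vec<String> = active
        .iter()
        .filter(|(_, state)| **state != PeerState::Connected)
        .map(|(addr, _)| addr.clone())
        .collect();
    stale.sort();
    for addr in &stale {
        active.remove(addr);
    }
    stale
}
```

Run once per monitoring loop, something like this would leave only live connections in the active list, so a Disconnected peer whose store state is still Healthy becomes eligible for a new connection attempt.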
[Update, 1 day later]:
It's a bug in the function peer_clean(); for details, see https://github.com/mimblewimble/grin/pull/1505#issuecomment-420275872
A partial fix is ready in https://github.com/mimblewimble/grin/pull/1505, and I wrote a comment there.
I will continue to check the remaining issues after PR #1505 and PR https://github.com/mimblewimble/grin/pull/1503 have been reviewed and merged.
So far, all the issues I have seen have been fixed in #1503, #1505 and #1513, so I'm closing this issue.
In the beginning, I had 106/172 healthy peers:
For some reason, this ratio kept decreasing over time, down to 25/214:
I will check the root cause later.