mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/
Apache License 2.0
5.04k stars 991 forks source link

Healthy peers ratio very low #1464

Closed garyyu closed 6 years ago

garyyu commented 6 years ago

In the beginning, I have 106/172 healthy peers:

Sep 02 11:55:55.068 DEBG monitor_peers: 0 connected (0 most_work). all 0 = 0 healthy + 0 banned + 0 defunct
Sep 02 11:55:56.069 DEBG monitor_peers: 2 connected (1 most_work). all 115 = 115 healthy + 0 banned + 0 defunct
Sep 02 11:56:29.974 DEBG monitor_peers: 0 connected (0 most_work). all 115 = 115 healthy + 0 banned + 0 defunct
Sep 02 11:56:30.980 DEBG monitor_peers: 2 connected (1 most_work). all 115 = 115 healthy + 0 banned + 0 defunct
Sep 02 11:56:31.988 DEBG monitor_peers: 3 connected (2 most_work). all 115 = 110 healthy + 0 banned + 5 defunct
Sep 02 11:56:32.993 DEBG monitor_peers: 6 connected (3 most_work). all 172 = 163 healthy + 0 banned + 9 defunct
Sep 02 11:56:53.038 DEBG monitor_peers: 9 connected (6 most_work). all 172 = 121 healthy + 0 banned + 51 defunct
Sep 02 11:57:13.095 DEBG monitor_peers: 10 connected (7 most_work). all 172 = 106 healthy + 0 banned + 66 defunct

For somehow reason, this ratio was decreasing from time to time until 25/214:

Sep 02 22:10:03.031 DEBG monitor_peers: 0 connected (0 most_work). all 214 = 109 healthy + 0 banned + 105 defunct
Sep 02 22:10:23.101 DEBG monitor_peers: 17 connected (10 most_work). all 214 = 109 healthy + 0 banned + 105 defunct
Sep 02 22:10:43.164 DEBG monitor_peers: 17 connected (10 most_work). all 214 = 88 healthy + 0 banned + 126 defunct
...
Sep 02 22:11:54.044 DEBG monitor_peers: 0 connected (0 most_work). all 214 = 88 healthy + 0 banned + 126 defunct
Sep 02 22:12:14.101 DEBG monitor_peers: 17 connected (10 most_work). all 214 = 88 healthy + 0 banned + 126 defunct
Sep 02 22:12:34.154 DEBG monitor_peers: 17 connected (10 most_work). all 214 = 67 healthy + 0 banned + 147 defunct
...
Sep 02 22:24:58.669 DEBG monitor_peers: 17 connected (10 most_work). all 214 = 67 healthy + 0 banned + 147 defunct
Sep 02 22:25:18.744 DEBG monitor_peers: 19 connected (11 most_work). all 214 = 51 healthy + 0 banned + 163 defunct
...
Sep 02 22:31:01.997 DEBG monitor_peers: 0 connected (0 most_work). all 214 = 51 healthy + 0 banned + 163 defunct
Sep 02 22:31:22.040 DEBG monitor_peers: 19 connected (11 most_work). all 214 = 51 healthy + 0 banned + 163 defunct
Sep 02 22:31:42.097 DEBG monitor_peers: 19 connected (11 most_work). all 214 = 33 healthy + 0 banned + 181 defunct
...
Sep 02 23:16:47.186 DEBG monitor_peers: 19 connected (11 most_work). all 214 = 33 healthy + 0 banned + 181 defunct
Sep 02 23:17:07.224 DEBG monitor_peers: 19 connected (11 most_work). all 214 = 25 healthy + 0 banned + 189 defunct
...
Sep 03 07:38:40.448 DEBG monitor_peers: 17 connected (1 most_work). all 214 = 25 healthy + 0 banned + 189 defunct
Sep 03 07:39:00.493 DEBG monitor_peers: 17 connected (1 most_work). all 214 = 25 healthy + 0 banned + 189 defunct
Sep 03 07:39:20.531 DEBG monitor_peers: 17 connected (1 most_work). all 214 = 25 healthy + 0 banned + 189 defunct

will check the root cause later.

antiochp commented 6 years ago

I think peers are "healthy" until we know otherwise - so the number will start off high (and stay high) if we successfully peer with a decent number of active peers.

We're only going to tag peers as "defunct" if we try and peer with them and fail for some reason (and we won't attempt that if we already have 8 active peers, for example).

Doesn't necessarily explain why you're seeing this, but if a bunch of active peers were to drop, forcing you to go find some others, you may see this behavior.

ignopeverell commented 6 years ago

I think the fact that most peers don't have the p2p port open would explain this. They can connect to you, so they're marked healthy. But next time you try to connect to them, you won't be able to, so they're marked unhealthy.

garyyu commented 6 years ago

I start to check this problem today, things become worse in the latest version, always only 7 connected peers:

Sep 10 23:34:00.648 DEBG monitor_peers: on 0.0.0.0:13414, 7 connected (5 most_work). all 75 = 7 healthy + 0 banned + 68 defunct
Sep 10 23:34:00.648 DEBG monitor_peers: 0.0.0.0:13414 ask 94.130.64.25:13414 for more peers
Sep 10 23:34:00.649 DEBG monitor_peers: 0.0.0.0:13414 ask 109.74.202.16:13414 for more peers
Sep 10 23:34:00.650 DEBG monitor_peers: 0.0.0.0:13414 ask 95.216.163.175:13414 for more peers
Sep 10 23:34:00.651 DEBG monitor_peers: 0.0.0.0:13414 ask 198.245.50.26:13414 for more peers
Sep 10 23:34:00.651 DEBG monitor_peers: 0.0.0.0:13414 ask 165.227.109.134:13414 for more peers
Sep 10 23:34:00.652 DEBG monitor_peers: 0.0.0.0:13414 ask 108.196.200.233:13414 for more peers
Sep 10 23:34:00.653 DEBG monitor_peers: 0.0.0.0:13414 ask 52.23.180.2:13414 for more peers
Sep 10 23:34:00.653 DEBG monitor_peers: no preferred peers

Definitely something is wrong.

garyyu commented 6 years ago

More tests:

  1. Install a new node in a Linode US cloud host, also only have 8 connected peers. One issue found:
    Sep 11 07:04:11.647 INFO Received block headers [0a11a5a6, 05eb1a8a, 04c4a8f8, 0001a85d, 0cb567cf, 00288203, 0291c29b, 07b30c30] from 108.196.200.233:13414
    Sep 11 07:04:11.653 INFO Received block headers [024ce368, 08237a3e, 02dd8ef4, 05e63cca, 0605eaf0, 0aca4af4] from 108.196.200.233:13414
    Sep 11 07:04:11.659 INFO Received block headers [058a107d] from 108.196.200.233:13414
    ...
    Sep 11 07:04:52.211 DEBG Client 108.196.200.233:13414 connection lost: Connection(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })

    It's obvious that 108.196.200.233 should be a healthy peer, since we can download headers successfully from it. A ConnectionReset will change this peer state as State::Disconnected, and it should be Ok to reconnect it in next loop (20s).

    fn check_connection(&self) -> bool {
        match self.connection.as_ref().unwrap().error_channel.try_recv() {
            Ok(Error::Serialization(e)) => {
                ...
                false
            }
            Ok(e) => {
                let mut state = self.state.write().unwrap();
                *state = State::Disconnected;
                               <<<< here!
                debug!(LOGGER, "Client {} connection lost: {:?}", self.info.addr, e);
                false
            }
            Err(_) => true,
        }
    }

The problem is: once we set a peer state as Disconnected, we will never try to connect it again:

fn monitor_peers(...){
    ...
    // find some peers from our db
    // and queue them up for a connection attempt
    let new_peers = peers.find_peers(
        p2p::State::Healthy,        <<<< Here!  We only try to reconnect those Healthy state peers.
        p2p::Capabilities::UNKNOWN,
        config.peer_max_count() as usize,
    );
    for p in new_peers.iter().filter(|p| !peers.is_known(&p.addr)) {
        ...
        tx.send(p.addr).unwrap();
    }

And root cause addressed here: In peer.rs, we have a independent and different State definition from p2p/src/store.rs which is used for seed.rs.

in p2p/src/store.rs:

    pub enum State {
        Healthy = 0,
        Banned = 1,
        Defunct = 2,
    }

in p2p/src/peer.rs:

enum State {
    Connected,
    Disconnected,
    Banned,
}

That means Disconnected will be mapped as Banned, and Banned will be mapped as Defunct. What a mess!

ignopeverell commented 6 years ago

Quite a mess indeed! Those should be strongly typed so they can't be mixed up. And probably reconciled in a single enum. Thanks for tracking this down!

garyyu commented 6 years ago

Continue above analysis:

Sorry, the root cause addressed above is not accurate:(
After further reading, I find there's no direct mapping between State in peer.rs and State in p2p/src/store.rs.

After a peer is set as Disconnected state or even Banned state, currently there's no any impact on the store state. means when peer is Disconnected, peer store state is Healthy.

That's correct behavior, but the side effect is: this Disconnected peer is still in the active peers list, and there's no more processing for it, and it will stay forever at that Disconnected state.

will try to give a PR to fix this.

[update at 1 day later]: It's a bug in function peer_clean(), refer to the detail info in https://github.com/mimblewimble/grin/pull/1505#issuecomment-420275872

garyyu commented 6 years ago

The partial fix is ready in https://github.com/mimblewimble/grin/pull/1505 and I wrote the comment there.

Will continue to check the remaining issues, after related PR #1505 and PR https://github.com/mimblewimble/grin/pull/1503 finish review and merging.

garyyu commented 6 years ago

So far all the issues I have seen have been fixed in #1503, #1505 and #1513, so I close this issue.