sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Beacon Node: Unable to recover from network fragmentation #949

Closed. q9f closed this issue 4 years ago

q9f commented 4 years ago

Description

The setup is a custom beacon-chain testnet with nodes in physically distinct network locations [A, B]. The nodes have both ENRs, supplied in the testnet directory as boot-enr.yaml, and (due to the inability to sufficiently network through the ENRs alone) a multi-address command line flag.

The ENR file looks like this: https://gist.github.com/q9f/d6eea3ea3356e41bde81864143284ce9#file-4-boot_enr-yaml

The multi-addresses look like this:

--libp2p-addresses /ip4/51.158.190.99/tcp/9000,/ip4/87.180.203.227/tcp/9000
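
For context, each node is launched roughly like the following (a sketch only; the testnet directory path is illustrative and the exact flag placement is an assumption):

lighthouse bn --testnet-dir ~/.lighthouse/schlesi --libp2p-addresses /ip4/51.158.190.99/tcp/9000,/ip4/87.180.203.227/tcp/9000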

Version

~/.opt/lighthouse master*
❯ git log -n1
commit f6a6de2c5d9e6fe83b6ded24bad93615f98e2163 (HEAD -> master, origin/master, origin/HEAD)
Author: Sacha Saint-Leger <sacha.saint-leger@mail.mcgill.ca>
Date:   Mon Mar 23 09:21:53 2020 +0100

    Become a Validator guides: update (#928)

~/.opt/lighthouse master*
❯ rustc --version
rustc 1.42.0 (b8cedc004 2020-03-09)

Present Behaviour

When the network between A and B fragments, the nodes do not attempt to reconnect to each other, even though both nodes know about each other from both the ENR and the multi-address format.

Furthermore, if both A and B run validators, a chain split occurs, with two different head slots.

It's possible to reconnect the nodes by restarting the beacon nodes manually; however, the chains of A and B are then unable to reorganize to the best head, and the peers ban each other.

Mar 23 10:53:27.534 ERRO Disconnecting and banning peer          timeout: 30s, peer_id: PeerId("16Uiu2HAkxE6kBjfoGtSfhSJAE8oib6h3gM972pAj9brmthTHuuP2"), service: network
Mar 23 10:53:27.612 ERRO Disconnecting and banning peer          timeout: 30s, peer_id: PeerId("16Uiu2HAmPz5xFwZfY4CYN6fxf3Yz6LQaDfzCUrA5qwCoTHoCSiNR"), service: network
Mar 23 10:53:38.000 WARN Low peer count                          peer_count: 1, service: slot_notifier

In this case, deleting either A's or B's beacon chain and temporarily stopping the validators is the only way to recover from this issue.

Expected Behaviour

The beacon chain node should aggressively try to maintain connections. It should keep trying to resolve ENRs and multi-addresses even after a disconnect.

I don't know how it's designed, but I imagine:

  1. having higher timeouts to prevent disconnects in case of short network fragmentation
  2. having repeated connection attempts, even if previous attempts failed, in case of longer or more severe network fragmentation

I don't have enough understanding of consensus to suggest how to handle the reorganization, but a more stable network would certainly help here.

Steps to resolve

Manual restart including reset of the beacon chain data directory.

AgeManning commented 4 years ago

Hey, thanks for registering this.

The source of this problem lies with the CLI ergonomics and the lack of documentation around ENRs. I know this is an issue because even our internal team makes mistakes with this.

There is a branch which contains a variety of network update features (not merged to master yet). Among other things, I've adjusted the mechanics around ENR creation and added log statements when ENRs are created, to avoid some of the confusion. I'll also endeavour to add more documentation around ENR creation.

The problem you are having is due to incorrect ENRs (I've made a PR for a spec change to help address this). When you start a node, we need to put an address into its ENR. If you set the listening address to 0.0.0.0, that address is not valid to put into an ENR, as no one can connect back to you on 0.0.0.0. The specs currently say that we must specify an IP field, so we simply put localhost until the node finds peers and updates the field itself. In this case we log a warning that this is being done.

What is happening is that the ENRs you have put as boot-nodes have an IP address of 127.0.0.1, so any node you direct to connect to them will fail. (I've built a simplistic tool to read ENRs here: https://crates.io/crates/enr-cli which you should be able to install via cargo install enr-cli --version 0.1.0-alpha .)
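
For example, to install the tool and inspect a record (the ENR string below is a placeholder; substitute one of the boot-node records):

cargo install enr-cli --version 0.1.0-alpha
enr-cli -e enr:-Iu4Q...

The output lists the sequence number, node ID, IP and TCP/UDP ports encoded in the record.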

Because the ENRs are invalid, they are not entering your routing table and therefore, when your node tries to find new nodes, no new nodes are found and you don't see the re-connections you are expecting.

We should not see the banning of nodes unless they are on completely different forks. Perhaps the genesis time changed when you restarted the nodes.

With the current master branch, the solution to your problem is to use the --discovery-address CLI param to specify the external IP address that you want other nodes to connect to. This will create valid ENRs (globally) which you can check with the enr-cli tool. Adding these ENRs to your yaml file and removing the --libp2p-addresses flag should give you the stable network you are chasing.
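
A sketch of that workflow (the external IP, testnet directory and ENR string are placeholders, and the exact flag placement is an assumption):

# advertise the node's public IP in its ENR
lighthouse bn --testnet-dir ~/.lighthouse/schlesi --discovery-address 51.158.190.99
# verify the ENR the node now advertises
enr-cli -e enr:-Iu4Q...
# add the verified ENRs to boot_enr.yaml on the other nodes and remove --libp2p-addresses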

paulhauner commented 4 years ago

So after chatting to @agemanning, it seems I shouldn't have recommended the libp2p-addresses flag, since it will only try to connect once on boot and then never attempt again. Specifying the ENRs is indeed required.

Perhaps we could address this issue by adding some help text to the libp2p-addresses flag describing that it’s only for debugging purposes and is not reliable for production.

AgeManning commented 4 years ago

Yep, libp2p-addresses was originally designed only for interacting with clients that didn't have discv5 support during interop. It should be removed in the future, but it's currently useful for debugging and testing libp2p. Agree we should add a big warning for that CLI param (along with the random propagation CLI flag :) )

q9f commented 4 years ago

I recently posted this: https://ethereum.stackexchange.com/questions/80668/how-to-retrieve-ip-address-and-port-from-ethereum-node-records-enr

Quote:

"How do I know that this format is using my node's public IP address and not the local or private IP address?"

Why on earth would we encode addresses that way? I've been doing blockchain operations for years now and being able to connect two nodes with each other is one of the most essential tasks. ENR does not allow me to inspect the IP address and port easily, let alone modify it.

Specifying the ENRs is indeed required.

But I specified the ENR. I just have no idea what the ENR contains, so I cannot easily validate whether it works. Since it does not work, I added the multi-address format.

It should be removed in the future, but it's currently useful for debugging and testing libp2p.

Right, it's useful, so let's remove it in favor of ... what? ENR? We have a massive operational issue here, and sorry for my rant as this is not related to Lighthouse - but if I need a third-party tool to generate and verify or modify the ENR - what's the point? People will keep using multi-address because it does the job.

With the current master branch, the solution to your problem is to use the --discovery-address CLI param to specify the external IP address that you want other nodes to connect to.

This is a valid answer to my ticket and I'm happy to close this. But please consider allowing people to use more human-friendly ways to connect nodes to each other. Don't remove multi-address support. Why not treat libp2p multi-addresses the same as ENR?

Because the ENRs are invalid, they are not entering your routing table and therefore, when your node tries to find new nodes, no new nodes are found and you don't see the re-connections you are expecting.

This worries me most. Even though all 4 nodes I have are successfully connected to each other, they fail to communicate their address to each other.

In Ethereum 1.0, we have the node public key (enode) and we know about their port and IP once they are connected to each other. Apparently, with ENR this appears not to work at all, as you cannot just update the node record with the IP you actually see from outside....

AgeManning commented 4 years ago

Why on earth would we encode addresses that way? I've been doing blockchain operations for years now and being able to connect two nodes with each other is one of the most essential tasks. ENR does not allow me to inspect the IP address and port easily, let alone modify it.

In the updated branch, we print the IP and port along with the base64 encoding so you know what's in the ENR.

Right, it's useful, so let's remove it in favor of ... what? ENR? We have a massive operational issue here, and sorry for my rant as this is not related to Lighthouse - but if I need a third-party tool to generate and verify or modify the ENR - what's the point? People will keep using multi-address because it does the job.

The reason this needs to be removed is precisely because multiaddrs do not do the required job. Multiaddrs are libp2p-specific and do not provide the information that ENRs do. Specifically, given a multiaddr, we cannot add a peer to our discovery mechanism (discv5), which strictly requires the peer's ENR. Connecting to the libp2p port alone, ignoring discovery entirely, is the reason we see the network issues you are experiencing.

As I mentioned, I think more verbose logging and making ENR fields more obvious may help with this.

This is a valid answer to my ticket and I'm happy to close this. But please consider allowing people to use more human-friendly ways to connect nodes to each other. Don't remove multi-address support. Why not treat libp2p multi-addresses the same as ENR?

The simple answer to this is that we can't. There is a small description here: https://github.com/ethereum/eth2.0-specs/blob/dev/specs/phase0/p2p-interface.md#what-is-the-difference-between-an-enr-and-a-multiaddr-and-why-are-we-using-enrs

An ENR can map to many multiaddrs or none. A multiaddr quite simply cannot participate in our discovery mechanism and therefore should not be used to address a peer.

This worries me most. Even though all 4 nodes I have are successfully connected to each other, they fail to communicate their address to each other.

I have a specific issue in the spec to resolve this and a draft PR with solutions: https://github.com/ethereum/eth2.0-specs/issues/1637

In Ethereum 1.0, we have the node public key (enode) and we know about their port and IP once they are connected to each other. Apparently, with ENR this appears not to work at all, as you cannot just update the node record with the IP you actually see from outside....

I haven't been following eth 1.0 updates closely, but I'm told it's transitioning to discv5 also, and therefore will also be using ENRs. Our discovery mechanism does auto-update its external IP if enough connected peers report back the same external IP. Manually modifying an ENR isn't as straightforward as editing a multiaddr, because an ENR needs to be signed, so there is an associated keypair that must sign any modification. This is why it's kind of hidden away from the user behind CLI params in lighthouse.

Ignoring the auto-update feature (which requires connected peers), we currently support the option for users to enter their own external IP via a CLI flag to insert into the ENR, and we will be adding UPnP support, which, if enabled, can obtain the external IP from a local gateway and add the appropriate port-forwards.

Of course we are open to any further suggestions you may have to make this more user-friendly

AgeManning commented 4 years ago

I might also elaborate on the general design/architecture as to why this is the case.

The idea is to build a robust DHT of known peers. Lighthouse regularly crawls the DHT for more peers until it reaches a max peer count. If a peer disconnects, we typically re-connect with other peers from our DHT which have certain properties that are useful to us.

In order for a peer to participate in the DHT, it needs to be referenced by a contactable ENR. In the current design, the idea is that there are a number of correctly established boot-nodes (with their ENRs already set to contactable addresses). When a new node connects to them, they report back the external IP, and the new node will update its external IP automatically (if it is not behind a NAT). Once updated, the boot-nodes and any other node it touches via discovery will add it to the DHT. Everyone becomes contactable and searchable. This process is bypassed when a node connects directly to a node via TCP/multiaddr.

With the spec updates I mentioned, it will be possible to retrieve a connecting node's ENR via a libp2p protocol; however, we still won't add that node into the routing table if the ENR doesn't have a contactable address (i.e. it is behind a NAT).

Once the boot-nodes are set up correctly, the network should in principle work auto-magically, and peers will re-connect if the NAT permits it. ENRs should be able to remain relatively opaque, but I guess any time we print a base64-encoded version, we should also print the useful details contained in the record.

q9f commented 4 years ago

Thanks for taking the time to elaborate. I used --discovery-address and extracted the correct ENRs for the boot-node records. Magically, the fork on node B has now reorganized. I consider this resolved.

Thanks.

q9f commented 4 years ago

It does not work, unfortunately.

I updated my ENRs to

- "enr:-Iu4QJzdLKtWzxA9enEFdn2v7TShRL_ij01vKPjfBvRJl23EFB33IgkXpdBsd4GTg6aPCmnNLVKQoqpAnh7BqaeXQpICgmlkgnY0gmlwhFe0y-OJc2VjcDI1NmsxoQMvVWd9fOrtPlcuw2i52at9VTUkPVHUdJt7Zgs2BU7oEoN0Y3CCIyiDdWRwgiMo"
- "enr:-Iu4QFIChSubIhsU_Ny2gIDtcobF3BaN9VWeR5LanY8J4RoMGEBpVtt-6IyE1hGXYObJTyrgE7p2hsLFML33Zo-30_UCgmlkgnY0gmlwhFe0y-OJc2VjcDI1NmsxoQIk8bKsOD7zqLZqDK9sxrr1bX68N1kzDSEv6M-SBkFRa4N0Y3CCJRyDdWRwgiUc"
- "enr:-Iu4QC7Ttrerniws6O8Q_Ab0_x4NnCymMtdspqRj2obN9oEcA4iRCd70KsmWTM3NoSoGolCqFC8Df9gLuz3cGge6YhACgmlkgnY0gmlwhDOevmOJc2VjcDI1NmsxoQOoVO1B_OdKyErTSiDxCsnZLjYqHmYn8fQT-Cl_6wc49oN0Y3CCIyiDdWRwgiMo"
- "enr:-Iu4QCsoQlhtVe9yQcGB9xBjfW-XC0u3JQGO9zYzTJfmgfmicWUnv3i0zGmewkCU2MvQDZTStkMIAJ-WmScN710UD80CgmlkgnY0gmlwhDOevmOJc2VjcDI1NmsxoQIpnBfn2pHx07D2PK_RHRUtWP35igw6QMdWXFsnu3VDLYN0Y3CCJRyDdWRwgiUc"

These should contain the correct IP addresses for discovery. But apparently there was another network fragmentation which the client was unable to resolve by itself.

enr-cli -e enr:-Iu4QJzdLKtWzxA9enEFdn2v7TShRL_ij01vKPjfBvRJl23EFB33IgkXpdBsd4GTg6aPCmnNLVKQoqpAnh7BqaeXQpICgmlkgnY0gmlwhFe0y-OJc2VjcDI1NmsxoQMvVWd9fOrtPlcuw2i52at9VTUkPVHUdJt7Zgs2BU7oEoN0Y3CCIyiDdWRwgiMo
enr-cli -e enr:-Iu4QFIChSubIhsU_Ny2gIDtcobF3BaN9VWeR5LanY8J4RoMGEBpVtt-6IyE1hGXYObJTyrgE7p2hsLFML33Zo-30_UCgmlkgnY0gmlwhFe0y-OJc2VjcDI1NmsxoQIk8bKsOD7zqLZqDK9sxrr1bX68N1kzDSEv6M-SBkFRa4N0Y3CCJRyDdWRwgiUc
enr-cli -e enr:-Iu4QC7Ttrerniws6O8Q_Ab0_x4NnCymMtdspqRj2obN9oEcA4iRCd70KsmWTM3NoSoGolCqFC8Df9gLuz3cGge6YhACgmlkgnY0gmlwhDOevmOJc2VjcDI1NmsxoQOoVO1B_OdKyErTSiDxCsnZLjYqHmYn8fQT-Cl_6wc49oN0Y3CCIyiDdWRwgiMo
enr-cli -e enr:-Iu4QCsoQlhtVe9yQcGB9xBjfW-XC0u3JQGO9zYzTJfmgfmicWUnv3i0zGmewkCU2MvQDZTStkMIAJ-WmScN710UD80CgmlkgnY0gmlwhDOevmOJc2VjcDI1NmsxoQIpnBfn2pHx07D2PK_RHRUtWP35igw6QMdWXFsnu3VDLYN0Y3CCJRyDdWRwgiUc

ENR Read
Sequence No: 2
Node ID: 0x774d..5463
IP: 87.180.203.227
TCP Port: 9000
UDP Port: 9000

ENR Read
Sequence No: 2
Node ID: 0x3081..ff52
IP: 87.180.203.227
TCP Port: 9500
UDP Port: 9500

ENR Read
Sequence No: 2
Node ID: 0xe097..cdd8
IP: 51.158.190.99
TCP Port: 9000
UDP Port: 9000

ENR Read
Sequence No: 2
Node ID: 0x6977..aa1b
IP: 51.158.190.99
TCP Port: 9500
UDP Port: 9500

So, in the end, my peers were unable to discover each other and validators on chains A and B forked off, unable to reorganize:

Mar 24 10:03:09.706 INFO Syncing state updated                   new_state: Syncing, old_state: Regular, service: sync, service: network
Mar 24 10:03:10.304 WARN Parent block is unknown                 baby_block_slot: 19904, parent_root: 0x61b2…47d3, service: sync, service: network
Mar 24 10:03:10.304 WARN Batch processing failed                 peer: 16Uiu2HAmFqm3XZ8AcT1xGzrGFBWgQbzX28WpFrY7YetCy4XzR3Ny, id: 1, service: sync, service: network
Mar 24 10:03:11.148 WARN Parent block is unknown                 baby_block_slot: 19904, parent_root: 0x61b2…47d3, service: sync, service: network
Mar 24 10:03:11.153 WARN Batch processing failed                 peer: 16Uiu2HAmFqm3XZ8AcT1xGzrGFBWgQbzX28WpFrY7YetCy4XzR3Ny, id: 1, service: sync, service: network
Mar 24 10:03:11.823 WARN Parent block is unknown                 baby_block_slot: 19904, parent_root: 0x61b2…47d3, service: sync, service: network
Mar 24 10:03:11.823 WARN Batch processing failed                 peer: 16Uiu2HAmFqm3XZ8AcT1xGzrGFBWgQbzX28WpFrY7YetCy4XzR3Ny, id: 1, service: sync, service: network
Mar 24 10:03:12.605 WARN Parent block is unknown                 baby_block_slot: 19904, parent_root: 0x61b2…47d3, service: sync, service: network
Mar 24 10:03:12.606 WARN Batch processing failed                 peer: 16Uiu2HAmFqm3XZ8AcT1xGzrGFBWgQbzX28WpFrY7YetCy4XzR3Ny, id: 1, service: sync, service: network
Mar 24 10:03:13.276 WARN Parent block is unknown                 baby_block_slot: 19904, parent_root: 0x61b2…47d3, service: sync, service: network
Mar 24 10:03:13.276 WARN Batch processing failed                 peer: 16Uiu2HAmFqm3XZ8AcT1xGzrGFBWgQbzX28WpFrY7YetCy4XzR3Ny, id: 1, service: sync, service: network
Mar 24 10:03:13.276 WARN Batch failed to download. Dropping chain and downvoting peers, id: 1, service: sync, service: network
Mar 24 10:03:13.276 WARN Disconnecting peer (RPC)                peer_id: PeerId("16Uiu2HAmFqm3XZ8AcT1xGzrGFBWgQbzX28WpFrY7YetCy4XzR3Ny"), reason: Fault, service: sync, service: network
Mar 24 10:03:13.276 ERRO Disconnecting and banning peer          timeout: 30s, peer_id: PeerId("16Uiu2HAmFqm3XZ8AcT1xGzrGFBWgQbzX28WpFrY7YetCy4XzR3Ny"), service: network
Mar 24 10:03:13.276 INFO Syncing state updated                   new_state: Regular, old_state: Syncing, service: sync, service: network
paulhauner commented 4 years ago

Thanks @q9f. Could you please confirm whether the issue you experienced is on master? The batch processing logic (where the error appears to be occurring) has been refactored in the testnet5 branch; it would be useful to know if the error is in the old code or the new code.

Judging from that weird baby_block_slot log, I think it's from master. I recall finding that log weird and removing the "baby".

Whilst we're still in the heavy development phase (i.e., the specification changes monthly), it's not always clear which branch one should be using. For now, I would focus your efforts on testnet5 and assume that master is outdated. Apologies if I didn't communicate this earlier.
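
For reference, switching over would look roughly like this (assuming a standard cargo release build; your exact build steps may differ):

git fetch origin
git checkout testnet5
cargo build --release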

AgeManning commented 4 years ago

This log is interesting. If I understand correctly, the nodes disconnected and formed two chains, A and B.

They must have then re-connected after a period of time in order to get the logs you posted, which looks like a node from A is trying to sync with a node from B.

This looks like a serious bug. On the re-connect, the two nodes should identify themselves as being on different chains and you shouldn't have issues with sync like you do here.

It is hard to fully diagnose with the current logs. If at all possible, could you try to reproduce this with debug logs (it seems you have made this happen at least twice)? It's the --debug-level debug flag before the beacon node sub-command.
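
For example (the testnet directory path is a placeholder):

lighthouse --debug-level debug bn --testnet-dir ~/.lighthouse/schlesi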

q9f commented 4 years ago

They must have then re-connected after a period of time in order to get the logs you posted, which looks like a node from A is trying to sync with a node from B.

No, I manually restarted them to force a reconnect with libp2p-addresses as they didn't attempt to network through the given ENR bootnodes.

For now, I would focus your efforts on testnet5 and assume that master is outdated. Apologies if I didn't communicate this earlier.

Yes, I'm using master. I can recompile testnet5 and see if the behavior differs.

q9f commented 4 years ago

Switched to testnet5 branch and nodes A and B do not talk to each other unless I use the multi-address format.

AgeManning commented 4 years ago

Even in the manual-reset case, we shouldn't see the logs you provided. So if you can reproduce, it would be great to get debug logs for that.

In regards to the network re-connection: this is strange. Currently the logic should go:

  1. Nodes connect to each other via ENR
  2. Nodes randomly disconnect (do you see a "Peer Disconnected" log?)
  3. Each node (depending on running time) should search for new peers, no longer than a 2min interval
  4. When searching for peers, should re-connect to peer that was in the original ENR.

This can fail if the peer doesn't know its peer has actually disconnected. I'm curious to know if you see a "Peer Disconnected" in the logs.

paulhauner commented 4 years ago

Switched to testnet5 branch and nodes A and B do not talk to each other unless I use the multi-address format.

We're using ENR connections on testnet5 without issue. @AgeManning any thoughts about why this might be the case?

Sorry, was looking at an old page. Needed to refresh.

q9f commented 4 years ago

Nodes connect to each other via ENR

They clearly don't. That's the reason I asked Paul on Discord to help me with networking, and he suggested playing around with the libp2p multi-addresses.

Nodes randomly disconnect (do you see a "Peer Disconnected" log?)

There is no disconnect message in the logs. There is not even any hint of a connect. The peer count never goes up. It only goes down at some point.

Each node (depending on running time) should search for new peers, no longer than a 2min interval

That would be nice; how can I support that?

When searching for peers, should re-connect to peer that was in the original ENR.

There is no initial connect, thus no re-connect.

Even in the manual-reset case, we shouldn't see the logs you provided. So if you can reproduce, it would be great to get debug logs for that.

This appears to be a different issue on a different branch, so we should split this out eventually. How do I provide debug logs? With RUST_LOG?

paulhauner commented 4 years ago

I used @q9f's blog post to try and re-create the same testnet. I populated the boot_enr.yaml with the nodes from the above comment and ran the following command using testnet5:

lighthouse --debug-level debug bn --http --testnet-dir ~/.lighthouse/schlesi

I can see the peers being dialed:

Mar 27 00:44:16.317 DEBG Dialing libp2p peer                     address: /ip4/87.180.203.227/tcp/9000, service: network
Mar 27 00:44:16.317 DEBG Dialing libp2p peer                     address: /ip4/87.180.203.227/tcp/9500, service: network
Mar 27 00:44:16.317 DEBG Dialing libp2p peer                     address: /ip4/51.158.190.99/tcp/9000, service: network
Mar 27 00:44:16.317 DEBG Dialing libp2p peer                     address: /ip4/51.158.190.99/tcp/9500, service: network

I can also see at least two of the nodes responding/connecting:

Mar 27 00:44:17.948 DEBG Identified Peer                         protocols: ["/meshsub/1.0.0", "/eth2/beacon_chain/req/status/1/ssz", "/eth2/beacon_chain/req/goodbye/1/ssz", "/eth2/beacon_chain/req/beacon_blocks_by_range/1/ssz", "/eth2/beacon_chain/req/beacon_blocks_by_root/1/ssz", "/ipfs/ping/1.0.0", "/ipfs/id/1.0.0"], observed_address: "/ip4/203.206.0.136/tcp/47740", listening_addresses: ["/ip4/127.0.0.1/tcp/9500", "/ip4/51.158.190.99/tcp/9500", "/ip4/127.0.0.1/tcp/9500", "/ip4/10.18.252.109/tcp/9500", "/ip6/::1/tcp/9500"], agent_version: Lighthouse/v0.1.0-unstable/x86_64-linux, protocol_version: lighthouse/libp2p, peer: 16Uiu2HAkxE6kBjfoGtSfhSJAE8oib6h3gM972pAj9brmthTHuuP2, service: network
Mar 27 00:44:17.968 DEBG Identified Peer                         protocols: ["/meshsub/1.0.0", "/eth2/beacon_chain/req/status/1/ssz", "/eth2/beacon_chain/req/goodbye/1/ssz", "/eth2/beacon_chain/req/beacon_blocks_by_range/1/ssz", "/eth2/beacon_chain/req/beacon_blocks_by_root/1/ssz", "/ipfs/ping/1.0.0", "/ipfs/id/1.0.0"], observed_address: "/ip4/203.206.0.136/tcp/49894", listening_addresses: ["/ip4/127.0.0.1/tcp/9000", "/ip4/127.0.0.1/tcp/9000", "/ip4/10.18.252.109/tcp/9000", "/ip6/::1/tcp/9000"], agent_version: Lighthouse/v0.1.0-unstable/x86_64-linux, protocol_version: lighthouse/libp2p, peer: 16Uiu2HAmT1K3qt5UsEn9CQy7LtijZryxi2BjLKUXrqbCqdv7E4kC, service: network

My node recognises them as fully-synced and then starts to sync from them. Sync doesn't succeed, logging several of these:

Mar 27 00:44:19.221 DEBG Processing batch                        id: 1, service: sync, service: network
Mar 27 00:44:19.223 WARN Parent block is unknown                 parent_root: 0x9e22…4dba, service: sync, service: network

Eventually we see:

Mar 27 00:44:23.144 WARN Batch failed to download. Dropping chain and downvoting peers, id: 1, service: sync, service: network
Mar 27 00:44:23.144 DEBG Peer downvoted                          peer: PeerId("16Uiu2HAmT1K3qt5UsEn9CQy7LtijZryxi2BjLKUXrqbCqdv7E4kC"), service: sync, service: network
Mar 27 00:44:23.144 WARN Disconnecting peer (RPC)                peer_id: PeerId("16Uiu2HAmT1K3qt5UsEn9CQy7LtijZryxi2BjLKUXrqbCqdv7E4kC"), reason: Fault, service: sync, service: network
paulhauner commented 4 years ago

Notably, this all happens prior to the first "Syncing" log, so it appears that there have never been any peers, when in fact peers have come and gone:

Mar 27 00:44:30.001 INFO Syncing                                 est_time: --, speed: --, distance: 57522 slots (1 weeks 1 days), peers: 0, service: slot_notifier
paulhauner commented 4 years ago

Here's a copy of the testnet directory for reproduction of the issue:

https://drive.google.com/file/d/1gUxYzUy9UDI9y-PRSdGnXrCiL6vb18Q6/view?usp=sharing

q9f commented 4 years ago

Is there anything else I can provide or does this mean you are able to reproduce the issue?

I used @q9f's blog post

Note, I changed the genesis delay to 160000. See https://gist.github.com/q9f/d6eea3ea3356e41bde81864143284ce9#file-1-schlesi-yaml-L88

paulhauner commented 4 years ago

Is there anything else I can provide or does this mean you are able to reproduce the issue?

I used @q9f's blog post

Note, I changed the genesis delay to 160000. See https://gist.github.com/q9f/d6eea3ea3356e41bde81864143284ce9#file-1-schlesi-yaml-L88

Thanks, I will update on my end. Nothing more required from me :)

AgeManning commented 4 years ago

I had a look over @paulhauner's logs.

They indicate that the newly started node (Paul's) is trying to connect to a chain that has a different genesis root. The behaviour of not finding the parent root and eventually kicking the peer is the desired behaviour, as these peers are on different, incompatible chains.

An update to the Spec should help isolate these cases, so that peers disconnect immediately and can determine the chain split before hitting sync logic. See here: https://github.com/ethereum/eth2.0-specs/pull/1652

I haven't looked into the reason why Paul's connecting node has a different genesis block than those it's connecting to via the ENRs.

paulhauner commented 4 years ago

Agreed @agemanning. I believe the different root will be due to the genesis root delay change mentioned previously. I’ll reflect this change and see if I can sync the chain.

q9f commented 4 years ago

I added a Teku node which is able to keep the network connected between the local and remote host even after the network fragmentation, like this:

+--------------------------+          +----------------+                     
| A1 |   Lighthouse        |          |      /-  B1    | Lighthouse          
|  |  \  (2 peers)         |          |    /-     |    | (2 peers)           
|  |  |                    |          |  /-       |    |                     
|  |   \                   |          |/-         |    |                     
|  /    \                  |         /-           /    |                     
| |      \                 |       /- |          |     |                     
| |      |                 |     /-   |          |     |                     
| |       \                |   /-     |          |     |                     
| |   Lighthouse           | /-       |      --- B2    | Lighthouse          
| A2  (2 peers)            /-         |  ---/          | (2 peers)           
|  --       \            /-|         ---/              |                     
|    \-      \         /-  |     ---/ |                |                     
|      \-     \      /-    | ---/     |                |                     
|        \-   |    /-    ---/         |                |                     
|          \-  \ --  ---/  |          |                |                     
|            \- C1--/      |          |                |                     
|                          |          |                |                     
|             Teku         |          |                |                     
|            (4 peers)     |          |                |                     
|                          |          |                |                     
|                          |          |                |                     
|  localhost               |          |  remotehost    |                     
+--------------------------+          +----------------+                     

I'm not aware of a different genesis root. I'll upload the config to GitHub so we are all on the same genesis: https://github.com/goerli/schlesi


Edit: having second thoughts about the genesis delay. How does this even work? If each node picks a random time between MIN_GENESIS_DELAY and MIN_GENESIS_DELAY * 2, how can you be certain about the correct genesis event?

paulhauner commented 4 years ago

I'm not aware of a different genesis root. I'll upload the config to GitHub so we are all on the same genesis: https://github.com/goerli/schlesi

Sorry, this was a typo; I should have said "genesis delay", not "genesis root". Since I didn't originally set the genesis delay to 160000, my genesis state had a different genesis time.

Edit: having second thoughts about the genesis delay. How does this even work? If each node picks a random time between MIN_GENESIS_DELAY and MIN_GENESIS_DELAY * 2, how can you be certain about the correct genesis event?

Instead of selecting randomly, we do this:

genesis_time = eth1_timestamp - eth1_timestamp % MIN_GENESIS_DELAY + 2 * MIN_GENESIS_DELAY

So, if MIN_GENESIS_DELAY is 24 hrs, we wait until the next midnight (UTC) that is at least 24 hrs away: an eth1 timestamp at midday Tuesday ends up with an eth2 genesis at Wednesday/Thursday midnight.
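
As a worked example with a hypothetical eth1 timestamp:

MIN_GENESIS_DELAY = 86400                         # 24 hrs
eth1_timestamp    = 1585656000                    # Tue 2020-03-31 12:00:00 UTC
1585656000 - (1585656000 % 86400) = 1585612800    # floored to Tue 00:00:00 UTC
1585612800 + 2 * 86400            = 1585785600    # genesis_time: Thu 2020-04-02 00:00:00 UTC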

q9f commented 4 years ago

This situation is gradually getting worse over time. During the weekend, I wasn't able to do the manual recoveries on the remote hosts, and the chain split became more solidified.

I decided to add more validators to the well-connected remote hosts, but it was too late. The overall network finalization is stuck 300 epochs ago. I don't know how to recover from that.

When I try to get the nodes aligned and reorganized manually - and I still strongly believe this should be the task of the client - I get critical errors from my validators, like this:

Mar 30 23:10:32.641 CRIT Published block was invalid             slot: 66886, message: The SignedBeaconBlock could not be processed and has not been published: InvalidSignature, service: block
Mar 30 23:10:36.023 INFO Successfully published attestation      slot: 66886, committee_index: 0, head_block: 0xf72f…454f, signatures: 1, service: attestation
Mar 30 23:10:38.001 INFO All validators active                   slot: 66886, epoch: 2090, total_validators: 16, active_validators: 16, proposers: 9, service: notifier
Mar 30 23:10:44.867 CRIT Published block was invalid             slot: 66887, message: The SignedBeaconBlock could not be processed and has not been published: InvalidSignature, service: block
Mar 30 23:10:50.001 INFO All validators active                   slot: 66887, epoch: 2090, total_validators: 16, active_validators: 16, proposers: 9, service: notifier
Mar 30 23:10:57.201 CRIT Published block was invalid             slot: 66888, message: The SignedBeaconBlock could not be processed and has not been published: InvalidSignature, service: block

I'm not sure this error message makes sense; why would the signature be invalid?!?

Mar 30 23:10:24.019 INFO Attestation from local validator        slot: 66885, index: 0, source: 1790, target: 1790, service: http
Mar 30 23:10:26.001 INFO Synced to previous epoch                current_slot: 66885, head_slot: 66853, finalized_epoch: 1788, finalized_root: 0xb756…6281, peers: 5, service: slot_notifier
Mar 30 23:10:32.641 WARN Invalid block from local validator      outcome: InvalidSignature, service: http
Mar 30 23:10:36.017 INFO Attestation from local validator        slot: 66886, index: 0, source: 1790, target: 1790, service: http
Mar 30 23:10:38.001 INFO Synced to previous epoch                current_slot: 66886, head_slot: 66853, finalized_epoch: 1788, finalized_root: 0xb756…6281, peers: 5, service: slot_notifier
Mar 30 23:10:44.108 WARN Skipping more than an epoch             request_slot: 66886, head_slot: 66853, service: beacon
Mar 30 23:10:44.867 WARN Invalid block from local validator      outcome: InvalidSignature, service: http
Mar 30 23:10:50.000 INFO Synced to previous epoch                current_slot: 66887, head_slot: 66853, finalized_epoch: 1788, finalized_root: 0xb756…6281, peers: 5, service: slot_notifier
Mar 30 23:10:56.127 WARN Skipping more than an epoch             request_slot: 66887, head_slot: 66853, service: beacon
Mar 30 23:10:57.201 WARN Invalid block from local validator      outcome: InvalidSignature, service: http

Note how the current epoch is 2090 while the finalized epoch is 1788 - does the protocol allow recovery from that? I sense that this is not a client issue but a protocol issue.

paulhauner commented 4 years ago

I still strongly believe this should be the task of the client

I also agree. We have some testing around recovering from forks; however, we haven't done sophisticated testing in this area yet. We're still catching up with the latest spec; fault-finding will come after that.

why would the signature be invalid?!?

I suspect this is related to https://github.com/sigp/lighthouse/issues/923. At some point (v0.9.0 IIRC) the specification removed the one-epoch look-ahead for the block proposer such that we can only know the block proposer whilst we're in the epoch in which they will propose. We were relying upon this look-ahead in several places and it's caused us a few bugs. Unfortunately the spec changed in such a way that it's very difficult for us to test against this bug.

I suspect the bug you're seeing is due to this code:

https://github.com/sigp/lighthouse/blob/93da74e7d630602efc6ccbf731128b90e28ae3a1/beacon_node/rest_api/src/validator.rs#L133-L143

Basically, when your VC asks the BN for the proposal slot for some epoch E, the BN is only skipping to E - 1. Therefore the BN reports an invalid proposer index to the VC. Then, when the VC publishes the block, the BN is forced to skip all the way to E and learns that the VC has the wrong proposer index.

This issue only occurs when there's been a long period since the network has seen a block produced. We don't get this condition on our testnet, so it doesn't cause issues. I have fixed this in some branch that's being merged into v0.2.0. I'll go find my fix and cherry-pick it into a PR that can be merged into testnet5.

I sense that this is not a client issue but a protocol issue.

I think this one is on us. I'm confident that the protocol could resume from this state; I've recovered from some very large ones during testing prior to the v0.9.0 change.

q9f commented 4 years ago

Ok. Thanks for investigating this. My testnet is now in a state where I can no longer recover it. So full success :tada: :smile:

I will create a new testnet soon with better distribution of the validators but will keep the bootnodes running for a couple more days. Let me know if there's anything else I can do to help.

paulhauner commented 4 years ago

My testnet is now in a state where I can no longer recover it.

Congrats! Sorry you had a bumpy experience, though.

I will create a new testnet soon with better distribution of the validators but will keep the bootnodes running for a couple more days.

We're getting to a place now where master/testnet5 is quite far from what we have in v0.2.0, and performing deep fault analysis is feeling a little redundant. That being said, there are still things to be learned and your input is valuable.

Our v0.2.0 branch is passing the simulator, which involves starting a new chain from a deposit contract in ganache-cli, onboarding validators and reaching finality. However, I'm not sure exactly when it will be ready for an actual testnet. Perhaps before you start your new testnet, reach out and see if we're ready for a v0.2.0 testnet?

Thanks again!

q9f commented 4 years ago

Sure. I'll reach out on Discord. :+1:

q9f commented 4 years ago

I can no longer observe this. Networking seems to have improved in that regard. Happy to close this.

AgeManning commented 4 years ago

:+1: