sigp / lighthouse

Ethereum consensus client in Rust
https://lighthouse.sigmaprime.io/
Apache License 2.0

Fallback eth1 node and beacon node #1883

Closed: AgeManning closed this issue 3 years ago

AgeManning commented 3 years ago

It would be nice to add CLI flags for a fallback eth1 node and for the VC a fallback BN.

The logic would be: if the designated eth1 node (e.g. a local eth1 node) doesn't work, we attempt to connect to the fallback eth1 node.

Similarly, the VC would have a backup BN to use if the HTTP connection to its primary BN fails.

blacktemplar commented 3 years ago

I am currently working on the fallback eth1 node and there are two approaches we could take:

In both cases we specify a list of eth1 endpoints, used in the given order as fallback options.

  1. For every call to the eth1 node: if the call to the first endpoint fails, try the remaining endpoints consecutively until one succeeds or all have been tried (see the sketch after this comment).
  2. We use a dynamic eth1 endpoint that is initially set to the first endpoint in the list. When doing regular updates (do_updates) we query all eth1 endpoints (this can be done in parallel) and check whether their sync states differ; if they differ over multiple update calls, we switch the dynamic endpoint to the endpoint that is synced the furthest.

The first approach is clearly more "live", but could result in slower answers if some of the higher-priority endpoints are down for a longer time. Furthermore, the first approach cannot react to the situation where the first endpoint is online but stuck syncing, or something like that.

On the other hand, the second approach has a slower update function (it needs to call all the endpoints) and reacts more slowly to an endpoint going offline. So in the end the question is how fast we want to switch to the fallback endpoint. I think for eth1 endpoints we have enough reaction time to use approach 2, since it shouldn't matter if the eth1 node is offline for a short time.
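For illustration only, here is a minimal sketch of approach 1 in Rust. The names (`Endpoint`, `first_success`, `FallbackError`) and the synchronous request callback are invented for the example; the real eth1 service is async and structured differently.

```rust
/// Hypothetical endpoint type; only the URL matters for this sketch.
#[derive(Debug)]
struct Endpoint {
    url: String,
}

#[derive(Debug)]
enum FallbackError {
    /// Every endpoint failed; the messages are kept for logging.
    AllEndpointsFailed(Vec<String>),
}

/// Try `request` against each endpoint in priority order and return the first
/// successful result, or an error listing every failure.
fn first_success<T, E, F>(endpoints: &[Endpoint], mut request: F) -> Result<T, FallbackError>
where
    E: std::fmt::Display,
    F: FnMut(&Endpoint) -> Result<T, E>,
{
    let mut errors = Vec::new();
    for endpoint in endpoints {
        match request(endpoint) {
            Ok(value) => return Ok(value),
            Err(e) => errors.push(format!("{}: {}", endpoint.url, e)),
        }
    }
    Err(FallbackError::AllEndpointsFailed(errors))
}

fn main() {
    let endpoints = vec![
        Endpoint { url: "http://localhost:8545".into() },
        Endpoint { url: "https://eth1-fallback.example.com".into() },
    ];
    // Simulate the primary (local) endpoint being down and the fallback answering.
    let result = first_success(&endpoints, |e| {
        if e.url.starts_with("http://localhost") {
            Err("connection refused")
        } else {
            Ok(11_234_567u64) // e.g. the latest eth1 block number
        }
    });
    println!("{:?}", result);
}
```

The important property is that the priority order is preserved: the local node is always tried first, and a fallback only answers when everything before it has failed.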

pawanjay176 commented 3 years ago

> first endpoint is online but stuck syncing

I don't think this is a concern since we immediately return a RemoteNotSynced error if the eth1 node is not synced up to the ETH1_FOLLOW_DISTANCE. So we can instantly move to the fallback node if the higher-priority node falls out of sync.

My vote is for the first approach. It's much simpler imo and more in line with the other node(s) being a backup.

AgeManning commented 3 years ago

I agree with @pawanjay176

It sounds like the simpler approach, and we don't have to worry about potential race conditions where a backup node gets a block faster than the main node and appears "more synced".

It sounds like 1. is the simpler and easier solution to implement, and it intuitively fits what a user would expect from a CLI flag named something like --fallback-eth1-endpoint.
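As a purely illustrative aside: Lighthouse's CLI is built on clap, so a flag along those lines could be declared roughly as below (assuming clap 2.x). The flag names follow the suggestion above but are not existing Lighthouse flags at this point.

```rust
use clap::{App, Arg};

fn main() {
    let matches = App::new("fallback-flag-example")
        .arg(
            Arg::with_name("eth1-endpoint")
                .long("eth1-endpoint")
                .value_name("HTTP_ENDPOINT")
                .help("The primary eth1 node to connect to")
                .takes_value(true),
        )
        .arg(
            Arg::with_name("fallback-eth1-endpoint")
                .long("fallback-eth1-endpoint")
                .value_name("HTTP_ENDPOINT")
                .help("An eth1 node to fall back to if the primary endpoint fails")
                .takes_value(true),
        )
        .get_matches();

    // The fallback is optional; absence simply means no backup endpoint.
    if let Some(fallback) = matches.value_of("fallback-eth1-endpoint") {
        println!("fallback eth1 endpoint: {}", fallback);
    }
}
```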

blacktemplar commented 3 years ago

ok then I will continue with variant 1. I am glad you said that, since I had basically already implemented it before coming up with variant 2 ;).

blacktemplar commented 3 years ago

> first endpoint is online but stuck syncing
>
> I don't think this is a concern since we immediately return a RemoteNotSynced error if the eth1 node is not synced up to the ETH1_FOLLOW_DISTANCE. So we can instantly move to the fallback node if the higher-priority node falls out of sync.
>
> My vote is for the first approach. It's much simpler imo and more in line with the other node(s) being a backup.

I am currently writing tests for my fallback implementation, and I think we cannot detect a node that is stuck without comparing it to another node or asserting some chain liveness. The only case in which we return a RemoteNotSynced error is if a remote went backwards by more than ETH1_FOLLOW_DISTANCE. So if a remote just stops reporting new blocks we will never detect that, and the fallbacks will not be used in this case. The only options for detecting that would be to periodically compare with the fallback nodes, or to assume a minimum block issuance rate for the eth1 chain.

In addition to the current solution (basically variant 2 above, but with a local cache needed to avoid checking the network id and chain id before every single request), I think it would be quite easy to have some longer-term cache that stores which endpoints are not synced. This cache could then be updated (by comparing the heads of all given eth1 endpoints) at some longer interval like ETH1_FOLLOW_DISTANCE * SECONDS_PER_ETH1_BLOCK / 2. This should be enough and should not lead to a lot of unnecessary requests...
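For illustration, a rough sketch of such a longer-term cache, using the interval suggested above. The type names, the `head_block_number` callback, and the staleness rule are all invented for the example and are not Lighthouse's actual API.

```rust
use std::collections::HashSet;
use std::time::{Duration, Instant};

// Mainnet spec values, used here only to derive the re-check interval.
const SECONDS_PER_ETH1_BLOCK: u64 = 14;
const ETH1_FOLLOW_DISTANCE: u64 = 2048;

/// Longer-term cache of endpoints that look out of sync.
struct SyncStatusCache {
    last_check: Option<Instant>,
    interval: Duration,
    out_of_sync: HashSet<String>,
}

impl SyncStatusCache {
    fn new() -> Self {
        Self {
            last_check: None,
            // Re-check roughly every ETH1_FOLLOW_DISTANCE * SECONDS_PER_ETH1_BLOCK / 2 seconds.
            interval: Duration::from_secs(ETH1_FOLLOW_DISTANCE * SECONDS_PER_ETH1_BLOCK / 2),
            out_of_sync: HashSet::new(),
        }
    }

    /// If the interval has elapsed, compare every endpoint's head against the
    /// highest head seen and mark unreachable or lagging endpoints as out of sync.
    fn maybe_update<F>(&mut self, endpoints: &[String], mut head_block_number: F)
    where
        F: FnMut(&str) -> Option<u64>,
    {
        let due = self.last_check.map_or(true, |t| t.elapsed() >= self.interval);
        if !due {
            return;
        }
        self.last_check = Some(Instant::now());

        let heads: Vec<(String, Option<u64>)> = endpoints
            .iter()
            .map(|e| (e.clone(), head_block_number(e.as_str())))
            .collect();
        let best = heads.iter().filter_map(|(_, h)| *h).max().unwrap_or(0);

        self.out_of_sync = heads
            .into_iter()
            .filter(|(_, h)| match h {
                None => true, // unreachable
                Some(h) => best.saturating_sub(*h) > ETH1_FOLLOW_DISTANCE,
            })
            .map(|(e, _)| e)
            .collect();
    }

    fn is_usable(&self, endpoint: &str) -> bool {
        !self.out_of_sync.contains(endpoint)
    }
}

fn main() {
    let endpoints = vec![
        "http://localhost:8545".to_string(),
        "https://eth1-fallback.example.com".to_string(),
    ];
    let mut cache = SyncStatusCache::new();
    // Pretend the local node is stuck far behind the fallback's head.
    cache.maybe_update(&endpoints, |e| {
        if e.contains("localhost") { Some(1_000) } else { Some(10_000) }
    });
    for e in &endpoints {
        println!("{} usable: {}", e, cache.is_usable(e));
    }
}
```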

AgeManning commented 3 years ago

Yeah this is a good point.

We probably do want to check against the available nodes to see which is the most synced. I don't want to over-complicate anything, but I would be concerned about the case where one of the fallbacks is faulty (similar to the roughtime issue Prysm had).

I would imagine a lot of users would set a central node (like Infura) as a fallback rather than running many local nodes. And in the case that a single central node is faulty (claims to have a very high block number, say), it would be bad if every lh node used faulty eth1 data because their centralised fallback is faulty.

Do you think we should be concerned about these kinds of edge cases?

blacktemplar commented 3 years ago

hmmm... that's a good point. Since we need to trust the system time anyway, we could do this comparison only if the timestamp of the latest block of the first endpoint is more than ETH1_FOLLOW_DISTANCE * SECONDS_PER_ETH1_BLOCK / 2 in the past, or something like that. When this happens, I would say the safest option would be to download the latest block from each endpoint and compare the totalDifficulty and the timestamp of the blocks. Having a faulty latest block seems to be impossible except for a malicious actor... I will prepare a simple solution for that in the PR and then we can discuss details if needed.
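A rough sketch of that comparison, for illustration: only look at the other endpoints once the primary endpoint's head is older than half the follow distance, then prefer the latest block with the highest totalDifficulty, breaking ties by timestamp. `LatestBlock` and `pick_endpoint` are invented names, not Lighthouse code.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

const SECONDS_PER_ETH1_BLOCK: u64 = 14;
const ETH1_FOLLOW_DISTANCE: u64 = 2048;

/// The only fields of the latest block needed for this comparison.
#[derive(Clone, Debug)]
struct LatestBlock {
    total_difficulty: u128,
    timestamp: u64,
}

/// Keep using the first (primary) endpoint while its head is recent enough;
/// otherwise pick the endpoint with the heaviest, most recent latest block.
fn pick_endpoint(blocks: &[(String, LatestBlock)]) -> Option<String> {
    let now = SystemTime::now().duration_since(UNIX_EPOCH).ok()?.as_secs();
    let max_age = ETH1_FOLLOW_DISTANCE * SECONDS_PER_ETH1_BLOCK / 2;

    let (primary_url, primary_block) = blocks.first()?;
    if now.saturating_sub(primary_block.timestamp) <= max_age {
        // The primary endpoint looks alive; no need to compare the others.
        return Some(primary_url.clone());
    }

    blocks
        .iter()
        .max_by_key(|(_, b)| (b.total_difficulty, b.timestamp))
        .map(|(url, _)| url.clone())
}

fn main() {
    let blocks = vec![
        (
            "http://localhost:8545".to_string(),
            LatestBlock { total_difficulty: 100, timestamp: 0 }, // stale head
        ),
        (
            "https://eth1-fallback.example.com".to_string(),
            LatestBlock { total_difficulty: 150, timestamp: 1_700_000_000 },
        ),
    ];
    println!("{:?}", pick_endpoint(&blocks));
}
```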