prysmaticlabs / prysm

Go implementation of Ethereum proof of stake
https://www.offchainlabs.com
GNU General Public License v3.0
3.46k stars 985 forks source link

Validator dies when grpc connection returns 503 #6669

Closed stefa2k closed 3 years ago

stefa2k commented 4 years ago

🐞 Bug Report

Description

I'm using haproxy to have failovers for all services including beacon-chain. All my beacon-chain nodes were unhealthy and therefore not able to serve grpc requsts, so haproxy returns http code 503 to validator's requests (expected behaviour). This causes the validator service to die (unexpected behaviour). The validator's restart can cause heavy cpu loads due to decrypting keys.

HTTP code 503 indicates the service is currently not able to serve any requests.

Solution with --grpc-retries

  1. I tried setting it to -1 but it didn't help.
  2. Setting this to 4294967295 didn't help either.

Validator service dies within 1 second of querying beacon service.

Has this worked before in a previous version?

πŸ”¬ Minimal Reproduction

Connect validator to an endpoint which returns http code 503 on requests.

πŸ”₯ Error

Validator log:

time="2020-07-21 19:02:09" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: received the unexpected content-type "text/html"" prefix=validator

curl to haproxy:

$ curl -I --http2 http://localhost:14000
HTTP/1.1 503 Service Unavailable
content-length: 107
cache-control: no-cache
content-type: text/html
connection: close

🌍 Your Environment

Operating System:

  
Ubuntu 20.04
  

What version of Prysm are you running? (Which release)

  
docker tag `HEAD-367738`
  

haproxy config

##################
# beacon chain
frontend beacon_fe
        mode http

        bind *:14000 alpn h2,http/1.1

        timeout client 15s

        default_backend beacon_be

backend beacon_be
        mode http

        option httpchk GET /healthz HTTP/1.0
        http-check expect ! rstring '(?i)error'

        default-server check port 8480 proto h2 inter 1s fall 1 rise 3 init-addr none
        server srv1 beacon:4000 resolvers nocachedns
        server backup_h x.y.z.a:4000 backup
        server backup_i x.y.z.b:4000 backup
prestonvanloon commented 4 years ago

Thanks for the report! gRPC retries were not configured correctly for gRPC streams, which is what the validator uses on initial contact for the beacon node.

I've added PR #6670 which includes the following fixes and improvements:

So now the new behavior will be: "On initial contact of beacon node, open a stream to wait on the beacon node to report fully synced. If the connection is UNAVAILABLE or RESOURCE_EXHAUSTED, retry up to --grpc-retries with --grpc-retry-delay between each attempt. If all attempts have been exhausted, log a fatal condition and stop the process".

If you find that satisfactory for your use case, then please close this issue. If not, please advise a recommended solution to resolve your use case.

stefa2k commented 4 years ago

Thanks @prestonvanloon

Just a quick question, is -1 also supported?

stefa2k commented 4 years ago

I run docker tag HEAD-62ae22 now and it looks like setting --grpc-retries to 18446744073709551615 in my configs doesn't have any effect, the logs still look the same:

time="2020-07-22 17:21:40" level=info msg="Waiting for beacon chain start log from the ETH 1.0 deposit contract" prefix=validator
validator     | time="2020-07-22 17:21:44" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: received the unexpected content-type "text/html"" prefix=validator
validator     | time="2020-07-22T17:21:47Z" level=info msg="New calculated roughtime offset is -38419284 ns" prefix=roughtime
validator     | time="2020-07-22 17:21:47" level=warning msg="Enabled validator slashing protection." prefix=flags
validator     | time="2020-07-22 17:21:47" level=warning msg="Enabled validator attestation and block slashing protection using an external slasher." prefix=flags
prestonvanloon commented 4 years ago

Can you try a different value other than max uint? Can you please attach logs at debug level?

When I tried it with --grpc-retries=10000 and --verbosity=debug, i could see the following logs to indicate that it was retrying.

[2020-07-22 14:20:13] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:14] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:15] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:16] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:17] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:18] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:19] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:20] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:21] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:22] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:23] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:24] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:25] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart
[2020-07-22 14:20:26] DEBUG gRPC stream started. backend=[] method=/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart

There may be an issue with reading from a config file, but that would be separate from this issue.

stefa2k commented 4 years ago

Tried it with the params you mentioned (10000 and debug) and got the following:

validator     | time="2020-07-23 10:43:48" level=info msg="Checking DB" databasePath="/home/.eth2" prefix=node
validator     | time="2020-07-23 10:43:48" level=info msg="Starting validator node" prefix=node version="Prysm/v1.0.0-alpha.16/62ae22e3f953a1800b0018cd6eb499021f0af917. Built at: 2020-07-22 16:24:11+00:00"
validator     | time="2020-07-23 10:43:48" level=debug msg="Starting 3 services: [*prometheus.Service *slashingprotection.Service *client.ValidatorService]" prefix=registry
validator     | time="2020-07-23 10:43:48" level=debug msg="Starting service type *prometheus.Service" prefix=registry
validator     | time="2020-07-23 10:43:48" level=debug msg="Starting service type *slashingprotection.Service" prefix=registry
validator     | time="2020-07-23 10:43:48" level=debug msg="Starting service type *client.ValidatorService" prefix=registry
validator     | time="2020-07-23 10:43:48" level=warning msg="You are using an insecure slasher gRPC connection! Please provide a certificate and key to use a secure connection."
validator     | time="2020-07-23 10:43:48" level=debug msg="Successfully started slasher gRPC connection"
validator     | time="2020-07-23 10:43:48" level=debug msg="Starting prometheus service" address="0.0.0.0:8081" prefix=prometheus
validator     | time="2020-07-23 10:43:48" level=warning msg="You are using an insecure gRPC connection. If you are running your beacon node and validator on the same machines, you can ignore this message. If you want to know how to enable secure connections, see: https://docs.prylabs.network/docs/prysm-usage/secure-grpc" prefix=validator
validator     | time="2020-07-23 10:43:48" level=debug msg="gRPC stream started." backend=[] method="/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart"
validator     | time="2020-07-23 10:43:48" level=info msg="Waiting for beacon chain start log from the ETH 1.0 deposit contract" prefix=validator
validator     | time="2020-07-23 10:43:49" level=debug msg="gRPC stream started." backend=[] method="/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart"
validator     | time="2020-07-23 10:43:50" level=debug msg="gRPC stream started." backend=[] method="/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart"
validator     | time="2020-07-23 10:43:51" level=debug msg="gRPC stream started." backend=[] method="/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart"
validator     | time="2020-07-23 10:43:52" level=debug msg="gRPC stream started." backend=[] method="/ethereum.eth.v1alpha1.BeaconNodeValidator/WaitForChainStart"
validator     | time="2020-07-23 10:43:52" level=fatal msg="Could not determine if beacon chain started: could not receive ChainStart from stream: rpc error: code = Unavailable desc = Service Unavailable: HTTP status code 503; transport: received the unexpected content-type "text/html"" prefix=validator

It looks like its not read off of the config file. But every other setting (e. g. verbosity: debug) gets read correctly.

rauljordan commented 4 years ago

@prestonvanloon please triage

rauljordan commented 3 years ago

@stefa2k this actually seems related to #7339

rauljordan commented 3 years ago

This is since resolved in v1.2.0