ssvlabs / ssv

Secret-Shared-Validator(SSV) for ethereum staking
https://ssv.network
GNU General Public License v3.0

reorg: error - node is not healthy - context deadline exceeded #1327

Open ThomasBlock opened 7 months ago

ThomasBlock commented 7 months ago

Describe the bug

I am experimenting with different SSV setups. Most work fine, but this configuration crashes several times a day. The node restarts on its own, but it would be nice to avoid these crashes completely.

To Reproduce

Ubuntu 22, Eth Docker, geth, nimbus-cl-only, SSV-Node:v1.2.3

Logs

execution-1  | INFO [02-21|16:11:34.196] Chain reorg detected                     number=19,276,837 hash=16adfc..de9b88 drop=1 dropfrom=f2211e..a3d250 add=1 addfrom=3c7661..7af515
execution-1  | INFO [02-21|16:11:34.300] Chain head was updated                   number=19,276,838 hash=3c7661..7af515 root=04cfd6..bcde79 elapsed=103.333878ms
consensus-1  | INF 2024-02-21 16:11:21.158+01:00 Slot end                                   topics="beacnde" slot=8475354 nextActionWait=n/a nextAttestationSlot=-1 nextProposalSlot=-1 syncCommitteeDuties=current head=71c706d7:8475354
consensus-1  | INF 2024-02-21 16:11:23.000+01:00 Slot start                                 topics="beacnde" head=71c706d7:8475354 delay=93us530ns finalized=264852:4dc8c933 peers=49 slot=8475355 sync=synced epoch=264854
consensus-1  | INF 2024-02-21 16:11:33.470+01:00 State replayed                             topics="chaindag" blocks=25 slots=27 current=71c706d7:8475354@8475355 ancestor=7e7ee1df:8475327@8475328 target=fd3d9e94:8475353@8475355 ancestorStateRoot=c6d1174e targetStateRoot=58af0665 found=false assignDur=680ms106us201ns replayDur=6s746ms924us5ns
consensus-1  | NTC 2024-02-21 16:11:34.060+01:00 Updated head block with chain reorg        topics="chaindag" headParent=fd3d9e94:8475353 stateRoot=20fe506f justified=264853:9afd5506 finalized=264852:4dc8c933 isOptHead=false newHead=f703c6d5:8475355 lastHead=71c706d7:8475354
consensus-1  | INF 2024-02-21 16:11:34.062+01:00 Missed multiple heartbeats                 topics="libp2p gossipsub" heartbeat=GossipSub delay=6s937ms303us522ns hinterval=700ms
consensus-1  | INF 2024-02-21 16:11:34.194+01:00 Slot end                                   topics="beacnde" slot=8475355 nextActionWait=n/a nextAttestationSlot=-1 nextProposalSlot=-1 syncCommitteeDuties=current head=f703c6d5:8475355

ssv-node-1  | 2024-02-21T15:11:33.357383Z       error   node is not healthy     {"node": "consensus client", "error": "failed to request syncing: failed to call GET endpoint: Get \"http://consensus:5052/eth/v1/node/syncing\": context deadline exceeded", "errorVerbose": "Get \"http://consensus:5052/eth/v1/node/syncing\": context deadline exceeded\nfailed to call GET endpoint\ngithub.com/attestantio/go-eth2-client/http.(*Service).get\n\t/go/pkg/mod/github.com/attestantio/go-eth2-client@v0.16.3/http/http.go:66\ngithub.com/attestantio/go-eth2-client/http.(*Service).NodeSyncing\n\t/go/pkg/mod/github.com/attestantio/go-eth2-client@v0.16.3/http/nodesyncing.go:30\ngithub.com/bloxapp/ssv/beacon/goclient.(*goClient).Healthy\n\t/go/src/github.com/bloxapp/ssv/beacon/goclient/goclient.go:203\ngithub.com/bloxapp/ssv/nodeprobe.(*Prober).probe.func1\n\t/go/src/github.com/bloxapp/ssv/nodeprobe/nodeprobe.go:96\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nfailed to request syncing\ngithub.com/attestantio/go-eth2-client/http.(*Service).NodeSyncing\n\t/go/pkg/mod/github.com/attestantio/go-eth2-client@v0.16.3/http/nodesyncing.go:32\ngithub.com/bloxapp/ssv/beacon/goclient.(*goClient).Healthy\n\t/go/src/github.com/bloxapp/ssv/beacon/goclient/goclient.go:203\ngithub.com/bloxapp/ssv/nodeprobe.(*Prober).probe.func1\n\t/go/src/github.com/bloxapp/ssv/nodeprobe/nodeprobe.go:96\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
ssv-node-1  | 2024-02-21T15:11:33.358330Z       error   not all nodes are healthy
ssv-node-1  | 2024-02-21T15:11:33.358346Z       fatal   ethereum node(s) are either out of sync or down. Ensure the nodes are healthy to resume.
ssv-node-1  | make: *** [Makefile:102: start-node] Error 1
ssv-node-1  | make: go: No such file or directory
ssv-node-1  | make: go: No such file or directory
ssv-node-1  | make: go: No such file or directory
ssv-node-1  | Build /go/bin/ssvnode
ssv-node-1  | Build /config/config.yaml
ssv-node-1  | Build 
ssv-node-1  | Command --config=/config/config.yaml
ssv-node-1  | Running node on address: *)
ssv-node-1  | 2024-02-21T15:11:36.958535Z       info    starting SSV-Node:v1.2.3-e5a6d711958f5043615bf8f6a95005a6083e714f
ThomasBlock commented 6 months ago

Update on this: three of the systems work fine. One setup with Eth Docker still causes problems: the SSV node keeps restarting although the execution and consensus clients are totally fine.

ssv-node-1  | {"level":"info","time":"2024-03-26T13:17:45.292192Z","name":"execution_client","msg":"fetched registry events","from_block":19518848,"to_block":19518848,"target_block":19518848,"progress":"100.00%","events":0,"took":"5.556743ms"}
ssv-node-1  | {"level":"warn","time":"2024-03-26T13:17:45.601071Z","name":"Controller","msg":"failed to update validators metadata","error":"failed to get validator data from Beacon: failed to get validators data from beacon: failed to obtain validators: failed to obtain chunk: failed to request validators: failed to call GET endpoint: Get \"http://consensus:5052/eth/v1/beacon/states/head/validators?id=0x83d179a1f091fb06
.... 
context deadline exceeded\nfailed to get validators data from beacon\ngithub.com/bloxapp/ssv/protocol/v2/blockchain/beacon.FetchValidatorsMetadata\n\t/go/src/github.com/bloxapp/ssv/protocol/v2/blockchain/beacon/validator_metadata.go:113\ngithub.com/bloxapp/ssv/protocol/v2/blockchain/beacon.UpdateValidatorsMetadata\n\t/go/src/github.com/bloxapp/ssv/protocol/v2/blockchain/beacon/validator_metadata.go:71\ngithub.com/bloxapp/ssv/operator/validator.(*controller).UpdateValidatorMetaDataLoop\n\t/go/src/github.com/bloxapp/ssv/operator/validator/controller.go:858\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598\nfailed to get validator data from Beacon\ngithub.com/bloxapp/ssv/protocol/v2/blockchain/beacon.UpdateValidatorsMetadata\n\t/go/src/github.com/bloxapp/ssv/protocol/v2/blockchain/beacon/validator_metadata.go:73\ngithub.com/bloxapp/ssv/operator/validator.(*controller).UpdateValidatorMetaDataLoop\n\t/go/src/github.com/bloxapp/ssv/operator/validator/controller.go:858\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"}
ssv-node-1  | {"level":"info","time":"2024-03-26T13:18:07.316673Z","name":"P2PNetwork","msg":"Verified handshake nodeinfo","selfPeer":"16Uiu2HAmVd4pPhEMR5RnAboqftLzuqZHiZy1CzLpdRP9qbFsoWxh","peer_id":"16Uiu2HAm6bqkkkkKpnHqzgrxjmJ57mNCe9Ph4MN7LdhkPedKG77h","peer_id":"16Uiu2HAm6bqkkkkKpnHqzgrxjmJ57mNCe9Ph4MN7LdhkPedKG77h","metadata":{"NodeVersion":"v1.3.2-97d20e67d83cad1fd0d8d12ff179f7a9fe090daa","ExecutionNode":"","ConsensusNode":"","Subnets":"f5ffffffffbe3ebbdbf7fffbff766c6b"},"networkID":"0x00000000"}
ssv-node-1  | {"level":"info","time":"2024-03-26T13:18:07.852493Z","name":"execution_client","msg":"fetched registry events","from_block":19518849,"to_block":19518849,"target_block":19518849,"progress":"100.00%","events":0,"took":"660.656µs"}
ssv-node-1  | {"level":"error","time":"2024-03-26T13:18:07.874067Z","msg":"node is not healthy","node":"consensus client","error":"failed to obtain node syncing status: failed to call GET endpoint: Get \"http://consensus:5052/eth/v1/node/syncing\": context deadline exceeded"}
ssv-node-1  | {"level":"error","time":"2024-03-26T13:18:07.874151Z","msg":"not all nodes are healthy"}
ssv-node-1  | {"level":"fatal","time":"2024-03-26T13:18:07.874164Z","msg":"ethereum node(s) are either out of sync or down. Ensure the nodes are healthy to resume."}
This is Eth Docker v2.8.0.0

ssvnode version v1.3.2-97d20e67d83cad1fd0d8d12ff179f7a9fe090daa

beacon-chain version Prysm/v5.0.1/a1a81d1720a0a3b850992d4825d0a023baa8e65a. Built at: 2024-03-08 20:21:37+00:00

validator version Prysm/v5.0.1/a1a81d1720a0a3b850992d4825d0a023baa8e65a. Built at: 2024-03-08 20:22:56+00:00

besu/v24.3.0/linux-x86_64/openjdk-java-17

mev-boost v1.7.1
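
For anyone trying to narrow this down: the fatal exit in both excerpts follows a timed-out `GET http://consensus:5052/eth/v1/node/syncing`. The following is a minimal sketch, not the SSV node's own probe, that issues the same request with a deadline and prints how long it took. The URL comes from the logs above; the function names and the 5-second timeout are assumptions made for illustration, and the struct covers only the `head_slot`, `sync_distance`, and `is_syncing` fields of the standard Beacon API response.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// syncingStatus holds the fields of the standard Beacon API
// /eth/v1/node/syncing response that matter for a health check.
type syncingStatus struct {
	Data struct {
		HeadSlot     string `json:"head_slot"`
		SyncDistance string `json:"sync_distance"`
		IsSyncing    bool   `json:"is_syncing"`
	} `json:"data"`
}

// parseSyncing decodes the JSON body returned by the endpoint.
func parseSyncing(body []byte) (syncingStatus, error) {
	var s syncingStatus
	err := json.Unmarshal(body, &s)
	return s, err
}

// probeSyncing issues one GET with a deadline and reports how long it took.
// The name and signature are hypothetical, not part of any library.
func probeSyncing(ctx context.Context, baseURL string, timeout time.Duration) (syncingStatus, time.Duration, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	start := time.Now()
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, baseURL+"/eth/v1/node/syncing", nil)
	if err != nil {
		return syncingStatus{}, 0, err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		// A slow beacon node surfaces here as "context deadline exceeded".
		return syncingStatus{}, time.Since(start), err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return syncingStatus{}, time.Since(start), err
	}
	if resp.StatusCode != http.StatusOK {
		return syncingStatus{}, time.Since(start), fmt.Errorf("unexpected status %s", resp.Status)
	}
	s, err := parseSyncing(body)
	return s, time.Since(start), err
}

func main() {
	// URL taken from the logs above; the 5s timeout is an assumed value.
	s, took, err := probeSyncing(context.Background(), "http://consensus:5052", 5*time.Second)
	if err != nil {
		fmt.Printf("probe failed after %s: %v\n", took, err)
		return
	}
	fmt.Printf("ok in %s: head_slot=%s is_syncing=%t\n", took, s.Data.HeadSlot, s.Data.IsSyncing)
}
```

Running this from inside the same Docker network around the time of a restart would show whether the beacon endpoint itself is slow to answer or whether the delay only appears from the SSV container.
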
ThomasBlock commented 3 months ago

Update: it was okay for a long time, and 6 of 7 nodes work fine. Now this setup has problems: 10 restarts of the SSV node per day bring performance down to 86%, all while the consensus and execution clients are fine.

Image: yellow = good node, blue = bad node

ethd version
This is Eth Docker v2.9.2.0
ssvnode version v1.3.4-39046e4aa45ab4b2d8bd48af41d62bc5858c59ad
beacon-chain version Prysm/v5.0.3/38f208d70dc95b12c08403f5c72009aaa10dfe2f. Built at: 2024-04-04 18:29:14+00:00
2024-06-16 15-09-07.7946|Nethermind starting initialization.
2024-06-16 15-09-07.8395|Client version: Nethermind/v1.26.0+0068729c/linux-x64/dotnet8.0.4
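
To put a number on the restarts, one option is to count the `"level":"fatal"` entries per day in the SSV node's JSON logs, since each restart shown above is preceded by one. The sketch below assumes the JSON log format from the v1.3.2 excerpt (a container prefix such as `ssv-node-1  | ` followed by an object with `level` and `time` fields); the function name is hypothetical, and lines that are not complete JSON, including the tab-separated v1.2.3 format, are skipped.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"sort"
	"strings"
)

// logEntry covers only the two fields needed for counting.
type logEntry struct {
	Level string `json:"level"`
	Time  string `json:"time"`
}

// countFatalPerDay returns the number of fatal-level entries per
// calendar day (YYYY-MM-DD, taken from the "time" field).
func countFatalPerDay(lines []string) map[string]int {
	counts := make(map[string]int)
	for _, line := range lines {
		// Skip the container prefix by jumping to the first '{'.
		i := strings.IndexByte(line, '{')
		if i < 0 {
			continue
		}
		var e logEntry
		if err := json.Unmarshal([]byte(line[i:]), &e); err != nil {
			continue // truncated or non-JSON line
		}
		if e.Level == "fatal" && len(e.Time) >= 10 {
			counts[e.Time[:10]]++
		}
	}
	return counts
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	// Log lines with stack traces can be long, so raise the line limit.
	scanner.Buffer(make([]byte, 0, 64*1024), 4*1024*1024)

	var lines []string
	for scanner.Scan() {
		lines = append(lines, scanner.Text())
	}
	if err := scanner.Err(); err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}

	counts := countFatalPerDay(lines)
	days := make([]string, 0, len(counts))
	for d := range counts {
		days = append(days, d)
	}
	sort.Strings(days)
	for _, d := range days {
		fmt.Printf("%s %d\n", d, counts[d])
	}
}
```

The program reads from standard input, so the SSV container's logs can be piped into it directly; comparing the per-day counts between a good node and the bad one would show whether the performance drop tracks the restarts.
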
