prysmaticlabs / prysm

Go implementation of Ethereum proof of stake
https://www.offchainlabs.com
GNU General Public License v3.0
3.47k stars 1k forks source link

Resource Usage Spikes (CPU & Memory) to Context Deadline Exceeded causing numerous missed attestations #8483

Closed Novop closed 3 years ago

Novop commented 3 years ago

Did not have any of these issues in Genesis plus a month after.

Now I am getting them every around 2 hours, causing several missed attestations.

I have some worrying stats in regards to the Beacon memory and cpu usage. (graphs at bottom)

Sometimes i will only get an error in the validator log.
Othertimes, I will get an error in both the validator and the beacon chain log, as below:

πŸ”₯ Error

Beacon Chain Log


Feb 20 17:26:36 ethstaker bash[739]: time="2021-02-20 17:26:36" level=warning msg="Could not update head" error="context deadline exceeded" prefix=blockchain

Validator Log


Feb 20 17:25:47 eths validator-v1.2.2[1138]: time="2021-02-20 17:25:47" level=error msg="Could not request attestation to sign at slot" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator pubKey=0xa36.. slot=584826
Feb 20 17:25:59 eths validator-v1.2.2[1138]: time="2021-02-20 17:25:59" level=error msg="Could not request attestation to sign at slot" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator pubKey=0x814.. slot=584827
Feb 20 17:26:11 eths validator-v1.2.2[1138]: time="2021-02-20 17:26:11" level=error msg="Could not submit slot signature to beacon node" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator slot=584828
Feb 20 17:26:11 eths validator-v1.2.2[1138]: time="2021-02-20 17:26:11" level=error msg="Could not request attestation to sign at slot" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator pubKey=0x98b..slot=584828

🌍 Your Environment

Operating System:

DigitalOcean VPS Ubuntu 2 Cores ( Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz ( Cache Size 4MB) ) 16GB Ram 300GB SSD**

Infura endpoint (no geth)

What is running: 1) Validator 2) Beacon 3) Prometheus 4) Node Exporter 5) Loki 6) Grafana 7) Json_exporter

What version of Prysm are you running? (Which release) 1.2.2

Anything else relevant (validator index / public key)?

Beacon File


# The eth2 beacon chain service (part of systemd)
# file: /etc/systemd/system/beacon-chain.service
[Unit]
Description=Beacon Node ETH2 Prysm Client
After=network-online.target

[Service]
User=beacon
Group=beacon
Type=simple
Restart=always
RestartSec=5
ExecStart=/bin/bash -c '/usr/local/bin/beacon-chain-v1.2.2 --config-file /var/lib/prysm/beacon/prysm-beacon.yaml'

[Install]
WantedBy=multi-user.target
Alias=beacon

Validator File


[Unit]
Description=Validator ETH2 Client
After=network-online.target

[Service]
User=validator
Group=validator
Type=simple
Restart=always
RestartSec=5
TimeoutSec=infinity
ExecStartPre=/bin/sleep 780
ExecStart=/usr/local/bin/validator-v1.2.2 --config-file /var/lib/prysm/validator/prysm-validator.yaml

[Install]
WantedBy=multi-user.target

Graphs: Memory Usage Note: the 2 sudden drops, are: 20/2 After updating to version 1.2.2 i did a restart, hence the drop 21/2 I have been getting too many missed attestations, so rebooted the system

image

CPU Usage

image
Novop commented 3 years ago

As of today, the issues continue. I have noticed the spikes in CPU usage taking place, and then the beacon being rendered unusable causing several missed attestations.

I am still unsure whether during this time, the beaconchain is being restarted or not.. In the beacon logs, I only get the 'could not update head' error log, as i indicated above. (no log of 'exit')

However, Grafana sends me an email stating that the process has been restarted, while in addition, every beacon related stats seems to go offline. for example its cpu & memory allocation, as well as every other stat i can extract from node exporter.

Here;s the spikes I have noticed today. For a few hours after the reboot, it was running fine. Then just chaos. Spikes and missed attestations every 1-2 hours.

CPU Usage (last 24hrs)

image

CPU Usage (last 6 hours)

image

Memory Usage (last 24 hours)

image
ahylick commented 3 years ago

I too am experiencing these spikes in CPU and mem as well as the missed attestations. I was not seeing these issues on v1.0.5 or 1.1.0, but from 1.2.0 onwards, I started experiencing these issues. My work around is to simply restart my services when I see the spiking start to happen.

I am currently running v1.3.0.

Novop commented 3 years ago

Hello ahylick,

We have identified the issue to be lack of CPU power. I was only running 2 cores, and since I have more than 1 validator, at times my machine would have a hard time keeping up. This was not an issue while the network was small with a few thousand nodes in the network. Nowadays however the network has grown, and CPU needs to have enough processing power to keep up. I would suggest that you double check your CPU specs.

ahylick commented 3 years ago

Ack. thanks Novop!

KevinVitale commented 3 years ago

Hello ahylick,

We have identified the issue to be lack of CPU power. I was only running 2 cores, and since I have more than 1 validator, at times my machine would have a hard time keeping up. This was not an issue while the network was small with a few thousand nodes in the network. Nowadays however the network has grown, and CPU needs to have enough processing power to keep up. I would suggest that you double check your CPU specs.

FWIW: I received 2 of these "Could not retrieve pre state [...] context deadline exceeded" errors within 10 to 25 minutes following a beacon DB backup invocation. I'm aware that's there's an issue with high memory usage when performing a backup; however, you can see that there's a few massive CPU spikes, as well.

The CPU is an AMD EPYC 24c/48t, which should be more than enough. Red line is CPU usage.

Screen Shot 2021-02-28 at 5 35 27 PM

UPDATE: for debugging purposes, I'm sharing that following the beacon DB backup, I continue to see this error coinciding with high CPU spikes despite my system having 24c/48t to work with.

Screen Shot 2021-02-28 at 10 28 49 PM

Timeline

prestonvanloon commented 3 years ago

Backup related spikes are tracked here #8486

skillful-alex commented 3 years ago

Error "Could not request attestation to sign at slot" reproduced in version 1.3.2. In this moment:

amirderakh commented 3 years ago

Can confirm similar issue. After about a week the beacon-chain.exe uses all my resources (CPU/RAM). This results in failed attestations. Restarting the client only helps for a few hours, if at all. Memory leak? A complete reboot of the machine is necessary. This happened 3 times already.

Using Prysm 1.3.3. (Same issue in 1.2.x) Windows Server 2016 Infura endpoint (no geth)

Most of the servers filesystem is compressed (relevant?)

nisdas commented 3 years ago

@amirderakh that shouldn't be happening at all. Any chance you are performing regular beacon live db backups ?

nisdas commented 3 years ago

@skillful-alex During that period, were you performing live db backups ?

amirderakh commented 3 years ago

@nisdas No I haven't started a backup and I think they aren't executed automatically. I can give you a log file if I knew the folder where they're stored.

nisdas commented 3 years ago

Can confirm similar issue. After about a week the beacon-chain.exe uses all my resources (CPU/RAM). This results in failed attestations. Restarting the client only helps for a few hours, if at all. Memory leak? A complete reboot of the machine is necessary. This happened 3 times already.

This is pretty bizarre, if restarting the client only helps for a few hours, the issue might be something else. Memory usage should be around 1.3 - 3 gb, depending on the number of validators you run.

There isn't any memory leak for windows that I am aware of, if a reboot of your machine 'fixes' it, it might be some OS related issue here. Any custom settings you are running for Windows Server 2016 ? Or is it a vanilla build.

amirderakh commented 3 years ago

Yes I thought the same. It's just that after closing and opening the beacon-chain.exe after such event, it takes easily 5 to 10 minutes till it's ready again and beacon-chain.exe uses all CPU and RAM in that time and afterwards. Then I need to reboot. So not sure what to check in the Windows installation, apart from hardened security it has no special features. I've already decompressed the beaconchain.db but it did nothing. Also made exclusions from virus scanning.

skillful-alex commented 3 years ago

No, this "deadline exceeded" is not related to backups.

And I don't see the leak. The node has been working for almost 2 weeks without a reboot, and missed attestations happen from time to time. image But if you zoom in at the β€œfailed” moment, you will see 100% CPU utilization.

nisdas commented 3 years ago

@amirderakh oh that is very wierd indeed. A shut down should at most take 10 seconds, a few minutes might be indicative of something else.

@skillful-alex By any chance are you running a significant amount of validators ? There might be issues with processing all attestations in a subnet, and you are subscribed to multiple ones.

nisdas commented 3 years ago

Ah k, in that case the load of supporting all those subnets must be quite abit. 4 cores might not be enough to support more than 30 subnets, which would lead to the spikes that you do see. As the network grows bigger, it becomes a bit more intensive to support all the subnets for a node. You could try to see if adding more cores will help here.

amirderakh commented 3 years ago

@nisdas You're right it seems my issue is different from this thread. However, I saw someone on Reddit with the same issue. My setup is not too unique, I'm sure there are more people encountering this.

Today I'm bringing log files. Generally: At one point the beacon-chain.exe is using more and more CPU till the system becomes unusable. I saw it writing huge amounts of data in that situation. In the screenshot it wrote 2GB in 3 hours. Then attestations fail. Restarting the process only helps for a few hours. Only restarting the whole machine lets Prysm run stable again (for some days). I disabled file compression and virus check.

b

[2021-03-25 09:07:16] ERROR validator: Could not check if validators are exited error=rpc error: code = Unavailable desc = Syncing to latest head, not ready to respond [2021-03-25 09:07:19] ERROR validator: rpc error: code = DeadlineExceeded desc = context deadline exceeded [2021-03-25 09:07:19] ERROR validator: Failed to update assignments error=rpc error: code = DeadlineExceeded desc = context deadline exceeded [2021-03-25 09:07:41] ERROR validator: Could not check if validators are exited error=rpc error: code = Unavailable desc = Syncing to latest head, not ready to respond

[2021-03-25 08:59:26] INFO blockchain: Finished applying state transition attestations=77 slot=819807 [2021-03-25 08:59:26] INFO initial-sync: Processing block 0x4e947bc6... 819808/819895 - estimated time remaining 14m30s blocksPerSecond=0.1 peers=47 [2021-03-25 08:59:51] INFO p2p: Peer summary activePeers=45 inbound=0 outbound=45 [2021-03-25 09:00:52] INFO blockchain: Synced new block block=0x4e947bc6... epoch=25619 finalizedEpoch=25617 finalizedRoot=0x31564de0... slot=819808 slotInEpoch=0

Prysm 1.3.4 on Windows 2016 on VPS with 4 cores & 16GB memory.

nisdas commented 3 years ago

The Syncing speed is extremely slow, I am trying to wonder if its an issue with your disk i/o. Usually syncing isn't disk i/o bound but if there is significant contention then there might be issues as you do see. Can you try running with only the beacon node running with only the beacon-chain running without any other applications running. Ex: Sqlserver, JForex, etc. @amirderakh .

issanyo commented 3 years ago

This is happening for me as well with latest version (v1.3.8-hotfix+6c0942), this is an extract of the validator log:

223e="2021-04-30 03:52:33" level=error msg="Could not submit slot signature to beacon node" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator slot=1076959
time="2021-04-30 03:52:33" level=error msg="Could not submit attestation to beacon node" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator pubKey=0x847ab144c136 slot=10769594-30 03:52:33" level=error msg="Could not report validator's rewards/penalties" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator slot=1076959
time="2021-04-30 03:52:53" level=info msg="Attestation schedule" attesterDutiesAtSlot=1 prefix=validator pubKeys=[0x847ab144c136] slot=1076968 slotInEpoch=8 totalAttestersInEpoch=1
time="2021-04-30 03:54:03" level=info msg="Submitted new attestations" AggregatorIndices=[] AttesterIndices=[7693] BeaconBlockRoot=0x3c06c342f38d CommitteeIndex=27 Slot=1076968 SourceEpoch=33654 SourceRoot=0x3e3bfd5bab74 TargetEpoch=33655 TargetRoot=0x604a8245353c prefix=validatorrectlyVotedHead=false correctlyVotedSource=false correctlyVotedTarget=false epoch=33654 inclusionDistance=18446744073709551615 inclusionSlot=18446744073709551615 newBalance=33.417738576 oldBalance=33.41776275 percentChange="-0.00007%" percentChangeSinceStart="0.01523%" prefix=validator pubKey=0x847ab144c136 startBalance=33.412649223e="2021-04-30 04:01:37" level=error msg="Could not request attestation to sign at slot" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator pubKey=0x847ab144c136 slot=107699830 04:05:06" level=info msg="Previous epoch voting summary" correctlyVotedHead=true correctlyVotedSource=true correctlyVotedTarget=true epoch=33655 inclusionDistance=1 inclusionSlot=1076969 newBalance=33.417769683 oldBalance=33.417738576 percentChange="0.00009%" percentChangeSinceStart="0.01532%" prefix=validator pubKey=0x847ab144c136 startBalance=33.4126492230%" correctlyVotedTargetPct="100%" epoch=33655 prefix=validatorl=info msg="Vote summary since launch" attestationsInclusionPct="100%" averageInclusionDistance="1.01 slots" correctlyVotedHeadPct="98%" correctlyVotedSourcePct="100%" correctlyVotedTargetPct="100%" numberOfEpochs=182 pctChangeCombinedBalance="0.01523%" prefix=validatorlidator pubKeys=[0x847ab144c136] slot=1077045 slotInEpoch=21 totalAttestersInEpoch=1
time="2021-04-30 04:09:36" level=error msg="Could not request attestation to sign at slot" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator pubKey=0x847ab144c136 slot=107704530 04:11:35" level=error msg="Could not report validator's rewards/penalties" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator slot=1077055
time="2021-04-30 04:18:11" level=error msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator
time="2021-04-30 04:18:11" level=error msg="Failed to update assignments" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator
time="2021-04-30 04:18:15" level=error msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator
time="2021-04-30 04:18:15" level=error msg="Failed to update assignments" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator
time="2021-04-30 04:18:15" level=error msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator
time="2021-04-30 04:18:15" level=error msg="Failed to update assignments" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator
time="2021-04-30 04:18:15" level=error msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator
time="2021-04-30 04:18:15" level=error msg="Failed to update assignments" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=validator
...

The beacon node is on the same machine, I do have a plenty of cpu and ram but I noticed some unsual cpu activity on the validaotr when this started happening:

image

after a restart this returned to validate correctly, but it went down for that period of time.

julian-st commented 3 years ago

I think I have a similar error, maybe once a day since upgrading to 1.3.8 I am now on the hotfix. The log for the beacon-node looks like this

time="2021-04-30 15:32:16" level=info msg="Peer summary" activePeers=81 inbound=66 outbound=15 prefix=p2p Apr 30 15:32:33 julian prysm.sh[2057]: time="2021-04-30 15:32:33" level=warning msg="Could not update head" error="context deadline exceeded" prefix=blockchain Apr 30 15:32:33 julian prysm.sh[2057]: time="2021-04-30 15:32:33" level=info msg="Synced new block" block=0x977e8b33... epoch=33764 finalizedEpoch=33762 finalizedRoot=0x7067606d... prefix=blockchain

The validator log looks like this

Apr 30 15:31:32 julian prysm.sh[2058]: time="2021-04-30 15:31:32" level=warning msg="Noo attestations to aggregate" error="rpc error: code = NotFound desc = Could not find attestation for slot and committee in po Apr 30 15:31:35 julian prysm.sh[2058]: time="2021-04-30 15:31:35" level=error msg="Could not submit attestation to beacon node" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" prefix=> Apr 30 15:32:39 julian prysm.sh[2058]: time="2021-04-30 15:32:39" level=info msg="Submitted new attestations" AggregatorIndices=[] AttesterI>

Afterwards I have the process mount.exfat that consumes 100 % cpu and basically just hogs one cpu. Maybe this is a problem with my SSD. I have it formatted with fat, but it ran without problems before 1.3.8

rauljordan commented 3 years ago

What's the status on this one @nisdas ?

sezu commented 3 years ago

What's the status on this one @nisdas ? I too was plagued with Context Deadline Exceeded issues, but they've entirely disappeared in the current version (v1.4.4). Good job! 🎊 . I haven't had time to pin point exactly what version fixed them, but feel huge relief knowing I won't miss anymore block proposals :)